Show understanding of the limitations of using a file-based approach for the storage and retrieval of data

Cambridge IGCSE/A‑Level Computer Science – Core Concepts

8.1 Database Concepts – Limitations of a File‑Based Approach

Why many programmes start with files but soon need a DBMS

  • Each application creates its own files – simple for tiny, single‑user programmes.
  • As the amount of data, the number of programmes and the number of users grow, the drawbacks become serious.

Key Limitations

LimitationTypical Impact on an Application
Data redundancy & inconsistencySame information stored in several files. Updating one copy does not automatically update the others → contradictory records.
Limited data accessFiles are usually accessed sequentially. Finding a particular record often requires scanning the whole file (≈ O(n) reads).
Lack of data independenceAny change to the file layout (e.g., adding a field) forces a change in every programme that reads or writes the file.
Weak integrity controlNo built‑in primary‑key, foreign‑key or domain constraints. The programmer must write extra code to prevent illegal data.
Concurrency problemsWhen several users edit the same file at the same time, updates can be lost or the file can become corrupted; file‑system locks are coarse‑grained.
Coarse security & access controlPermissions are applied to whole files only – you cannot hide individual records from a particular user.
Poor scalability & performanceSearches become slower as the file grows; backup, restore and recovery are labour‑intensive.
No standard query languageEvery query requires custom code; there is no declarative language such as SQL to retrieve, join or aggregate data.

Mathematical View of Search Cost

For a file containing n records:

  • Sequential search (no index): average reads = (n + 1) / 2 → O(n).
  • With a balanced index (e.g., B‑tree of branching factor b): average reads ≈ logb n → O(log n).

Illustrative Example – Student Management System

students.txt – personal details (ID, name, address, DOB)

enrolments.txt – course enrolments (StudentID, CourseCode, Semester)

If a student moves house, the address must be edited in students.txt. Because the address is also duplicated in enrolments.txt (for reporting purposes), forgetting to update the second file creates an inconsistency that can cause wrong mailing lists or inaccurate fee calculations.

Real‑World Consequences

  1. Inconsistent reports – financial statements derived from duplicated transaction files may not match.
  2. Data loss – corruption of a single file can erase all copies of a particular data item.
  3. High maintenance cost – any change to the file format forces a rewrite of every dependent programme.
  4. Limited decision support – without a query language, ad‑hoc analysis is slow and error‑prone.

Comparison with a DBMS

AspectFile‑Based ApproachDatabase Management System (DBMS)
RedundancyHigh – duplicate records in many filesLow – data stored once, referenced by keys
Access speedLinear scan (O(n))Indexed access (O(log n)) + query optimiser
Data integrityManual checks in codeBuilt‑in constraints (PK, FK, CHECK, NOT NULL)
ConcurrencyCoarse file locks; prone to lost updatesTransaction management (ACID), row‑level locking
ScalabilityPoor – performance degrades quicklyGood – handles large volumes & many simultaneous users
SecurityFile‑level permissions onlyGranular privileges (user, role, column)
Query capabilityCustom code per querySQL – powerful, declarative, portable
Backup & recoveryManual copy of each fileAutomated transaction logs, point‑in‑time restore

Suggested Classroom Diagram (description)

Figure 1 – Left: several independent files (students.txt, enrolments.txt, grades.txt) accessed directly by three programmes. Right: a single logical database managed by a DBMS; the same three programmes send SQL statements to the DBMS layer, which handles storage, indexing and concurrency.


1 Information Representation

1.1 Binary, Hexadecimal & Binary Prefixes

PrefixSymbolDecimal factorBinary factor (IEC)
kilok10³ = 1 000kibi (Ki) = 2¹⁰ = 1 024
megaM10⁶ = 1 000 000mebi (Mi) = 2²⁰ = 1 048 576
gigaG10⁹ = 1 000 000 000gibi (Gi) = 2³⁰ = 1 073 741 824

1.2 Text Representation

  • ASCII – 7 bits per character (128 symbols). Suitable for basic English text.
  • Unicode (UTF‑8) – variable length (1–4 bytes) covering > 1 million characters; backward compatible with ASCII.

1.3 Multimedia – Bitmap vs. Vector

AspectBitmap (Raster)Vector
Storage modelPixel‑by‑pixel array (width × height × bits‑per‑pixel)Mathematical primitives (lines, curves, shapes)
ScalabilityPoor – enlargement causes pixelationExcellent – resolution‑independent
Typical formatsBMP, PNG, JPEG, GIFSVG, EPS, PDF
Typical use‑casesPhotographs, detailed imagesLogos, icons, CAD drawings

Worked Example – Bitmap Size

Calculate the size of an 800 × 600 colour image stored as a 24‑bit bitmap (no compression).

Width × Height × Bits‑per‑pixel = 800 × 600 × 24 = 11 520 000 bits

Bytes = 11 520 000 ÷ 8 = 1 440 000 bytes

KiB = 1 440 000 ÷ 1 024 ≈ 1 406 KiB

MiB = 1 406 ÷ 1 024 ≈ 1.37 MiB

1.4 Compression Techniques

  • Run‑Length Encoding (RLE) – lossless; replaces consecutive identical symbols with a count.
  • Huffman coding – lossless; variable‑length codes based on symbol frequencies (used in ZIP, PNG).
  • JPEG – lossy image compression; uses Discrete Cosine Transform and quantisation.
  • MP3 / AAC – lossy audio compression; exploits psychoacoustic masking.

Exercise: Given the character frequencies a:40, b:30, c:20, d:10, draw the Huffman tree and compute the average bits per character.


2 Communication & Networks

2.1 Network Types & Topologies

  • LAN – Local Area Network (e.g., Ethernet, Wi‑Fi) covering a single building or campus.
  • WAN – Wide Area Network (e.g., the Internet) spanning cities or continents.
  • Topologies – star, bus, ring, mesh, hybrid; affect cabling, fault tolerance and performance.

2.2 IP Addressing & Subnetting

IPv4 example: 192.168.10.45/24

  1. Subnet mask for /24 = 255.255.255.0.
  2. Network address = 192.168.10.0.
  3. Broadcast address = 192.168.10.255.
  4. Usable host range = 192.168.10.1 – 192.168.10.254.

IPv6 – 128‑bit addresses written in hexadecimal groups, e.g., 2001:0db8:85a3:0000:0000:8a2e:0370:7334. No need for NAT; subnetting is expressed by a prefix length (commonly /64).

2.3 Client‑Server vs. Peer‑to‑Peer (P2P)

CharacteristicClient‑ServerPeer‑to‑Peer
ControlCentralised serversEach node can act as both client and server
ScalabilityAdd more servers or upgrade hardwareScales naturally as more peers join
Typical examplesWeb browsing, email, online bankingBitTorrent, early Skype, file‑sharing networks

2.4 OSI vs. TCP/IP Model

TCP/IP LayerCorresponding OSI Layer(s)Typical Protocols
ApplicationApplication, Presentation, SessionHTTP, HTTPS, SMTP, FTP, DNS
TransportTransportTCP, UDP
InternetNetworkIP, ICMP, ARP
LinkData Link, PhysicalEthernet (IEEE 802.3), Wi‑Fi (IEEE 802.11)

2.5 Cloud Computing – Short Case Study

Scenario: A school wants to host its Learning Management System (LMS).

  • SaaS (Software‑as‑a‑Service) – Use a ready‑made LMS such as Google Classroom. The provider manages hardware, OS, middleware and the application. The school pays a subscription and focuses on content creation.
  • IaaS (Infrastructure‑as‑a‑Service) – Rent virtual machines from AWS, Azure or Google Cloud, install a custom LMS (e.g., Moodle). The school controls the OS and software but does not maintain physical servers.

Discussion points for students: cost, maintenance responsibility, data ownership, scalability, and security implications.


3 Hardware & Logic

3.1 CPU Architecture & the Fetch‑Decode‑Execute Cycle

  1. Fetch – Read the next instruction from memory address held in the Program Counter (PC).
  2. Decode – Control Unit interprets the opcode and determines required operands.
  3. Execute – Arithmetic Logic Unit (ALU) performs the operation; registers are read/written.
  4. Store result – Write the outcome back to a register or memory location.
  5. Update PC (increment or branch) and repeat.

3.2 Pipelining

A pipeline overlaps the stages of the fetch‑decode‑execute cycle so that multiple instructions are processed simultaneously.

InstructionWithout PipelineWith 5‑stage Pipeline (F‑D‑E‑M‑W)
1F‑D‑E‑M‑WF
2— — — — —D
3— — — — —E
4— — — — —M
5— — — — —W

After the pipeline is filled, one instruction completes each clock cycle (ideal CPI = 1), dramatically increasing throughput.

3.3 Parallel Processing Classification (Flynn’s Taxonomy)

  • SISD – Single Instruction, Single Data (classic sequential processor).
  • SIMD – Single Instruction, Multiple Data (e.g., graphics processors, vector units).
  • MISD – Multiple Instruction, Single Data (rare, used in some fault‑tolerant systems).
  • MIMD – Multiple Instruction, Multiple Data (multi‑core CPUs, clusters).

3.4 Logic Gates & Boolean Algebra

GateSymbolTruth Table
AND&A B | Y
0 0 → 0
0 1 → 0
1 0 → 0
1 1 → 1
ORA B | Y
0 0 → 0
0 1 → 1
1 0 → 1
1 1 → 1
NOT¬A | Y
0 → 1
1 → 0
NANDInverse of AND
NORInverse of OR
XORY = 1 when A ≠ B

Students should be able to combine gates to implement Boolean expressions, simplify using De Morgan’s laws, and draw corresponding circuit diagrams.


4 Processor Fundamentals (Optional Extension)

4.1 Interrupts & Exception Handling

  • An interrupt signals the CPU to suspend the current programme, save its state, and execute an interrupt‑service routine (ISR).
  • Types: hardware (e.g., I/O ready) and software (e.g., divide‑by‑zero).
  • Ensures responsive I/O and enables multitasking.

4.2 Memory Hierarchy

  • Registers – fastest, smallest.
  • Cache (L1, L2, L3) – stores recently used data/instructions.
  • Main memory (RAM) – larger but slower.
  • Secondary storage (HDD/SSD) – persistent, much slower.

4.3 Virtual Memory & Paging

  • Logical address space is mapped onto physical memory using page tables.
  • Allows programmes to use more memory than physically available and provides isolation between processes.


Summary

File‑based storage is simple to implement but suffers from redundancy, poor access speed, weak integrity, limited concurrency, coarse security and a lack of powerful query capabilities. These limitations motivate the use of a Database Management System, which provides data independence, indexing, transaction control, granular security and a standard language (SQL) for data manipulation. Understanding these trade‑offs is essential for designing robust, scalable information systems that meet the requirements of the Cambridge IGCSE/A‑Level Computer Science syllabus.