Cambridge Syllabus Notes

10.3 Files – Why Files Are Needed

Learning Outcomes (AO1‑AO3)

AO1: Define key terminology – file, secondary storage, persistence, directory, file format.
AO2: Analyse situations where data must outlive program execution or exceed main‑memory capacity, and justify the choice of file‑based storage.
AO3: Design a simple file‑handling solution, write correct pseudocode (or Java/Python) including proper exception handling, and evaluate its suitability.

What Is a File?

A file is a named collection of related data stored on secondary storage (hard disk, SSD, USB stick, cloud drive, etc.). Unlike variables, which exist only in RAM while a program runs, a file remains after the computer is powered down and can be accessed by many programs or users.

Why Files Are Needed – Core Reasons

Persistence – Data survives program termination and power loss.
Capacity – Files can hold far more information than the limited RAM of a computer.
Data Sharing – Multiple programs or users can read/write the same file, enabling collaboration and modular design.
Organisation – Files are placed in directories (folders) that provide a logical hierarchy.
Backup & Recovery – Files can be copied, archived, or restored to protect against loss or corruption.
Interoperability – Standard file formats (CSV, JSON, XML, etc.) allow exchange between different operating systems and applications.

Variables vs. Files – Detailed Comparison

Aspect	Variables (RAM)	Files (Secondary Storage)
Lifetime	Temporary – disappears when the program ends or the computer powers down.	Persistent – remains until explicitly deleted.
Capacity	Limited by available RAM (typically a few GB).	Limited by disk space (tens or hundreds of GB, even TB).
Accessibility	Only the running program can access.	Any authorised program or user can access.
Speed of Access	Fast (nanoseconds – microseconds).	Slower (micro‑ to milliseconds); I/O overhead must be considered.
Typical Use	Intermediate calculations, temporary buffers.	Long‑term storage, data exchange, logging, configuration files.

Quantitative Example – When a File Becomes Necessary

Assume a program must process n records, each of size s bytes, and the RAM available to the program is R bytes.

M = n × s          // total memory required
if (M > R) {
    // cannot hold all data in RAM → use external storage
    store data in a file
}

Example: n = 5 000 000 records, s = 120 bytes → M = 600 000 000 bytes = 600 MB. If the computer has only R = 256 MB of free RAM, then M > R, so the program must keep the data in a file and read/write it in portions.
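
The check above can be sketched in Python. This is a minimal illustration, not part of the syllabus: the function name `needs_file` and the constant `MB` are assumptions introduced here, and a decimal megabyte (1 000 000 bytes) is assumed so that the figures match the example.

```python
MB = 1_000_000  # assumption: decimal megabyte, matching the 600 MB figure


def needs_file(n_records: int, record_size: int, ram_available: int) -> bool:
    """Return True when the total memory required (M = n x s) exceeds R."""
    total_bytes = n_records * record_size  # M = n x s
    return total_bytes > ram_available     # M > R means the data will not fit in RAM


# The worked example: 5 000 000 records of 120 bytes against 256 MB of free RAM
print(needs_file(5_000_000, 120, 256 * MB))
```

A real program would also leave room for its own code and working variables, so the practical limit is lower than R.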

Illustrative Scenario – Student‑Record System

Start‑up: Open students.csv and read each line into an in‑memory list.
Session: The user adds, edits or deletes records; changes are kept in RAM for fast response.
Shut‑down: Overwrite students.csv with the updated list, ensuring the data persists for the next session.

This scenario demonstrates:

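Persistence – data remains available after the program ends.
Data sharing – any authorised user can open the same CSV file.
Volume handling – the file can keep growing on disk; once it no longer fits in RAM, the program must process it in portions rather than load every record at start‑up.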
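
The start‑up and shut‑down steps of this scenario might look as follows in Python. This is a sketch under stated assumptions: the function names `load_students` and `save_students` are hypothetical, each record is treated as a plain list of strings, and a missing file is assumed to mean "first run".

```python
import csv


def load_students(path):
    """Start-up: read every row of the CSV file into an in-memory list."""
    try:
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.reader(f))
    except FileNotFoundError:
        return []  # assumption: no file yet means there are no records yet


def save_students(path, students):
    """Shut-down: overwrite the CSV file with the updated list."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(students)
```

During the session the program would change only the list returned by `load_students`, then pass that list to `save_students` when the user exits.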

File Organisation & Access Methods

Cambridge expects students to recognise the following ways of organising and accessing a file:

Serial files – records are stored one after another in the order they were added, with no sorting. Simple to implement but inefficient for searching.
Sequential access – the program reads records in order from the start to the end. Suitable for batch processing (e.g., payroll).
Random (direct) access – each record can be located directly via its address or an index. Used when frequent look‑ups are required (e.g., a student‑ID database).
Hash‑based files – a hash function maps a key to a storage location, giving near‑constant‑time retrieval when collisions are rare. Mentioned in the syllabus for advanced designs.
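
Direct access is easiest to see with fixed‑length records, where the position of record k is simply k × record size. The sketch below assumes a hypothetical record layout (4‑byte ID, 20‑byte name, 4‑byte mark); the names `RECORD`, `write_record` and `read_record` are introduced here for illustration only.

```python
import struct

# Hypothetical layout: little-endian 4-byte int, 20-byte name, 4-byte int
RECORD = struct.Struct("<i20si")


def write_record(f, position, student_id, name, mark):
    """Write one record into slot `position` (0-based) of a binary file."""
    f.seek(position * RECORD.size)  # jump to the start of the slot
    # Names shorter than 20 bytes are padded with zero bytes by struct
    f.write(RECORD.pack(student_id, name.encode("utf-8"), mark))


def read_record(f, position):
    """Read slot `position` directly, without reading the earlier records."""
    f.seek(position * RECORD.size)
    student_id, raw_name, mark = RECORD.unpack(f.read(RECORD.size))
    return student_id, raw_name.rstrip(b"\x00").decode("utf-8"), mark
```

Both functions expect a file opened in binary mode (for example `open("students.dat", "r+b")`, where the file name is again only an example). Names longer than 20 bytes would be truncated, which is a limitation of this simple layout.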

File Formats & Data Representation

Choosing the right format is part of AO2 (analysis) and AO3 (design).

Text files – human‑readable; delimiters (comma, tab, space) separate fields (CSV, TSV). Ideal for simple data exchange.
Structured text – JSON, XML – encode hierarchical data and are widely supported by web services.
Binary files – store data in the machine’s native binary representation; more compact and faster to read/write but not human‑readable.
When deciding, consider:
- Nature of the data (numeric vs. textual)
- Need for portability (text formats are more portable)
- Size and performance requirements (binary often wins on large data sets)
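
To make the trade‑off concrete, the sketch below stores the same three made‑up (ID, mark) pairs as CSV, as JSON and as packed binary, then prints the size of each in bytes. The data values and the binary layout (4‑byte integer plus 4‑byte float) are assumptions chosen for illustration; with other data the differences may be smaller or larger.

```python
import csv
import io
import json
import struct

records = [(1001, 78.5), (1002, 64.0), (1003, 91.25)]  # made-up sample data

# Text file (CSV): one line per record, fields separated by commas
csv_buffer = io.StringIO()
csv.writer(csv_buffer).writerows(records)
csv_bytes = csv_buffer.getvalue().encode("utf-8")

# Structured text (JSON): field names are repeated in every record
json_bytes = json.dumps(
    [{"id": sid, "mark": mark} for sid, mark in records]
).encode("utf-8")

# Binary: each record packed as a 4-byte int followed by a 4-byte float
binary_bytes = b"".join(struct.pack("<if", sid, mark) for sid, mark in records)

print(len(csv_bytes), len(json_bytes), len(binary_bytes))
```

The binary version is the smallest here, but it cannot be opened in a text editor and a reader must know the exact layout to decode it.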

Security & Integrity of Files

Although the syllabus does not require full cryptography, students should be aware of basic safeguards:

Validation – check that the file conforms to the expected format before processing (e.g., correct number of fields, proper delimiters).
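Checksums / hashes – compute a checksum or hash (e.g., CRC‑32, MD5, SHA‑256) when writing a file and verify it on read to detect corruption. MD5 and SHA‑1 are adequate for spotting accidental corruption but should not be relied on to detect deliberate tampering.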
Permissions – set read/write permissions so that only authorised users can modify the file.
Backup strategy – keep a copy of critical files on a separate medium or cloud service.
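
The first two safeguards can be sketched with Python's standard library. The function names `file_digest` and `is_valid_row` are hypothetical, the three‑field row format is an assumption, and SHA‑256 is used here simply as one widely available hash; any other algorithm offered by `hashlib` would be used the same way.

```python
import hashlib


def file_digest(path, block_size=4096):
    """Return the SHA-256 hex digest of a file, reading it block by block."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(block_size):
            digest.update(block)
    return digest.hexdigest()


def is_valid_row(line, expected_fields=3):
    """Basic validation: the row has the expected number of comma-separated fields."""
    return len(line.rstrip("\n").split(",")) == expected_fields
```

A program could store the digest alongside the file when writing it, then recompute and compare the digest before trusting the contents on the next run.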

File‑Handling Basics (AO3)

The pseudocode below follows the structure required for Paper 4 and makes the exception handling explicit; the same pattern maps onto try/catch in Java or try/except in Python.

# Pseudocode – open, read, write, close with proper exception handling
TRY
    OPEN file "students.txt" FOR READ AS inFile
EXCEPT IOError AS e
    DISPLAY "Error opening file for reading: " + e.message
    STOP
END TRY

WHILE NOT EOF(inFile) DO
    READ line FROM inFile
    PROCESS line               # e.g., store in a list
END WHILE

CLOSE inFile

# ... make changes to the in‑memory list ...

TRY
    OPEN file "students.txt" FOR WRITE AS outFile
EXCEPT IOError AS e
    DISPLAY "Error opening file for writing: " + e.message
    STOP
END TRY

FOR EACH record IN list DO
    WRITE record TO outFile
END FOR

CLOSE outFile
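
For comparison, here is one way the same read, modify, write sequence might be written in Python. It is a sketch rather than a model answer: the function names are hypothetical, UTF‑8 encoding is assumed, and `OSError` is caught because it covers file‑not‑found and permission errors in Python 3.

```python
def load_lines(path):
    """Read every line of the file into a list, dropping the trailing newline."""
    with open(path, "r", encoding="utf-8") as in_file:  # closed automatically
        return [line.rstrip("\n") for line in in_file]


def save_lines(path, records):
    """Overwrite the file with one record per line."""
    with open(path, "w", encoding="utf-8") as out_file:
        for record in records:
            out_file.write(record + "\n")


def rewrite_file(path):
    """Load the file, (optionally) change the list, and write it back."""
    try:
        records = load_lines(path)
    except OSError as e:
        print(f"Error opening file for reading: {e}")
        return False

    # ... make changes to the in-memory list here ...

    try:
        save_lines(path, records)
    except OSError as e:
        print(f"Error opening file for writing: {e}")
        return False
    return True
```

Calling `rewrite_file("students.txt")` mirrors the pseudocode; returning False instead of stopping the program is a design choice made here so the caller can decide what to do next.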

Key points to remember

Open the file in the correct mode (read, write, append).
Always check that the open operation succeeded – in most languages this is done with try/catch (Java) or try/except (Python).
Close the file (or use a construct that closes it automatically, such as Python's `with open(...)` statement or Java's try‑with‑resources) to flush buffers and release the OS handle.
Handle possible I/O exceptions such as file‑not‑found, permission denied, or disk full.

Common Pitfalls & How to Avoid Them

Pitfall	Consequence	Mitigation
Forgetting to close a file	Data may remain in the buffer, leading to data loss or a locked file.	Use a `finally` block, Python's `with open(...)` statement, or Java's try‑with‑resources.
Hard‑coding file paths	Program fails on another machine or OS.	Use relative paths or configuration files; employ OS‑independent separators.
Assuming the file exists	Runtime error (FileNotFoundException).	Check existence with `File.exists()` or handle the exception.
Reading a large file entirely into RAM	Out‑of‑memory crash.	Process the file line‑by‑line or in blocks (streaming).
Ignoring character‑encoding issues	Corrupted text (e.g., � symbols).	Specify UTF‑8 (or required) encoding when opening.
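
Several of these mitigations can be combined in a few lines of Python. The sketch below is illustrative only: the function name `count_records` and the default file name are assumptions, and a missing file is treated as containing no records.

```python
from pathlib import Path


def count_records(folder, filename="students.csv"):
    """Count the non-blank lines in a file, returning 0 if it does not exist."""
    path = Path(folder) / filename  # pathlib uses the correct separator for the OS
    try:
        # Explicit encoding; the with-block closes the file even if an error occurs
        with path.open(encoding="utf-8") as f:
            # Iterating over the file keeps only one line in memory at a time
            return sum(1 for line in f if line.strip())
    except FileNotFoundError:
        return 0
```

Handling the exception, rather than checking for existence first, also covers the case where the file is removed between the check and the open.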

Performance Considerations

Buffering – reading/writing in blocks (e.g., 4 KB buffers) reduces the number of costly disk accesses.
Sequential vs. random I/O – sequential access is usually faster: on a hard disk the read/write head moves less, and on most storage devices consecutive blocks can be fetched together. Random access is justified only when frequent look‑ups outweigh the overhead.
Block size – matching the block size to the underlying storage (e.g., 512 B or 4 KB) can improve throughput.
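
Reading in blocks can be sketched as follows. The function name `count_lines_in_blocks` is hypothetical and the 4 KB default simply echoes the figure above; Python's own file objects already buffer reads internally, so this mainly illustrates the idea of handling a large file one block at a time.

```python
def count_lines_in_blocks(path, block_size=4096):
    """Count newline characters by reading fixed-size blocks, not the whole file."""
    count = 0
    with open(path, "rb") as f:
        while block := f.read(block_size):  # an empty bytes object marks end of file
            count += block.count(b"\n")
    return count
```

Memory use stays at roughly one block regardless of the file size, which is the point of streaming.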

Practical Checklist for Paper 4 (Programming)

Choose a language supported by the exam (Java, Python, or Visual Basic).
Set up a simple IDE or text editor with syntax highlighting.
Wrap all file I/O in try‑catch (Java) or try‑except (Python) blocks.
Test with:
- Empty file
- File containing the maximum expected records
- Missing or read‑only file
Document:
- Purpose of each file
- Format (e.g., CSV – fields separated by commas)
- Assumptions and limitations (e.g., “no field contains a comma”)
Provide sample input and expected output in the answer script.

Summary

Files give **persistent**, **large‑capacity**, **shareable** storage that variables cannot provide.
They enable organised data management, backup, and cross‑platform interoperability.
Understanding *why* files are needed underpins later topics such as file‑organisation techniques, format choice, security checks, and database design.

Suggested diagram (not shown): Flowchart – Start → Open file → Load data into RAM → Process → Write back → Close file → End.

Show understanding of why files are needed
