Show understanding of why files are needed

10.3 Files – Why Files Are Needed

Learning Outcomes (AO1‑AO3)

  • AO1: Define key terminology – file, secondary storage, persistence, directory, file format.
  • AO2: Analyse situations where data must outlive program execution or exceed main‑memory capacity, and justify the choice of file‑based storage.
  • AO3: Design a simple file‑handling solution, write correct pseudocode (or Java/Python) including proper exception handling, and evaluate its suitability.

What Is a File?

A file is a named collection of related data held on secondary storage (hard disk, SSD, USB stick, cloud drive, etc.). Unlike a variable, which exists only in RAM while the program runs, a file remains after the computer is powered down and can be accessed by many programs or users.

Why Files Are Needed – Core Reasons

  1. Persistence – Data survives program termination and power loss.
  2. Capacity – Files can hold far more information than the limited RAM of a computer.
  3. Data Sharing – Multiple programs or users can read/write the same file, enabling collaboration and modular design.
  4. Organisation – Files are placed in directories (folders) that provide a logical hierarchy.
  5. Backup & Recovery – Files can be copied, archived, or restored to protect against loss or corruption.
  6. Interoperability – Standard file formats (CSV, JSON, XML, etc.) allow exchange between different operating systems and applications.

Variables vs. Files – Detailed Comparison

Aspect | Variables (RAM) | Files (Secondary Storage)
Lifetime | Temporary – disappears when the program ends or the computer powers down. | Persistent – remains until explicitly deleted.
Capacity | Limited by available RAM (typically a few GB). | Limited by disk space (tens or hundreds of GB, even TB).
Accessibility | Only the running program can access. | Any authorised program or user can access.
Speed of Access | Fast (nanoseconds – microseconds). | Slower (micro‑ to milliseconds); I/O overhead must be considered.
Typical Use | Intermediate calculations, temporary buffers. | Long‑term storage, data exchange, logging, configuration files.

Quantitative Example – When a File Becomes Necessary

Assume a program must process n records, each of size s bytes, and the RAM available to the program is R bytes.

M = n × s    // total memory required

if (M > R) {
    // cannot hold all data in RAM → use external storage
    store data in a file
}

Example: n = 5 000 000 records and s = 120 bytes give M = 5 000 000 × 120 = 600 000 000 bytes = 600 MB. If the computer has only R = 256 MB of free RAM, the program must keep the data in a file and read or write it a portion at a time.
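The calculation above can be checked with a short Python sketch. The function name needs_file and the use of decimal megabytes (1 MB = 1 000 000 bytes) are assumptions made for illustration only.

```python
def needs_file(n: int, s: int, ram_free: int) -> bool:
    """Return True when n records of s bytes each will not fit in ram_free bytes."""
    total = n * s  # M = n × s
    return total > ram_free


MB = 1_000_000  # decimal megabyte, assumed here to match the worked example

# Figures from the example: 5 000 000 records of 120 bytes, 256 MB free
print(5_000_000 * 120 // MB)                 # → 600
print(needs_file(5_000_000, 120, 256 * MB))  # → True
```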

Illustrative Scenario – Student‑Record System

  1. Start‑up: Open students.csv and read each line into an in‑memory list.
  2. Session: The user adds, edits or deletes records; changes are kept in RAM for fast response.
  3. Shut‑down: Overwrite students.csv with the updated list, ensuring the data persists for the next session.

This scenario demonstrates:

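  • Persistence – data is saved after the program ends.
  • Data sharing – any authorised user can open the same CSV file.
  • Volume handling – the file can grow far beyond the RAM available, although at that point the records would have to be processed in batches rather than loaded all at once.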
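The three steps of the scenario might look like this in Python. This is a minimal sketch, not a full record system: the function names load_students and save_students and the three‑field layout (ID, name, age) are hypothetical, and the demonstration runs in a temporary folder so that no real students.csv is touched.

```python
import csv
import os
import tempfile


def load_students(path):
    """Start-up: read every CSV row into an in-memory list (empty if no file yet)."""
    try:
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.reader(f))
    except FileNotFoundError:
        return []


def save_students(path, students):
    """Shut-down: overwrite the file with the current list."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(students)


# Demonstration in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "students.csv")

    students = load_students(path)             # first run: file does not exist
    students.append(["S001", "Amina", "17"])   # session: change held in RAM
    save_students(path, students)

    reloaded = load_students(path)             # next session: data has persisted
    print(reloaded)
```

Because the whole list is held in RAM, this design only suits files that fit comfortably in memory.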

File Organisation & Access Methods

Cambridge expects students to recognise the principal ways of organising and accessing a file:

  • Serial (or flat) files – records are stored one after another in the order they were added, with no ordering by key field. Simple to implement but inefficient for searching.
  • Sequential access – the program reads records in order from the start to the end. Suitable for batch processing (e.g., payroll).
  • Random (direct) access – each record can be located directly via its address or an index. Used when frequent look‑ups are required (e.g., a student‑ID database).
  • Hash‑based files – a hash function maps a key to a storage location, giving near‑constant‑time retrieval when collisions are few. Mentioned in the syllabus for advanced designs.
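The difference between sequential and direct access can be illustrated with fixed‑length binary records. The sketch below assumes a hypothetical record layout (a 4‑byte integer ID followed by a 20‑byte name) and uses Python's struct module and file seek(); it is one possible illustration, not a method prescribed by the syllabus.

```python
import os
import struct
import tempfile

# Hypothetical fixed-length record: 4-byte integer ID + 20-byte name = 24 bytes
RECORD = struct.Struct("<i20s")


def write_records(path, records):
    """Write records one after another in the order given (serial organisation)."""
    with open(path, "wb") as f:
        for student_id, name in records:
            f.write(RECORD.pack(student_id, name.encode("utf-8")))


def read_record(path, index):
    """Direct access: jump straight to record number `index` (0-based)."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)  # address = index × record length
        student_id, raw_name = RECORD.unpack(f.read(RECORD.size))
        return student_id, raw_name.rstrip(b"\x00").decode("utf-8")


def find_sequential(path, wanted_id):
    """Sequential access: read from the start until the ID is found."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD.size):
            student_id, raw_name = RECORD.unpack(chunk)
            if student_id == wanted_id:
                return student_id, raw_name.rstrip(b"\x00").decode("utf-8")
    return None


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "students.dat")
    write_records(path, [(101, "Amina"), (102, "Ben"), (103, "Chen")])
    third = read_record(path, 2)        # one seek, one read
    found = find_sequential(path, 102)  # reads records until a match
    missing = find_sequential(path, 999)
    print(third, found, missing)
```

Direct access works here only because every record has the same length, so a record's position can be calculated from its number.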

File Formats & Data Representation

Choosing the right format is part of AO2 (analysis) and AO3 (design).

  • Text files – human‑readable; delimiters (comma, tab, space) separate fields (CSV, TSV). Ideal for simple data exchange.
  • Structured text – JSON, XML – encode hierarchical data and are widely supported by web services.
  • Binary files – store data in the machine’s native binary representation; more compact and faster to read/write but not human‑readable.
  • When deciding, consider:

    • Nature of the data (numeric vs. textual)
    • Need for portability (text formats are more portable)
    • Size and performance requirements (binary often wins on large data sets)
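To make the text versus binary trade‑off concrete, the sketch below stores the same four integers both ways and compares the sizes. The sample values are made up, and the binary layout (4‑byte little‑endian integers, fixed explicitly so the file does not depend on the machine that wrote it) is an assumption for illustration.

```python
import struct

readings = [1234567, 89, 4000000, 15]  # hypothetical integer data

# Text: decimal digits separated by commas
as_text = ",".join(str(r) for r in readings).encode("utf-8")

# Binary: each value as a fixed 4-byte little-endian integer
as_binary = struct.pack(f"<{len(readings)}i", *readings)

print(len(as_text), len(as_binary))  # → 21 16

# Both forms decode back to the same values
from_text = [int(field) for field in as_text.decode("utf-8").split(",")]
from_binary = list(struct.unpack(f"<{len(readings)}i", as_binary))
```

The binary form always uses 4 bytes per value, whereas the text form grows with the number of digits, so binary tends to win for large values but can lose for very small ones.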

Security & Integrity of Files

Although the syllabus does not require full cryptography, students should be aware of basic safeguards:

  • Validation – check that the file conforms to the expected format before processing (e.g., correct number of fields, proper delimiters).
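  • Checksums / hashes – compute a checksum or hash of the file (e.g., CRC‑32, MD5, SHA‑256) when writing it and compare it on read to detect corruption. MD5 and SHA‑1 still catch accidental damage but are no longer considered safe against deliberate tampering.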
  • Permissions – set read/write permissions so that only authorised users can modify the file.
  • Backup strategy – keep a copy of critical files on a separate medium or cloud service.
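Validation and a hash check can each be sketched in a few lines of Python. The helper names file_digest and valid_line, the three‑field record layout, and the choice of SHA‑256 from the standard hashlib module are assumptions for this example.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 4 KB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(4096), b""):
            digest.update(block)
    return digest.hexdigest()


def valid_line(line, expected_fields=3):
    """Validation: a line must have the expected number of non-empty fields."""
    fields = line.strip().split(",")
    return len(fields) == expected_fields and all(fields)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "students.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("S001,Amina,17\n")

    saved = file_digest(path)               # would be stored when the file is written
    unchanged = file_digest(path) == saved  # verify on read

    with open(path, "a", encoding="utf-8") as f:
        f.write("corrupted")                # simulate damage to the file
    damaged = file_digest(path) != saved

print(unchanged, damaged)
print(valid_line("S001,Amina,17"), valid_line("S001,Amina"))
```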

File‑Handling Basics (AO3)

The pseudocode below follows the structure expected for Paper 4 and handles exceptions explicitly each time a file is opened.

# Pseudocode – open, read, write, close with proper exception handling

TRY
    OPEN file "students.txt" FOR READ AS inFile
EXCEPT IOError AS e
    DISPLAY "Error opening file for reading: " + e.message
    STOP
END TRY

WHILE NOT EOF(inFile) DO
    READ line FROM inFile
    PROCESS line    # e.g., store in a list
END WHILE

CLOSE inFile

# ... make changes to the in‑memory list ...

TRY
    OPEN file "students.txt" FOR WRITE AS outFile
EXCEPT IOError AS e
    DISPLAY "Error opening file for writing: " + e.message
    STOP
END TRY

FOR EACH record IN list DO
    WRITE record TO outFile
END FOR

CLOSE outFile

Key points to remember

  • Open the file in the correct mode (read, write, append).
  • Always check that the open operation succeeded – in most languages this is done with try/catch (Java) or try/except (Python).
  • Close the file (or use a construct that closes it automatically, such as Python's with statement or Java's try‑with‑resources) to flush buffers and release the OS handle.
  • Handle possible I/O exceptions such as file‑not‑found, permission denied, or disk full.
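One way to express these key points in Python is sketched below. The function names read_lines and write_lines are hypothetical; the with statement closes the file even if an error occurs, and OSError is the base class for file‑not‑found and permission errors in Python 3.

```python
import os
import tempfile


def read_lines(path):
    """Open for reading; return the lines, or None if the file cannot be read."""
    try:
        with open(path, "r", encoding="utf-8") as in_file:
            return [line.rstrip("\n") for line in in_file]
    except OSError as e:  # covers file-not-found and permission denied
        print(f"Error reading file: {e}")
        return None


def write_lines(path, records):
    """Open for writing (overwrites existing content); return True on success."""
    try:
        with open(path, "w", encoding="utf-8") as out_file:
            for record in records:
                out_file.write(record + "\n")
        return True
    except OSError as e:  # covers permission denied and disk full
        print(f"Error writing file: {e}")
        return False


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "students.txt")
    missing = read_lines(path)  # file not created yet, so an error is reported
    ok = write_lines(path, ["Amina", "Ben"])
    lines = read_lines(path)
```

Returning None or False instead of stopping the program lets the caller decide how to recover, for example by asking for a different file name.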

Common Pitfalls & How to Avoid Them

Pitfall | Consequence | Mitigation
Forgetting to close a file | Data may remain in the buffer, leading to data loss or a locked file. | Use a finally block, Python's with statement or Java's try‑with‑resources.
Hard‑coding file paths | Program fails on another machine or OS. | Use relative paths or configuration files; employ OS‑independent separators.
Assuming the file exists | Runtime error (FileNotFoundException). | Check existence with File.exists() or handle the exception.
Reading a large file entirely into RAM | Out‑of‑memory crash. | Process the file line‑by‑line or in blocks (streaming).
Ignoring character‑encoding issues | Corrupted text (e.g., � symbols). | Specify UTF‑8 (or required) encoding when opening.
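Several of these mitigations can be combined in one short Python sketch. The function name count_records and the file names are hypothetical; pathlib builds paths with the correct separator for the operating system, and iterating over the file object reads one line at a time.

```python
import tempfile
from pathlib import Path


def count_records(path: Path) -> int:
    """Count non-blank lines without loading the whole file into RAM."""
    if not path.exists():  # do not assume the file exists
        return 0
    count = 0
    with path.open(encoding="utf-8") as f:  # explicit encoding; closed automatically
        for line in f:                      # one line in memory at a time
            if line.strip():
                count += 1
    return count


with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "data" / "students.csv"  # OS-independent separators
    data_file.parent.mkdir()
    data_file.write_text("S001,Amina\nS002,Zoë\n\n", encoding="utf-8")
    total = count_records(data_file)
    absent = count_records(Path(folder) / "missing.csv")

print(total, absent)  # → 2 0
```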

Performance Considerations

  • Buffering – reading/writing in blocks (e.g., 4 KB buffers) reduces the number of costly disk accesses.
  • Sequential vs. random I/O – sequential access is usually faster: on a hard disk the read/write head moves less, and on any device the operating system can read ahead. Random access is justified only when the benefit of frequent look‑ups outweighs its overhead.
  • Block size – matching the block size to the underlying storage (e.g., 512 B or 4 KB) can improve throughput.
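Block‑wise processing can be sketched as a file copy that reads 4 KB at a time. The function name copy_in_blocks and the 10 000‑byte sample file are assumptions for illustration; Python's open() already buffers internally, so the explicit loop is shown mainly to make the idea visible.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # 4 KB, matching the example buffer size above


def copy_in_blocks(src, dst, block_size=BLOCK_SIZE):
    """Copy src to dst one block at a time; return the number of blocks read."""
    blocks = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while block := fin.read(block_size):
            fout.write(block)
            blocks += 1
    return blocks


with tempfile.TemporaryDirectory() as folder:
    src = os.path.join(folder, "big.dat")
    dst = os.path.join(folder, "copy.dat")
    with open(src, "wb") as f:
        f.write(b"x" * 10_000)  # 10 000 bytes of sample data

    blocks_used = copy_in_blocks(src, dst)  # 4096 + 4096 + 1808 bytes
    same_size = os.path.getsize(dst) == 10_000

print(blocks_used, same_size)  # → 3 True
```

Only one block is held in memory at any moment, so the same loop works for files much larger than the available RAM.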

Practical Checklist for Paper 4 (Programming)

  • Choose a language supported by the exam (Java, Python, or Visual Basic).
  • Set up a simple IDE or text editor with syntax highlighting.
  • Wrap all file I/O in try‑catch (Java) or try‑except (Python) blocks.
  • Test with:

    • Empty file
    • File containing the maximum expected records
    • Missing or read‑only file

  • Document:

    • Purpose of each file
    • Format (e.g., CSV – fields separated by commas)
    • Assumptions and limitations (e.g., “no field contains a comma”)

  • Provide sample input and expected output in the answer script.

Summary

  • Files give persistent, large‑capacity, shareable storage that variables cannot provide.
  • They enable organised data management, backup, and cross‑platform interoperability.
  • Understanding *why* files are needed underpins later topics such as file‑organisation techniques, format choice, security checks, and database design.

Suggested diagram (not shown): Flowchart – Start → Open file → Load data into RAM → Process → Write back → Close file → End.