Cambridge Syllabus Notes

13.2 File Organisation and Access

File organisation describes how records are stored on secondary storage (disk, tape, SSD, etc.). The chosen organisation determines the speed and method by which a program can retrieve, update or delete a record. The Cambridge 9618 syllabus expects students to understand four main access methods, the situations in which each is appropriate, and the practical issues such as file type, record layout, error handling and hashing details.

1 Overview of Access Methods

Sequential (serial) access – records are read in the order they appear on the medium.
Random (direct) access – the address of any record can be calculated and the record is fetched immediately.
Indexed access – an auxiliary index maps a key value to the physical address of a record.
Hash‑based access – a hash function converts a key into a location (or bucket) where the record is stored.

2 Text vs Binary Files

Two fundamental file types are used in programming:

Text files store data as human‑readable characters (usually ASCII or UTF‑8).
- Each record is typically terminated by a newline character.
- I/O functions such as readline() or gets() are used.
- Variable‑length records are natural, but random access is difficult because the byte offset of a record is not known in advance.
Binary files store data as raw bytes.
- Records have a defined layout (fixed‑length or a length field).
- Functions such as read(), write() or fseek() allow direct positioning.
- More compact and faster I/O, but data is not directly readable without a program.

The choice of text or binary influences the access method that can be used efficiently and the way errors (e.g., end‑of‑file, read‑fail) must be handled.

3 Sequential (Serial) Access

Typical operations

Open the file (check for success).
Read the first record.
Read the next record repeatedly until the required record is found or EOF is reached.
Append new records at the end of the file.
Close the file.

Advantages

Works with both fixed‑ and variable‑length records (especially text files).
No extra storage overhead.
Very simple to implement in any language.

Disadvantages

Finding a particular record may require scanning up to n records → O(n) time.
Updating or deleting a record in the middle of a large file can be costly because surrounding data may need to be shifted or a new file must be created.

Typical error handling

Check the return value of the open call – abort or request a retry if the file cannot be opened.
After each read, test for EOF or a read‑error flag.
When writing, verify that the number of bytes written matches the expected record size.

Example (pseudo‑code)

OPEN file FOR READ
IF NOT SUCCESS THEN
    DISPLAY "Cannot open file"
    STOP
END IF

WHILE NOT EOF(file) DO
    READ record
    IF READ_ERROR THEN
        DISPLAY "Read error – abort"
        STOP
    END IF
    IF record.id = targetID THEN
        DISPLAY record
        EXIT LOOP
    END IF
END WHILE

CLOSE file

4 Random (Direct) Access

Direct access is possible when every record occupies the same number of bytes (fixed‑length) and the file is opened in binary mode.

Address calculation

\[ \text{byte\_address} = \text{base\_address} + r \times \text{record\_size} \]

where r is the zero‑based record number.

Advantages

Immediate retrieval of any record → O(1) time.
Ideal for applications that frequently look up records by their position (e.g., random‑access logs).

Disadvantages

Only works with fixed‑length records; variable‑length records require an index.
Insertion or deletion usually requires moving large blocks of data to keep records contiguous.

Typical error handling

Validate that the calculated address is within the file size before seeking.
Check the return status of seek and read operations.
Handle “record not found” when the calculated position lies beyond the last record.

Example (pseudo‑code)

recordSize = 128                     // bytes
target = 57                          // 58th record (0‑based)
address = baseAddress + target*recordSize

IF NOT SEEK(file, address) THEN
    DISPLAY "Seek failed – invalid record number"
    STOP
END IF

READ record
IF READ_ERROR THEN
    DISPLAY "Read error"
    STOP
END IF

DISPLAY record

5 Indexed Access

An index is a separate, ordered structure that stores pairs (key, address). The index is searched first; the resulting address is then used for a direct read of the actual record.

Types of indexes

Primary index – built on a unique key (e.g., student number). Each key appears once.
Secondary index – built on a non‑unique attribute (e.g., surname). May point to a list of matching addresses.

Typical steps

Search the index for the required key (usually binary search).
Obtain the disk address (or block number) of the record.
Perform a direct read of that address.

Advantages

Fast look‑up even for very large files – O(log m) where m is the number of index entries.
Supports variable‑length records because the index stores the exact address.
Multiple indexes can be created for different keys, allowing flexible queries.

Disadvantages

Extra storage is required for the index (typically 5‑10 % of the data file).
The index must be kept synchronised with the data file after inserts, deletes or updates.

Typical error handling

Check that the index file opens successfully.
Detect “key not found” after the binary search and report it gracefully.
Validate the address returned by the index before seeking the data file.

Example (pseudo‑code)

# index is a sorted array of (key, address) pairs
pos = binarySearch(index, targetKey)

IF pos = NOT_FOUND THEN
    DISPLAY "Record not found"
    STOP
END IF

address = index[pos].address
IF NOT SEEK(dataFile, address) THEN
    DISPLAY "Invalid address in index"
    STOP
END IF

READ record
IF READ_ERROR THEN
    DISPLAY "Read error"
    STOP
END IF

DISPLAY record

6 Hash‑Based Access

Hashing converts a search key into a location using a hash function. The location is usually a bucket in a hash table stored on disk.

Key concepts

Hash function h(k) – maps key k to an integer in the range 0 … (B‑1), where B is the number of buckets.
Collision handling – two keys may map to the same bucket. Common strategies are:
- Separate chaining – each bucket holds a linked list (or an array) of entries.
- Open addressing – the record is placed in another bucket found by probing (linear, quadratic or double hashing).
Load factor (α) – the ratio n / B (records per bucket).
- Typical desirable range: 0.5 ≤ α ≤ 0.75.
- If α grows too large, performance degrades and a rehash (create a larger table and re‑insert all records) is required.

Separate chaining – example (pseudo‑code)

function hash(key):
    return (key MOD bucketCount)

bucket = hash(targetKey)

FOR each entry IN bucketList[bucket] DO
    IF entry.key = targetKey THEN
        SEEK dataFile TO entry.address
        READ record
        DISPLAY record
        EXIT
    END IF
END FOR

DISPLAY "Record not found"

Open addressing – linear probing (pseudo‑code)

function hash(key):
    return (key MOD bucketCount)

function probe(key, i):
    return (hash(key) + i) MOD bucketCount   # linear probing

i = 0
bucket = probe(targetKey, i)

WHILE bucket NOT EMPTY DO
    IF bucket.key = targetKey THEN
        SEEK dataFile TO bucket.address
        READ record
        DISPLAY record
        EXIT
    END IF
    i = i + 1
    bucket = probe(targetKey, i)
END WHILE

DISPLAY "Record not found"

Advantages

Average‑case look‑up time close to O(1) when the load factor is kept low.
Works with variable‑length records (the bucket stores the address of each record).
Insertion is fast – only the appropriate bucket (or probe sequence) is examined.

Disadvantages

Performance degrades as the table fills; re‑hashing is expensive.
Extra storage for the hash table and any overflow structures (chains or probe metadata).
Choosing a good hash function is critical; a poor function leads to many collisions.

Typical error handling

Validate that the hash function returns a value within 0 … B‑1.
Detect a full table (no empty bucket found after B probes) and trigger a re‑hash.
Check for I/O errors when seeking to the address stored in a bucket.

7 Comparison of Access Methods

Method	Typical use‑case	Search complexity	Record‑size requirement	Storage overhead
Sequential	Full‑file scans (e.g., payroll, batch reports)	O(n)	Fixed or variable (text or binary)	None
Random (Direct)	Look‑ups by record number or offset	O(1)	Fixed length (binary)	None
Indexed	Searches by key value (e.g., customer ID, employee number)	O(log m)	Fixed or variable	Index blocks ≈ 5‑10 % of data
Hash‑based	Very fast look‑ups where a good hash function exists; frequent inserts	≈ O(1) average (depends on load factor)	Fixed or variable	Hash table + possible overflow structures

8 Choosing an Access Method – Checklist

When a problem statement is given, evaluate the following characteristics and match them to the most suitable method.

Characteristic	Recommended method(s)
All records must be processed (e.g., generate a report)	Sequential access (text or binary)
Records are of identical size and are addressed by position (e.g., log entry number)	Random (direct) access
Frequent look‑ups by a unique key; records may be variable length	Primary index (or secondary index if non‑unique)
Look‑ups by a non‑unique attribute with many matches (e.g., surname)	Secondary index (points to a list) or hash with chaining
Very high insert/delete rate; keys are known and a good hash function exists	Hash‑based access (choose open addressing for low‑memory, chaining for simplicity)
File must be portable across platforms and human‑readable	Text file + sequential access (or index stored separately as a text file)
Storage space is at a premium	Direct access (no extra structures) or carefully sized hash table

9 Common Error‑Handling Practices for File I/O

Open – always test the return code; if it fails, display a clear message and abort or request an alternative file.
Read / Write – compare the number of bytes actually transferred with the expected size; on mismatch, treat it as an error.
Seek / Position – verify that the calculated address is within 0 … fileSize‑1. If seek fails, re‑initialise the file pointer or rebuild the index/hash.
End‑of‑File – distinguish between a legitimate EOF (normal termination of a sequential scan) and an unexpected early EOF (corrupt file).
Exception handling – in languages that support exceptions (Java, Python), catch IOError, FileNotFoundException etc., and provide a user‑friendly fallback.

10 Link‑Forward to Related Syllabus Topics

13.1 User‑Defined Data Types
Custom types (enumerated, record/structure, class) let programmers group related fields – essential when designing the record layout for any of the access methods above. See the “User‑Defined Data Types” notes for examples in Python, Java and Pseudocode.

13.3 Floating‑Point Representation
Understanding IEEE‑754 binary floating‑point is required for AO2 questions that involve numerical data stored in files. The representation influences rounding error and storage size (single‑ vs double‑precision). Refer to the “Floating‑Point Representation” section for details.

12.3 Program Testing & Maintenance
After choosing an access method, appropriate testing (unit, integration, black‑box) must be planned. A simple test plan for a file‑handling module is provided in the “Testing & Maintenance” notes.

Show understanding of methods of file access

13.2 File Organisation and Access

1 Overview of Access Methods

2 Text vs Binary Files

3 Sequential (Serial) Access

Typical operations

Advantages

Disadvantages

Typical error handling

Example (pseudo‑code)

4 Random (Direct) Access

Address calculation

Advantages

Disadvantages

Typical error handling

Example (pseudo‑code)

5 Indexed Access

Types of indexes

Typical steps

Advantages

Disadvantages

Typical error handling

Example (pseudo‑code)

6 Hash‑Based Access

Key concepts

Separate chaining – example (pseudo‑code)

Open addressing – linear probing (pseudo‑code)

Advantages

Disadvantages

Typical error handling

7 Comparison of Access Methods

8 Choosing an Access Method – Checklist

9 Common Error‑Handling Practices for File I/O

10 Link‑Forward to Related Syllabus Topics

11 Suggested Diagram (for classroom use)

Show understanding of methods of file access

13.2 File Organisation and Access

1 Overview of Access Methods

2 Text vs Binary Files

3 Sequential (Serial) Access

Typical operations

Advantages

Disadvantages

Typical error handling

Example (pseudo‑code)

4 Random (Direct) Access

Address calculation

Advantages

Disadvantages

Typical error handling

Example (pseudo‑code)

5 Indexed Access

Types of indexes

Typical steps

Advantages

Disadvantages

Typical error handling

Example (pseudo‑code)

6 Hash‑Based Access

Key concepts

Separate chaining – example (pseudo‑code)

Open addressing – linear probing (pseudo‑code)

Advantages

Disadvantages

Typical error handling

7 Comparison of Access Methods

8 Choosing an Access Method – Checklist

9 Common Error‑Handling Practices for File I/O

10 Link‑Forward to Related Syllabus Topics

11 Suggested Diagram (for classroom use)

13.2 File Organisation and Access

1 Overview of Access Methods

2 Text vs Binary Files

3 Sequential (Serial) Access

4 Random (Direct) Access

5 Indexed Access

6 Hash‑Based Access

7 Comparison of Access Methods

8 Choosing an Access Method – Checklist

9 Common Error‑Handling Practices for File I/O

10 Link‑Forward to Related Syllabus Topics

11 Suggested Diagram (for classroom use)