Show understanding of methods of file access

13.2 File Organisation and Access

File organisation describes how records are stored on secondary storage (disk, tape, SSD, etc.). The chosen organisation determines the speed and method by which a program can retrieve, update or delete a record. The Cambridge 9618 syllabus expects students to understand four main access methods, the situations in which each is appropriate, and the practical issues such as file type, record layout, error handling and hashing details.

1 Overview of Access Methods

  • Sequential (serial) access – records are read in the order they appear on the medium.
  • Random (direct) access – the address of any record can be calculated and the record is fetched immediately.
  • Indexed access – an auxiliary index maps a key value to the physical address of a record.
  • Hash‑based access – a hash function converts a key into a location (or bucket) where the record is stored.

2 Text vs Binary Files

Two fundamental file types are used in programming:

  • Text files store data as human‑readable characters (usually ASCII or UTF‑8).
    • Each record is typically terminated by a newline character.
    • I/O functions such as readline() or gets() are used.
    • Variable‑length records are natural, but random access is difficult because the byte offset of a record is not known in advance.
  • Binary files store data as raw bytes.
    • Records have a defined layout (fixed‑length or a length field).
    • Functions such as read(), write() or fseek() allow direct positioning.
    • More compact and faster I/O, but data is not directly readable without a program.

The choice of text or binary influences the access method that can be used efficiently and the way errors (e.g., end‑of‑file, read‑fail) must be handled.

3 Sequential (Serial) Access

Typical operations

  1. Open the file (check for success).
  2. Read the first record.
  3. Read the next record repeatedly until the required record is found or EOF is reached.
  4. Append new records at the end of the file.
  5. Close the file.

Advantages

  • Works with both fixed‑ and variable‑length records (especially text files).
  • No extra storage overhead.
  • Very simple to implement in any language.

Disadvantages

  • Finding a particular record may require scanning up to n records → O(n) time.
  • Updating or deleting a record in the middle of a large file can be costly because surrounding data may need to be shifted or a new file must be created.

Typical error handling

  • Check the return value of the open call – abort or request a retry if the file cannot be opened.
  • After each read, test for EOF or a read‑error flag.
  • When writing, verify that the number of bytes written matches the expected record size.

Example (pseudo‑code)

OPEN file FOR READ
IF NOT SUCCESS THEN
    DISPLAY "Cannot open file"
    STOP
END IF

WHILE NOT EOF(file) DO
    READ record
    IF READ_ERROR THEN
        DISPLAY "Read error – abort"
        STOP
    END IF
    IF record.id = targetID THEN
        DISPLAY record
        EXIT LOOP
    END IF
END WHILE

CLOSE file

4 Random (Direct) Access

Direct access is possible when every record occupies the same number of bytes (fixed‑length) and the file is opened in binary mode.

Address calculation

\[ \text{byte\_address} = \text{base\_address} + r \times \text{record\_size} \]

where r is the zero‑based record number.

Advantages

  • Immediate retrieval of any record → O(1) time.
  • Ideal for applications that frequently look up records by their position (e.g., random‑access logs).

Disadvantages

  • Only works with fixed‑length records; variable‑length records require an index.
  • Insertion or deletion usually requires moving large blocks of data to keep records contiguous.

Typical error handling

  • Validate that the calculated address is within the file size before seeking.
  • Check the return status of seek and read operations.
  • Handle “record not found” when the calculated position lies beyond the last record.

Example (pseudo‑code)

recordSize = 128                     // bytes
target = 57                          // 58th record (0‑based)
address = baseAddress + target*recordSize

IF NOT SEEK(file, address) THEN
    DISPLAY "Seek failed – invalid record number"
    STOP
END IF

READ record
IF READ_ERROR THEN
    DISPLAY "Read error"
    STOP
END IF

DISPLAY record

5 Indexed Access

An index is a separate, ordered structure that stores pairs (key, address). The index is searched first; the resulting address is then used for a direct read of the actual record.

Types of indexes

  • Primary index – built on a unique key (e.g., student number). Each key appears once.
  • Secondary index – built on a non‑unique attribute (e.g., surname). May point to a list of matching addresses.

Typical steps

  1. Search the index for the required key (usually binary search).
  2. Obtain the disk address (or block number) of the record.
  3. Perform a direct read of that address.

Advantages

  • Fast look‑up even for very large files – O(log m) where m is the number of index entries.
  • Supports variable‑length records because the index stores the exact address.
  • Multiple indexes can be created for different keys, allowing flexible queries.

Disadvantages

  • Extra storage is required for the index (typically 5‑10 % of the data file).
  • The index must be kept synchronised with the data file after inserts, deletes or updates.

Typical error handling

  • Check that the index file opens successfully.
  • Detect “key not found” after the binary search and report it gracefully.
  • Validate the address returned by the index before seeking the data file.

Example (pseudo‑code)

# index is a sorted array of (key, address) pairs
pos = binarySearch(index, targetKey)

IF pos = NOT_FOUND THEN
    DISPLAY "Record not found"
    STOP
END IF

address = index[pos].address
IF NOT SEEK(dataFile, address) THEN
    DISPLAY "Invalid address in index"
    STOP
END IF

READ record
IF READ_ERROR THEN
    DISPLAY "Read error"
    STOP
END IF

DISPLAY record

6 Hash‑Based Access

Hashing converts a search key into a location using a hash function. The location is usually a bucket in a hash table stored on disk.

Key concepts

  • Hash function h(k) – maps key k to an integer in the range 0 … (B‑1), where B is the number of buckets.
  • Collision handling – two keys may map to the same bucket. Common strategies are:
    • Separate chaining – each bucket holds a linked list (or an array) of entries.
    • Open addressing – the record is placed in another bucket found by probing (linear, quadratic or double hashing).
  • Load factor (α) – the ratio n / B (records per bucket).
    • Typical desirable range: 0.5 ≤ α ≤ 0.75.
    • If α grows too large, performance degrades and a rehash (create a larger table and re‑insert all records) is required.

Separate chaining – example (pseudo‑code)

function hash(key):
    return (key MOD bucketCount)

bucket = hash(targetKey)

FOR each entry IN bucketList[bucket] DO
    IF entry.key = targetKey THEN
        SEEK dataFile TO entry.address
        READ record
        DISPLAY record
        EXIT
    END IF
END FOR

DISPLAY "Record not found"

Open addressing – linear probing (pseudo‑code)

function hash(key):
    return (key MOD bucketCount)

function probe(key, i):
    return (hash(key) + i) MOD bucketCount   # linear probing

i = 0
bucket = probe(targetKey, i)

WHILE bucket NOT EMPTY DO
    IF bucket.key = targetKey THEN
        SEEK dataFile TO bucket.address
        READ record
        DISPLAY record
        EXIT
    END IF
    i = i + 1
    bucket = probe(targetKey, i)
END WHILE

DISPLAY "Record not found"

Advantages

  • Average‑case look‑up time close to O(1) when the load factor is kept low.
  • Works with variable‑length records (the bucket stores the address of each record).
  • Insertion is fast – only the appropriate bucket (or probe sequence) is examined.

Disadvantages

  • Performance degrades as the table fills; re‑hashing is expensive.
  • Extra storage for the hash table and any overflow structures (chains or probe metadata).
  • Choosing a good hash function is critical; a poor function leads to many collisions.

Typical error handling

  • Validate that the hash function returns a value within 0 … B‑1.
  • Detect a full table (no empty bucket found after B probes) and trigger a re‑hash.
  • Check for I/O errors when seeking to the address stored in a bucket.

7 Comparison of Access Methods

Method Typical use‑case Search complexity Record‑size requirement Storage overhead
Sequential Full‑file scans (e.g., payroll, batch reports) O(n) Fixed or variable (text or binary) None
Random (Direct) Look‑ups by record number or offset O(1) Fixed length (binary) None
Indexed Searches by key value (e.g., customer ID, employee number) O(log m) Fixed or variable Index blocks ≈ 5‑10 % of data
Hash‑based Very fast look‑ups where a good hash function exists; frequent inserts ≈ O(1) average (depends on load factor) Fixed or variable Hash table + possible overflow structures

8 Choosing an Access Method – Checklist

When a problem statement is given, evaluate the following characteristics and match them to the most suitable method.

CharacteristicRecommended method(s)
All records must be processed (e.g., generate a report) Sequential access (text or binary)
Records are of identical size and are addressed by position (e.g., log entry number) Random (direct) access
Frequent look‑ups by a unique key; records may be variable length Primary index (or secondary index if non‑unique)
Look‑ups by a non‑unique attribute with many matches (e.g., surname) Secondary index (points to a list) or hash with chaining
Very high insert/delete rate; keys are known and a good hash function exists Hash‑based access (choose open addressing for low‑memory, chaining for simplicity)
File must be portable across platforms and human‑readable Text file + sequential access (or index stored separately as a text file)
Storage space is at a premium Direct access (no extra structures) or carefully sized hash table

9 Common Error‑Handling Practices for File I/O

  • Open – always test the return code; if it fails, display a clear message and abort or request an alternative file.
  • Read / Write – compare the number of bytes actually transferred with the expected size; on mismatch, treat it as an error.
  • Seek / Position – verify that the calculated address is within 0 … fileSize‑1. If seek fails, re‑initialise the file pointer or rebuild the index/hash.
  • End‑of‑File – distinguish between a legitimate EOF (normal termination of a sequential scan) and an unexpected early EOF (corrupt file).
  • Exception handling – in languages that support exceptions (Java, Python), catch IOError, FileNotFoundException etc., and provide a user‑friendly fallback.

10 Link‑Forward to Related Syllabus Topics

13.1 User‑Defined Data Types
Custom types (enumerated, record/structure, class) let programmers group related fields – essential when designing the record layout for any of the access methods above. See the “User‑Defined Data Types” notes for examples in Python, Java and Pseudocode.
13.3 Floating‑Point Representation
Understanding IEEE‑754 binary floating‑point is required for AO2 questions that involve numerical data stored in files. The representation influences rounding error and storage size (single‑ vs double‑precision). Refer to the “Floating‑Point Representation” section for details.
12.3 Program Testing & Maintenance
After choosing an access method, appropriate testing (unit, integration, black‑box) must be planned. A simple test plan for a file‑handling module is provided in the “Testing & Maintenance” notes.

11 Suggested Diagram (for classroom use)

Flowchart showing Sequential → read next, Direct → calculate address, Indexed → search index then read, Hash → compute bucket then follow chain
Flowchart summarising the four file‑access techniques and the typical decision points.

Create an account or Login to take a Quiz

78 views
0 improvement suggestions

Log in to suggest improvements to this note.