Show understanding of methods of file access

13.2 File Organisation and Access

File organisation describes how records are stored on secondary storage (disk, tape, SSD, etc.).

The chosen organisation determines the speed and method by which a program can retrieve, update or delete a record.

The Cambridge 9618 syllabus expects students to understand four main access methods, the situations in which each is appropriate, and the practical issues such as file type, record layout, error handling and hashing details.

1 Overview of Access Methods

  • Sequential (serial) access – records are read in the order they appear on the medium.
  • Random (direct) access – the address of any record can be calculated and the record is fetched immediately.
  • Indexed access – an auxiliary index maps a key value to the physical address of a record.
  • Hash‑based access – a hash function converts a key into a location (or bucket) where the record is stored.

2 Text vs Binary Files

Two fundamental file types are used in programming:

  • Text files store data as human‑readable characters (usually ASCII or UTF‑8).

    • Each record is typically terminated by a newline character.
    • I/O functions such as readline() or gets() are used.
    • Variable‑length records are natural, but random access is difficult because the byte offset of a record is not known in advance.

  • Binary files store data as raw bytes.

    • Records have a defined layout (fixed‑length or a length field).
    • Functions such as read(), write() or fseek() allow direct positioning.
    • More compact and faster I/O, but data is not directly readable without a program.

The choice of text or binary influences the access method that can be used efficiently and the way errors (e.g., end‑of‑file, read‑fail) must be handled.

3 Sequential (Serial) Access

Typical operations

  1. Open the file (check for success).
  2. Read the first record.
  3. Read the next record repeatedly until the required record is found or EOF is reached.
  4. Append new records at the end of the file.
  5. Close the file.

Advantages

  • Works with both fixed‑ and variable‑length records (especially text files).
  • No extra storage overhead.
  • Very simple to implement in any language.

Disadvantages

  • Finding a particular record may require scanning up to n records → O(n) time.
  • Updating or deleting a record in the middle of a large file can be costly because surrounding data may need to be shifted or a new file must be created.

Typical error handling

  • Check the return value of the open call – abort or request a retry if the file cannot be opened.
  • After each read, test for EOF or a read‑error flag.
  • When writing, verify that the number of bytes written matches the expected record size.

Example (pseudo‑code)

OPEN file FOR READ

IF NOT SUCCESS THEN

DISPLAY "Cannot open file"

STOP

END IF

WHILE NOT EOF(file) DO

READ record

IF READ_ERROR THEN

DISPLAY "Read error – abort"

STOP

END IF

IF record.id = targetID THEN

DISPLAY record

EXIT LOOP

END IF

END WHILE

CLOSE file

4 Random (Direct) Access

Direct access is possible when every record occupies the same number of bytes (fixed‑length) and the file is opened in binary mode.

Address calculation

\[

\text{byte\address} = \text{base\address} + r \times \text{record\_size}

\]

where r is the zero‑based record number.

Advantages

  • Immediate retrieval of any record → O(1) time.
  • Ideal for applications that frequently look up records by their position (e.g., random‑access logs).

Disadvantages

  • Only works with fixed‑length records; variable‑length records require an index.
  • Insertion or deletion usually requires moving large blocks of data to keep records contiguous.

Typical error handling

  • Validate that the calculated address is within the file size before seeking.
  • Check the return status of seek and read operations.
  • Handle “record not found” when the calculated position lies beyond the last record.

Example (pseudo‑code)

recordSize = 128 // bytes

target = 57 // 58th record (0‑based)

address = baseAddress + target*recordSize

IF NOT SEEK(file, address) THEN

DISPLAY "Seek failed – invalid record number"

STOP

END IF

READ record

IF READ_ERROR THEN

DISPLAY "Read error"

STOP

END IF

DISPLAY record

5 Indexed Access

An index is a separate, ordered structure that stores pairs (key, address).

The index is searched first; the resulting address is then used for a direct read of the actual record.

Types of indexes

  • Primary index – built on a unique key (e.g., student number). Each key appears once.
  • Secondary index – built on a non‑unique attribute (e.g., surname). May point to a list of matching addresses.

Typical steps

  1. Search the index for the required key (usually binary search).
  2. Obtain the disk address (or block number) of the record.
  3. Perform a direct read of that address.

Advantages

  • Fast look‑up even for very large files – O(log m) where m is the number of index entries.
  • Supports variable‑length records because the index stores the exact address.
  • Multiple indexes can be created for different keys, allowing flexible queries.

Disadvantages

  • Extra storage is required for the index (typically 5‑10 % of the data file).
  • The index must be kept synchronised with the data file after inserts, deletes or updates.

Typical error handling

  • Check that the index file opens successfully.
  • Detect “key not found” after the binary search and report it gracefully.
  • Validate the address returned by the index before seeking the data file.

Example (pseudo‑code)

# index is a sorted array of (key, address) pairs

pos = binarySearch(index, targetKey)

IF pos = NOT_FOUND THEN

DISPLAY "Record not found"

STOP

END IF

address = index[pos].address

IF NOT SEEK(dataFile, address) THEN

DISPLAY "Invalid address in index"

STOP

END IF

READ record

IF READ_ERROR THEN

DISPLAY "Read error"

STOP

END IF

DISPLAY record

6 Hash‑Based Access

Hashing converts a search key into a location using a hash function. The location is usually a bucket in a hash table stored on disk.

Key concepts

  • Hash function h(k) – maps key k to an integer in the range 0 … (B‑1), where B is the number of buckets.
  • Collision handling – two keys may map to the same bucket. Common strategies are:

    • Separate chaining – each bucket holds a linked list (or an array) of entries.
    • Open addressing – the record is placed in another bucket found by probing (linear, quadratic or double hashing).

  • Load factor (α) – the ratio n / B (records per bucket).

    • Typical desirable range: 0.5 ≤ α ≤ 0.75.
    • If α grows too large, performance degrades and a rehash (create a larger table and re‑insert all records) is required.

Separate chaining – example (pseudo‑code)

function hash(key):

return (key MOD bucketCount)

bucket = hash(targetKey)

FOR each entry IN bucketList[bucket] DO

IF entry.key = targetKey THEN

SEEK dataFile TO entry.address

READ record

DISPLAY record

EXIT

END IF

END FOR

DISPLAY "Record not found"

Open addressing – linear probing (pseudo‑code)

function hash(key):

return (key MOD bucketCount)

function probe(key, i):

return (hash(key) + i) MOD bucketCount # linear probing

i = 0

bucket = probe(targetKey, i)

WHILE bucket NOT EMPTY DO

IF bucket.key = targetKey THEN

SEEK dataFile TO bucket.address

READ record

DISPLAY record

EXIT

END IF

i = i + 1

bucket = probe(targetKey, i)

END WHILE

DISPLAY "Record not found"

Advantages

  • Average‑case look‑up time close to O(1) when the load factor is kept low.
  • Works with variable‑length records (the bucket stores the address of each record).
  • Insertion is fast – only the appropriate bucket (or probe sequence) is examined.

Disadvantages

  • Performance degrades as the table fills; re‑hashing is expensive.
  • Extra storage for the hash table and any overflow structures (chains or probe metadata).
  • Choosing a good hash function is critical; a poor function leads to many collisions.

Typical error handling

  • Validate that the hash function returns a value within 0 … B‑1.
  • Detect a full table (no empty bucket found after B probes) and trigger a re‑hash.
  • Check for I/O errors when seeking to the address stored in a bucket.

7 Comparison of Access Methods

MethodTypical use‑caseSearch complexityRecord‑size requirementStorage overhead
SequentialFull‑file scans (e.g., payroll, batch reports)O(n)Fixed or variable (text or binary)None
Random (Direct)Look‑ups by record number or offsetO(1)Fixed length (binary)None
IndexedSearches by key value (e.g., customer ID, employee number)O(log m)Fixed or variableIndex blocks ≈ 5‑10 % of data
Hash‑basedVery fast look‑ups where a good hash function exists; frequent inserts≈ O(1) average (depends on load factor)Fixed or variableHash table + possible overflow structures

8 Choosing an Access Method – Checklist

When a problem statement is given, evaluate the following characteristics and match them to the most suitable method.

CharacteristicRecommended method(s)
All records must be processed (e.g., generate a report)Sequential access (text or binary)
Records are of identical size and are addressed by position (e.g., log entry number)Random (direct) access
Frequent look‑ups by a unique key; records may be variable lengthPrimary index (or secondary index if non‑unique)
Look‑ups by a non‑unique attribute with many matches (e.g., surname)Secondary index (points to a list) or hash with chaining
Very high insert/delete rate; keys are known and a good hash function existsHash‑based access (choose open addressing for low‑memory, chaining for simplicity)
File must be portable across platforms and human‑readableText file + sequential access (or index stored separately as a text file)
Storage space is at a premiumDirect access (no extra structures) or carefully sized hash table

9 Common Error‑Handling Practices for File I/O

  • Open – always test the return code; if it fails, display a clear message and abort or request an alternative file.
  • Read / Write – compare the number of bytes actually transferred with the expected size; on mismatch, treat it as an error.
  • Seek / Position – verify that the calculated address is within 0 … fileSize‑1. If seek fails, re‑initialise the file pointer or rebuild the index/hash.
  • End‑of‑File – distinguish between a legitimate EOF (normal termination of a sequential scan) and an unexpected early EOF (corrupt file).
  • Exception handling – in languages that support exceptions (Java, Python), catch IOError, FileNotFoundException etc., and provide a user‑friendly fallback.

10 Link‑Forward to Related Syllabus Topics

13.1 User‑Defined Data Types

Custom types (enumerated, record/structure, class) let programmers group related fields – essential when designing the record layout for any of the access methods above. See the “User‑Defined Data Types” notes for examples in Python, Java and Pseudocode.

13.3 Floating‑Point Representation

Understanding IEEE‑754 binary floating‑point is required for AO2 questions that involve numerical data stored in files. The representation influences rounding error and storage size (single‑ vs double‑precision). Refer to the “Floating‑Point Representation” section for details.

12.3 Program Testing & Maintenance

After choosing an access method, appropriate testing (unit, integration, black‑box) must be planned. A simple test plan for a file‑handling module is provided in the “Testing & Maintenance” notes.

11 Suggested Diagram (for classroom use)

Flowchart showing Sequential → read next, Direct → calculate address, Indexed → search index then read, Hash → compute bucket then follow chain

Flowchart summarising the four file‑access techniques and the typical decision points.