Cambridge Notes, Past Papers, Revision Questions

13.2 File Organisation and Access

Learning Objective (AO1‑AO3)

Show understanding of the methods of file organisation and select an appropriate method of file organisation and file access for a given problem.

Why This Topic Matters (AO1‑AO3)

AO1 – Knowledge: Define file organisation, file access, primary/secondary indexes, hash functions and collision‑resolution techniques.

AO2 – Application: Analyse the performance of different organisations for given data‑sets and query patterns.

AO3 – Design & Evaluation: Choose, justify and evaluate the most suitable file‑organisation/access method for a real‑world scenario.

Key Concepts

File organisation: Physical layout of records on secondary storage (disk, SSD, magnetic tape).

File access method: Algorithm used to retrieve a record (sequential, direct, indexed, hashed).

Primary key: Unique field used to locate a record.

Index: Auxiliary structure that maps key values to record locations (byte offset or block number).

Hash function h(k): Maps a key k to a bucket number; must distribute keys uniformly.

Collision resolution: Techniques such as chaining, linear probing, quadratic probing, double hashing.

Hybrid (indexed‑sequential) organisation: Combines sequential storage with one or more indexes.

Methods of File Organisation

Method	Structure & Key Operations	Advantages	Disadvantages	Typical Use‑Cases
Sequential (Linear)	Records stored one after another in the order of insertion. Access: read from the start until the required record is found (or end‑of‑file).	Very simple to implement. Excellent for bulk loading, full‑file scans and batch processing.	Random access is `O(n)` – must read all preceding records. Insert/delete in the middle requires shifting large blocks of data.	Log files, archival data, transaction journals.
Direct (Random) Access	Record address calculated from the key: `p = (k mod m)` where m = number of slots. No separate index required; each slot holds a record or a pointer to an overflow area.	Very fast single‑record retrieval – `O(1)` average. Simple arithmetic; no index maintenance.	Collisions must be handled (chaining or open addressing). Works best when keys are uniformly distributed and records are fixed‑length.	Employee ID files, static lookup tables.
Hashed Organisation	Hash function `h(k)` maps a key to a bucket. Collision resolution: Chaining: each bucket holds a linked list of records. Open addressing: probing sequences (linear, quadratic, double hashing).	Average‑case constant‑time access (`O(1)`). Efficient for large, dynamic data sets with frequent inserts/deletes.	Performance degrades as the load factor ↑ (more collisions). Choosing a good hash function is critical; poor distribution → `O(n)` worst case.	Symbol tables, caching layers, password‑hash stores.
Indexed Organisation	Two files: data file (records) and index file (key → byte offset). Indexes may be: Simple (sorted) index: linear or binary search. B‑tree index: balanced tree; search `O(log n)`, insert/delete `O(log n)`.	Fast point look‑ups (`O(log n)` with B‑tree). Supports range queries efficiently (scan leaf nodes).	Extra storage for the index. Index must be updated on every insert/delete.	Student databases, inventory systems, file‑systems (e.g., NTFS MFT).
Indexed‑Sequential (Hybrid)	Data stored sequentially; one or more indexes point to block boundaries. Search proceeds: index → block → sequential scan within block.	Good compromise: fast point look‑ups and efficient range scans. Reduces need to read the whole file for most queries.	More complex implementation; requires periodic re‑organisation (splitting/merging blocks). Additional storage for index blocks.	Bank transaction files, library catalogues, large‑scale reporting systems.

File Access Methods (How Records Are Retrieved)

Sequential Access: Read records in order from the start; ideal for full‑file processing.

Direct (Random) Access: Compute or look up the exact location and read it immediately.

Indexed Access: Search an index first (binary search or B‑tree), then fetch the record using the pointer.

Hashed Access: Apply the hash function to obtain the bucket; resolve collisions as required.

Complexity Summary (AO2)

Method	Search	Insert	Delete	Range Query
Sequential	`O(n)`	`O(n)` (shift)	`O(n)`	`O(n)`
Direct (fixed‑length)	`O(1)`	`O(1)` (if slot free) / `O(n)` (overflow)	Varies	Not applicable
Hashed (good load factor)	`O(1)` avg, `O(n)` worst	`O(1)` avg	`O(1)` avg	Not applicable
Indexed (B‑tree)	`O(log n)`	`O(log n)`	`O(log n)`	`O(log n + k)` (k = results)
Indexed‑Sequential	`O(log n + b)` (b = block size)	`O(b)` (re‑organisation occasional)	`O(b)`	`O(log n + k)`

Choosing the Right Organisation & Access Method

Answer the checklist questions, then match the pattern to the most suitable method.

Consideration	Implication for Choice
Primary operation (search vs bulk read)	Search‑heavy → indexed or hashed; bulk‑read → sequential.
Record length (fixed vs variable)	Fixed → direct/hashed easy; variable → need pointers (index) or length fields.
Key distribution (uniform vs clustered)	Uniform → direct or hashed works well; clustered → indexed‑sequential reduces collisions.
Frequency of inserts/deletes	High → hashed or indexed‑sequential; low → pure sequential.
Need for range queries	Indexed (B‑tree) or indexed‑sequential gives `O(log n + k)` for k results.
Storage constraints	Limited extra space → sequential or direct; ample space → indexes or hash tables.

Worked Example – Selecting a File Organisation

Problem Statement

A university must store up to 200 000 student records (StudentID, Name, Course, Year, GPA). Requirements:

Fast lookup of a single student by StudentID (numeric, uniformly distributed).

Frequent updates (add, delete, modify).

Range queries to list all students in a particular Course and Year.

Names are of variable length.

Analysis

Key characteristics
- Primary key = StudentID (numeric, uniform).
- Secondary query fields = (Course, Year) – need ordered access.
- Variable‑length Name field → cannot rely on fixed‑size records for direct positioning.

Operation mix
- Point look‑ups dominate → O(log n) is acceptable.
- Updates are frequent → minimise re‑organisation cost.
- Range queries require ordered traversal of a subset.

Justification (AO2 & AO3)

Performance: Primary B‑tree gives O(log n) lookup; secondary index gives O(log n + k) for a range of k students.

Updates: Inserting a record only touches the free‑space list and the two indexes – no massive data movement.

Variable‑length support: Offsets in the indexes decouple record size from the access method.

Storage trade‑off: Extra space for two B‑tree indexes is justified by the speed gains for the expected query mix.

Diagram (description – draw in class)

Data file: sequential blocks, each containing a variable‑length record.

Primary B‑tree index: StudentID → byte offset.

Secondary B‑tree index: (Course, Year) → list of byte offsets.

Free‑space list: linked list of empty slots within the data file.

Practical Tip (Paper 4)

When implementing the above in Java or Python:

Use RandomAccessFile (Java) or mmap (Python) to read/write at specific byte offsets.

Represent a B‑tree node as a fixed‑size block (e.g., 4 KB) to simplify disk I/O.

Write a simple test plan:
1. Insert 10 000 random student records.
2. Search for 100 random StudentID values – verify O(log n) time.
3. Delete 500 records – check free‑space list updates.
4. Run a range query for a specific Course/Year – confirm correct number of results.

Summary Checklist (Exam‑Ready)

Identify the dominant operation(s) – search, update, bulk read, range query.

Determine record size (fixed/variable) and key distribution.

Match the pattern to a file‑organisation method:
- Sequential – bulk scans, low update frequency.
- Direct – fixed‑length, uniform keys, minimal updates.
- Hashed – high insert/delete, uniform keys, need O(1) average access.
- Indexed – point look‑ups & range queries, variable length, moderate updates.
- Indexed‑Sequential – mixed workload (point + range) with frequent updates.

Choose the appropriate access method (sequential, direct, indexed, hashed) that matches the organisation.

Justify your choice in terms of:
- Time complexity (average & worst case).
- Space overhead (index or hash‑table size).
- Update cost.
- Suitability for variable‑length records.

Map your answer to the assessment objectives (AO1‑AO3).

Action‑Oriented Review Checklist – Alignment with Cambridge AS & A‑Level Computer Science (9618) Syllabus (2026)

Use this table to verify that the notes cover every required sub‑topic for the “File Organisation and Access” section of the syllabus.

Syllabus Sub‑section	Present in Notes?	Action Required
13.2.1 Definition of file organisation and file access	✔	None
13.2.2 Sequential (linear) organisation	✔	None
13.2.3 Direct (random) organisation	✔	None
13.2.4 Hashed organisation – hash functions, collision resolution	✔	None
13.2.5 Indexed organisation – simple index, B‑tree index	✔	None
13.2.6 Indexed‑sequential (hybrid) organisation	✔	None
13.2.7 Complexity analysis (O‑notation) for each method	✔	None
13.2.8 Decision‑making checklist for selecting a method	✔	None
13.2.9 Worked example with justification (AO2 & AO3)	✔	None
13.2.10 Practical implementation tips for Paper 4	✔	None
13.2.11 Link to other syllabus areas (e.g., data structures, algorithms)	✘	Add a brief paragraph linking B‑trees to the “Data Structures” topic and hashing to “Algorithms”.

Link to Other Syllabus Areas (AO1‑AO3)

Data structures: B‑trees and hash tables are covered in the “Data structures” unit; understanding their node layout reinforces tree traversal concepts.

Algorithms: Collision‑resolution probing strategies and binary search on indexes illustrate algorithmic thinking and complexity analysis.

System software: File‑system implementations (e.g., NTFS MFT) use indexed‑sequential techniques – useful for the “System software” unit.

AO Mapping Table for This Topic

Content	AO1	AO2	AO3
Definitions of file organisation & access methods	✔
Characteristics of sequential, direct, hashed, indexed, indexed‑sequential files	✔
Complexity analysis (O‑notation) for each method		✔
Decision‑making checklist for selecting a method			✔
Worked example with justification		✔	✔
Practical implementation tips for Paper 4			✔

Final Quick‑Reference (Exam‑Ready)

Sequential: Simple, great for bulk reads, poor random access.

Direct: Fixed‑length records, uniform keys, O(1) lookup.

Hashed: Constant‑time average, needs good hash function, handles dynamic data.

Indexed: B‑tree gives O(log n) point & range queries; extra storage for index.

Indexed‑Sequential: Best for mixed workloads (point + range) with frequent updates.

Show understanding of the methods of file organisation and select an appropriate method of file organisation and file access for a given problem

13.2 File Organisation and Access