13.2 File Organisation and Access
Learning Objective (AO1‑AO3)
Show understanding of the methods of file organisation and select an appropriate method of file organisation and file access for a given problem.
Why This Topic Matters (AO1‑AO3)
- AO1 – Knowledge: Define file organisation, file access, primary/secondary indexes, hash functions and collision‑resolution techniques.
- AO2 – Application: Analyse the performance of different organisations for given data‑sets and query patterns.
- AO3 – Design & Evaluation: Choose, justify and evaluate the most suitable file‑organisation/access method for a real‑world scenario.
Key Concepts
- File organisation: Physical layout of records on secondary storage (disk, SSD, magnetic tape).
- File access method: Algorithm used to retrieve a record (sequential, direct, indexed, hashed).
- Primary key: Unique field used to locate a record.
- Index: Auxiliary structure that maps key values to record locations (byte offset or block number).
- Hash function
h(k): Maps a key k to a bucket number; must distribute keys uniformly. - Collision resolution: Techniques such as chaining, linear probing, quadratic probing, double hashing.
- Hybrid (indexed‑sequential) organisation: Combines sequential storage with one or more indexes.
Methods of File Organisation
| Method | Structure & Key Operations | Advantages | Disadvantages | Typical Use‑Cases |
|---|
| Sequential (Linear) | - Records stored one after another in the order of insertion.
- Access: read from the start until the required record is found (or end‑of‑file).
| - Very simple to implement.
- Excellent for bulk loading, full‑file scans and batch processing.
| - Random access is
O(n) – must read all preceding records. - Insert/delete in the middle requires shifting large blocks of data.
| Log files, archival data, transaction journals. |
| Direct (Random) Access | - Record address calculated from the key:
p = (k mod m) where m = number of slots. - No separate index required; each slot holds a record or a pointer to an overflow area.
| - Very fast single‑record retrieval –
O(1) average. - Simple arithmetic; no index maintenance.
| - Collisions must be handled (chaining or open addressing).
- Works best when keys are uniformly distributed and records are fixed‑length.
| Employee ID files, static lookup tables. |
| Hashed Organisation | - Hash function
h(k) maps a key to a bucket. - Collision resolution:
- Chaining: each bucket holds a linked list of records.
- Open addressing: probing sequences (linear, quadratic, double hashing).
| - Average‑case constant‑time access (
O(1)). - Efficient for large, dynamic data sets with frequent inserts/deletes.
| - Performance degrades as the load factor ↑ (more collisions).
- Choosing a good hash function is critical; poor distribution →
O(n) worst case.
| Symbol tables, caching layers, password‑hash stores. |
| Indexed Organisation | - Two files: data file (records) and index file (key → byte offset).
- Indexes may be:
- Simple (sorted) index: linear or binary search.
- B‑tree index: balanced tree; search
O(log n), insert/delete O(log n).
| - Fast point look‑ups (
O(log n) with B‑tree). - Supports range queries efficiently (scan leaf nodes).
| - Extra storage for the index.
- Index must be updated on every insert/delete.
| Student databases, inventory systems, file‑systems (e.g., NTFS MFT). |
| Indexed‑Sequential (Hybrid) | - Data stored sequentially; one or more indexes point to block boundaries.
- Search proceeds: index → block → sequential scan within block.
| - Good compromise: fast point look‑ups and efficient range scans.
- Reduces need to read the whole file for most queries.
| - More complex implementation; requires periodic re‑organisation (splitting/merging blocks).
- Additional storage for index blocks.
| Bank transaction files, library catalogues, large‑scale reporting systems. |
File Access Methods (How Records Are Retrieved)
- Sequential Access: Read records in order from the start; ideal for full‑file processing.
- Direct (Random) Access: Compute or look up the exact location and read it immediately.
- Indexed Access: Search an index first (binary search or B‑tree), then fetch the record using the pointer.
- Hashed Access: Apply the hash function to obtain the bucket; resolve collisions as required.
Complexity Summary (AO2)
| Method | Search | Insert | Delete | Range Query |
|---|
| Sequential | O(n) | O(n) (shift) | O(n) | O(n) |
| Direct (fixed‑length) | O(1) | O(1) (if slot free) / O(n) (overflow) | Varies | Not applicable |
| Hashed (good load factor) | O(1) avg, O(n) worst | O(1) avg | O(1) avg | Not applicable |
| Indexed (B‑tree) | O(log n) | O(log n) | O(log n) | O(log n + k) (k = results) |
| Indexed‑Sequential | O(log n + b) (b = block size) | O(b) (re‑organisation occasional) | O(b) | O(log n + k) |
Choosing the Right Organisation & Access Method
Answer the checklist questions, then match the pattern to the most suitable method.
| Consideration | Implication for Choice |
|---|
| Primary operation (search vs bulk read) | Search‑heavy → indexed or hashed; bulk‑read → sequential. |
| Record length (fixed vs variable) | Fixed → direct/hashed easy; variable → need pointers (index) or length fields. |
| Key distribution (uniform vs clustered) | Uniform → direct or hashed works well; clustered → indexed‑sequential reduces collisions. |
| Frequency of inserts/deletes | High → hashed or indexed‑sequential; low → pure sequential. |
| Need for range queries | Indexed (B‑tree) or indexed‑sequential gives O(log n + k) for k results. |
| Storage constraints | Limited extra space → sequential or direct; ample space → indexes or hash tables. |
Worked Example – Selecting a File Organisation
Problem Statement
A university must store up to 200 000 student records (StudentID, Name, Course, Year, GPA). Requirements:
- Fast lookup of a single student by
StudentID (numeric, uniformly distributed). - Frequent updates (add, delete, modify).
- Range queries to list all students in a particular
Course and Year. - Names are of variable length.
Analysis
- Key characteristics
- Primary key =
StudentID (numeric, uniform). - Secondary query fields =
(Course, Year) – need ordered access. - Variable‑length
Name field → cannot rely on fixed‑size records for direct positioning.
- Operation mix
- Point look‑ups dominate →
O(log n) is acceptable. - Updates are frequent → minimise re‑organisation cost.
- Range queries require ordered traversal of a subset.
Recommended Solution
- File Organisation: Indexed‑Sequential (Hybrid)
- Indexes:
- Primary B‑tree index on
StudentID → fast point look‑ups. - Secondary B‑tree index on
(Course, Year) → efficient range scans.
- Data file: Sequential storage of variable‑length records; each index entry stores the byte offset of the record.
- Update handling: Maintain a free‑space list (linked list of holes) inside the data file; deleted record space is reused on insert.
Justification (AO2 & AO3)
- Performance: Primary B‑tree gives
O(log n) lookup; secondary index gives O(log n + k) for a range of k students. - Updates: Inserting a record only touches the free‑space list and the two indexes – no massive data movement.
- Variable‑length support: Offsets in the indexes decouple record size from the access method.
- Storage trade‑off: Extra space for two B‑tree indexes is justified by the speed gains for the expected query mix.
Diagram (description – draw in class)
- Data file: sequential blocks, each containing a variable‑length record.
- Primary B‑tree index:
StudentID → byte offset. - Secondary B‑tree index:
(Course, Year) → list of byte offsets. - Free‑space list: linked list of empty slots within the data file.
Practical Tip (Paper 4)
When implementing the above in Java or Python:
- Use
RandomAccessFile (Java) or mmap (Python) to read/write at specific byte offsets. - Represent a B‑tree node as a fixed‑size block (e.g., 4 KB) to simplify disk I/O.
- Write a simple test plan:
- Insert 10 000 random student records.
- Search for 100 random
StudentID values – verify O(log n) time. - Delete 500 records – check free‑space list updates.
- Run a range query for a specific
Course/Year – confirm correct number of results.
Summary Checklist (Exam‑Ready)
- Identify the dominant operation(s) – search, update, bulk read, range query.
- Determine record size (fixed/variable) and key distribution.
- Match the pattern to a file‑organisation method:
- Sequential – bulk scans, low update frequency.
- Direct – fixed‑length, uniform keys, minimal updates.
- Hashed – high insert/delete, uniform keys, need
O(1) average access. - Indexed – point look‑ups & range queries, variable length, moderate updates.
- Indexed‑Sequential – mixed workload (point + range) with frequent updates.
- Choose the appropriate access method (sequential, direct, indexed, hashed) that matches the organisation.
- Justify your choice in terms of:
- Time complexity (average & worst case).
- Space overhead (index or hash‑table size).
- Update cost.
- Suitability for variable‑length records.
- Map your answer to the assessment objectives (AO1‑AO3).
Action‑Oriented Review Checklist – Alignment with Cambridge AS & A‑Level Computer Science (9618) Syllabus (2026)
Use this table to verify that the notes cover every required sub‑topic for the “File Organisation and Access” section of the syllabus.
| Syllabus Sub‑section | Present in Notes? | Action Required |
|---|
| 13.2.1 Definition of file organisation and file access | ✔ | None |
| 13.2.2 Sequential (linear) organisation | ✔ | None |
| 13.2.3 Direct (random) organisation | ✔ | None |
| 13.2.4 Hashed organisation – hash functions, collision resolution | ✔ | None |
| 13.2.5 Indexed organisation – simple index, B‑tree index | ✔ | None |
| 13.2.6 Indexed‑sequential (hybrid) organisation | ✔ | None |
| 13.2.7 Complexity analysis (O‑notation) for each method | ✔ | None |
| 13.2.8 Decision‑making checklist for selecting a method | ✔ | None |
| 13.2.9 Worked example with justification (AO2 & AO3) | ✔ | None |
| 13.2.10 Practical implementation tips for Paper 4 | ✔ | None |
| 13.2.11 Link to other syllabus areas (e.g., data structures, algorithms) | ✘ | Add a brief paragraph linking B‑trees to the “Data Structures” topic and hashing to “Algorithms”. |
Link to Other Syllabus Areas (AO1‑AO3)
- Data structures: B‑trees and hash tables are covered in the “Data structures” unit; understanding their node layout reinforces tree traversal concepts.
- Algorithms: Collision‑resolution probing strategies and binary search on indexes illustrate algorithmic thinking and complexity analysis.
- System software: File‑system implementations (e.g., NTFS MFT) use indexed‑sequential techniques – useful for the “System software” unit.
AO Mapping Table for This Topic
| Content | AO1 | AO2 | AO3 |
|---|
| Definitions of file organisation & access methods | ✔ | | |
| Characteristics of sequential, direct, hashed, indexed, indexed‑sequential files | ✔ | | |
| Complexity analysis (O‑notation) for each method | | ✔ | |
| Decision‑making checklist for selecting a method | | | ✔ |
| Worked example with justification | | ✔ | ✔ |
| Practical implementation tips for Paper 4 | | | ✔ |
Final Quick‑Reference (Exam‑Ready)
- Sequential: Simple, great for bulk reads, poor random access.
- Direct: Fixed‑length records, uniform keys, O(1) lookup.
- Hashed: Constant‑time average, needs good hash function, handles dynamic data.
- Indexed: B‑tree gives O(log n) point & range queries; extra storage for index.
- Indexed‑Sequential: Best for mixed workloads (point + range) with frequent updates.