Show understanding of the methods of file organisation and select an appropriate method of file organisation and file access for a given problem

13.2 File Organisation and Access

Learning Objective (AO1‑AO3)

Show understanding of the methods of file organisation and select an appropriate method of file organisation and file access for a given problem.

Why This Topic Matters (AO1‑AO3)

  • AO1 – Knowledge: Define file organisation, file access, primary/secondary indexes, hash functions and collision‑resolution techniques.
  • AO2 – Application: Analyse the performance of different organisations for given data‑sets and query patterns.
  • AO3 – Design & Evaluation: Choose, justify and evaluate the most suitable file‑organisation/access method for a real‑world scenario.

Key Concepts

  • File organisation: Physical layout of records on secondary storage (disk, SSD, magnetic tape).
  • File access method: Algorithm used to retrieve a record (sequential, direct, indexed, hashed).
  • Primary key: Unique field used to locate a record.
  • Index: Auxiliary structure that maps key values to record locations (byte offset or block number).
  • Hash function h(k): Maps a key k to a bucket number; must distribute keys uniformly.
  • Collision resolution: Techniques such as chaining, linear probing, quadratic probing, double hashing.
  • Hybrid (indexed‑sequential) organisation: Combines sequential storage with one or more indexes.

Methods of File Organisation

MethodStructure & Key OperationsAdvantagesDisadvantagesTypical Use‑Cases
Sequential (Linear)

  • Records stored one after another in the order of insertion.
  • Access: read from the start until the required record is found (or end‑of‑file).

  • Very simple to implement.
  • Excellent for bulk loading, full‑file scans and batch processing.

  • Random access is O(n) – must read all preceding records.
  • Insert/delete in the middle requires shifting large blocks of data.

Log files, archival data, transaction journals.
Direct (Random) Access

  • Record address calculated from the key: p = (k mod m) where m = number of slots.
  • No separate index required; each slot holds a record or a pointer to an overflow area.

  • Very fast single‑record retrieval – O(1) average.
  • Simple arithmetic; no index maintenance.

  • Collisions must be handled (chaining or open addressing).
  • Works best when keys are uniformly distributed and records are fixed‑length.

Employee ID files, static lookup tables.
Hashed Organisation

  • Hash function h(k) maps a key to a bucket.
  • Collision resolution:

    • Chaining: each bucket holds a linked list of records.
    • Open addressing: probing sequences (linear, quadratic, double hashing).

  • Average‑case constant‑time access (O(1)).
  • Efficient for large, dynamic data sets with frequent inserts/deletes.

  • Performance degrades as the load factor ↑ (more collisions).
  • Choosing a good hash function is critical; poor distribution → O(n) worst case.

Symbol tables, caching layers, password‑hash stores.
Indexed Organisation

  • Two files: data file (records) and index file (key → byte offset).
  • Indexes may be:

    • Simple (sorted) index: linear or binary search.
    • B‑tree index: balanced tree; search O(log n), insert/delete O(log n).

  • Fast point look‑ups (O(log n) with B‑tree).
  • Supports range queries efficiently (scan leaf nodes).

  • Extra storage for the index.
  • Index must be updated on every insert/delete.

Student databases, inventory systems, file‑systems (e.g., NTFS MFT).
Indexed‑Sequential (Hybrid)

  • Data stored sequentially; one or more indexes point to block boundaries.
  • Search proceeds: index → block → sequential scan within block.

  • Good compromise: fast point look‑ups and efficient range scans.
  • Reduces need to read the whole file for most queries.

  • More complex implementation; requires periodic re‑organisation (splitting/merging blocks).
  • Additional storage for index blocks.

Bank transaction files, library catalogues, large‑scale reporting systems.

File Access Methods (How Records Are Retrieved)

  • Sequential Access: Read records in order from the start; ideal for full‑file processing.
  • Direct (Random) Access: Compute or look up the exact location and read it immediately.
  • Indexed Access: Search an index first (binary search or B‑tree), then fetch the record using the pointer.
  • Hashed Access: Apply the hash function to obtain the bucket; resolve collisions as required.

Complexity Summary (AO2)

MethodSearchInsertDeleteRange Query
SequentialO(n)O(n) (shift)O(n)O(n)
Direct (fixed‑length)O(1)O(1) (if slot free) / O(n) (overflow)VariesNot applicable
Hashed (good load factor)O(1) avg, O(n) worstO(1) avgO(1) avgNot applicable
Indexed (B‑tree)O(log n)O(log n)O(log n)O(log n + k) (k = results)
Indexed‑SequentialO(log n + b) (b = block size)O(b) (re‑organisation occasional)O(b)O(log n + k)

Choosing the Right Organisation & Access Method

Answer the checklist questions, then match the pattern to the most suitable method.

ConsiderationImplication for Choice
Primary operation (search vs bulk read)Search‑heavy → indexed or hashed; bulk‑read → sequential.
Record length (fixed vs variable)Fixed → direct/hashed easy; variable → need pointers (index) or length fields.
Key distribution (uniform vs clustered)Uniform → direct or hashed works well; clustered → indexed‑sequential reduces collisions.
Frequency of inserts/deletesHigh → hashed or indexed‑sequential; low → pure sequential.
Need for range queriesIndexed (B‑tree) or indexed‑sequential gives O(log n + k) for k results.
Storage constraintsLimited extra space → sequential or direct; ample space → indexes or hash tables.

Worked Example – Selecting a File Organisation

Problem Statement

A university must store up to 200 000 student records (StudentID, Name, Course, Year, GPA). Requirements:

  • Fast lookup of a single student by StudentID (numeric, uniformly distributed).
  • Frequent updates (add, delete, modify).
  • Range queries to list all students in a particular Course and Year.
  • Names are of variable length.

Analysis

  1. Key characteristics

    • Primary key = StudentID (numeric, uniform).
    • Secondary query fields = (Course, Year) – need ordered access.
    • Variable‑length Name field → cannot rely on fixed‑size records for direct positioning.

  2. Operation mix

    • Point look‑ups dominate → O(log n) is acceptable.
    • Updates are frequent → minimise re‑organisation cost.
    • Range queries require ordered traversal of a subset.

Recommended Solution

  • File Organisation: Indexed‑Sequential (Hybrid)
  • Indexes:

    • Primary B‑tree index on StudentID → fast point look‑ups.
    • Secondary B‑tree index on (Course, Year) → efficient range scans.

  • Data file: Sequential storage of variable‑length records; each index entry stores the byte offset of the record.
  • Update handling: Maintain a free‑space list (linked list of holes) inside the data file; deleted record space is reused on insert.

Justification (AO2 & AO3)

  • Performance: Primary B‑tree gives O(log n) lookup; secondary index gives O(log n + k) for a range of k students.
  • Updates: Inserting a record only touches the free‑space list and the two indexes – no massive data movement.
  • Variable‑length support: Offsets in the indexes decouple record size from the access method.
  • Storage trade‑off: Extra space for two B‑tree indexes is justified by the speed gains for the expected query mix.

Diagram (description – draw in class)

  • Data file: sequential blocks, each containing a variable‑length record.
  • Primary B‑tree index: StudentID → byte offset.
  • Secondary B‑tree index: (Course, Year) → list of byte offsets.
  • Free‑space list: linked list of empty slots within the data file.

Practical Tip (Paper 4)

When implementing the above in Java or Python:

  • Use RandomAccessFile (Java) or mmap (Python) to read/write at specific byte offsets.
  • Represent a B‑tree node as a fixed‑size block (e.g., 4 KB) to simplify disk I/O.
  • Write a simple test plan:

    1. Insert 10 000 random student records.
    2. Search for 100 random StudentID values – verify O(log n) time.
    3. Delete 500 records – check free‑space list updates.
    4. Run a range query for a specific Course/Year – confirm correct number of results.

Summary Checklist (Exam‑Ready)

  1. Identify the dominant operation(s) – search, update, bulk read, range query.
  2. Determine record size (fixed/variable) and key distribution.
  3. Match the pattern to a file‑organisation method:

    • Sequential – bulk scans, low update frequency.
    • Direct – fixed‑length, uniform keys, minimal updates.
    • Hashed – high insert/delete, uniform keys, need O(1) average access.
    • Indexed – point look‑ups & range queries, variable length, moderate updates.
    • Indexed‑Sequential – mixed workload (point + range) with frequent updates.

  4. Choose the appropriate access method (sequential, direct, indexed, hashed) that matches the organisation.
  5. Justify your choice in terms of:

    • Time complexity (average & worst case).
    • Space overhead (index or hash‑table size).
    • Update cost.
    • Suitability for variable‑length records.

  6. Map your answer to the assessment objectives (AO1‑AO3).

Action‑Oriented Review Checklist – Alignment with Cambridge AS & A‑Level Computer Science (9618) Syllabus (2026)

Use this table to verify that the notes cover every required sub‑topic for the “File Organisation and Access” section of the syllabus.

Syllabus Sub‑sectionPresent in Notes?Action Required
13.2.1 Definition of file organisation and file accessNone
13.2.2 Sequential (linear) organisationNone
13.2.3 Direct (random) organisationNone
13.2.4 Hashed organisation – hash functions, collision resolutionNone
13.2.5 Indexed organisation – simple index, B‑tree indexNone
13.2.6 Indexed‑sequential (hybrid) organisationNone
13.2.7 Complexity analysis (O‑notation) for each methodNone
13.2.8 Decision‑making checklist for selecting a methodNone
13.2.9 Worked example with justification (AO2 & AO3)None
13.2.10 Practical implementation tips for Paper 4None
13.2.11 Link to other syllabus areas (e.g., data structures, algorithms)Add a brief paragraph linking B‑trees to the “Data Structures” topic and hashing to “Algorithms”.

Link to Other Syllabus Areas (AO1‑AO3)

  • Data structures: B‑trees and hash tables are covered in the “Data structures” unit; understanding their node layout reinforces tree traversal concepts.
  • Algorithms: Collision‑resolution probing strategies and binary search on indexes illustrate algorithmic thinking and complexity analysis.
  • System software: File‑system implementations (e.g., NTFS MFT) use indexed‑sequential techniques – useful for the “System software” unit.

AO Mapping Table for This Topic

ContentAO1AO2AO3
Definitions of file organisation & access methods
Characteristics of sequential, direct, hashed, indexed, indexed‑sequential files
Complexity analysis (O‑notation) for each method
Decision‑making checklist for selecting a method
Worked example with justification
Practical implementation tips for Paper 4

Final Quick‑Reference (Exam‑Ready)

  • Sequential: Simple, great for bulk reads, poor random access.
  • Direct: Fixed‑length records, uniform keys, O(1) lookup.
  • Hashed: Constant‑time average, needs good hash function, handles dynamic data.
  • Indexed: B‑tree gives O(log n) point & range queries; extra storage for index.
  • Indexed‑Sequential: Best for mixed workloads (point + range) with frequent updates.