13.2 File Organisation & Access – Hashing Algorithms
1. File‑access methods (syllabus context)
The Cambridge AS & A‑Level syllabus expects you to know the three basic ways of accessing a file and to be able to decide when hashing is the most appropriate technique.
- Serial (sequential) access – records are processed one after another from the start of the file. Used when the whole file must be read (e.g., printing a register).
- Random (direct) access – a record can be retrieved by its address without reading preceding records. Typical for fixed‑length records stored in an array or on a disk block.
- Hashing (a specialised form of random access) – the address is computed from a key using a hash function. Gives (average) constant‑time access, ideal for large, unordered data sets.
When to choose hashing
| Consideration | Serial / Sequential | Random (fixed‑length) | Hashing |
| Fast lookup of a single record by key? |
No – must scan whole file |
Yes, if address is known |
Yes – address is computed from the key |
| Large static data set? |
Impractical |
Possible but may waste space |
Ideal – constant‑time average access |
| Range queries or ordered output required? |
Yes |
Yes (if ordered by key) |
No – hashing is unordered |
| Very limited memory? |
Low overhead |
Low overhead |
Open addressing uses low overhead; separate chaining needs extra pointers |
2. What is hashing?
Hashing is a random‑access file organisation in which a hash function h converts a key k from a large universe U into an index (or bucket number) in the range [0, m‑1], where m is the number of slots in the hash file.
Key terminology (syllabus wording)
- Key – the attribute used to locate a record (e.g., student number).
- Bucket / slot – a position in the hash file; the syllabus calls this a “slot in a hash‑file”.
- Load factor α –
α = n / m where n is the number of stored records.
- Collision – two different keys produce the same bucket.
- Clustering – groups of occupied slots that increase the cost of probing (primary clustering for linear probing, secondary clustering for double hashing).
3. Families of hash functions
3.1 Division method
h(k) = k mod m
Simple and fast. Works well when m is a prime that does not share factors with typical key patterns.
3.2 Multiplication method
h(k) = ⌊ m × ((k × A) mod 1) ⌋
where 0 < A < 1 (commonly A ≈ (√5‑1)/2 ≈ 0.618). The fractional part of kA is multiplied by m and truncated.
3.3 Linear‑transformation (universal) method
h(k) = ((a·k + b) mod p) mod m
with p a prime > m, and constants a, b satisfying 0 < a < p and 0 ≤ b < p.
Worked examples
| Method | Parameters | Key k | Result h(k) |
| Division |
m = 101 (prime) |
1 234 567 |
1 234 567 mod 101 = 84 |
| Multiplication |
m = 101, A = 0.618 |
1 234 567 |
k·A = 762 342.306 → frac = 0.306 → 0.306 × 101 ≈ 30.9 → h(k)=30 |
| Linear‑transformation |
p = 103, m = 101, a = 7, b = 3 |
1 234 567 |
(7·k + 3) mod 103 = 58 → 58 mod 101 = 58 |
4. Collision‑resolution strategies
4.1 Open addressing (all records stored in the primary array)
When a collision occurs the algorithm probes a sequence of alternative slots until an empty one is found.
| Probing technique | Probe function hi(k) | Typical use |
| Linear probing |
(h(k) + i) mod m |
Simple, good cache performance; avoid when α > 0.7. |
| Quadratic probing |
(h(k) + c₁i + c₂i²) mod m |
Reduces primary clustering; requires prime m and careful constants. |
| Double hashing |
(h₁(k) + i·h₂(k)) mod m |
Minimises both primary and secondary clustering; h₂(k) must never be 0. |
Pseudocode – Insertion (open addressing)
function insertOpen(key, record):
i ← 0
while i < m:
slot ← probe(key, i) // one of the three probe formulas
if table[slot] is empty:
table[slot] ← (key, record)
return slot
else if table[slot].key = key:
table[slot].record ← record // update existing key
return slot
i ← i + 1
raise "Hash table full"
Pseudocode – Search (open addressing)
function searchOpen(key):
i ← 0
while i < m:
slot ← probe(key, i)
if table[slot] is empty:
return NOT_FOUND
else if table[slot].key = key:
return table[slot].record
i ← i + 1
return NOT_FOUND
4.2 Separate chaining (each bucket holds a secondary list)
Each slot contains a pointer to a linked list (or another dynamic structure) that stores all records hashing to that slot.
Pseudocode – Insertion (separate chaining)
function insertChain(key, record):
slot ← h(key)
node ← table[slot] // head of the list (may be null)
while node ≠ null:
if node.key = key:
node.record ← record // update existing key
return
node ← node.next
// key not found – prepend new node
newNode ← Node(key, record, table[slot])
table[slot] ← newNode
Pseudocode – Search (separate chaining)
function searchChain(key):
slot ← h(key)
node ← table[slot]
while node ≠ null:
if node.key = key:
return node.record
node ← node.next
return NOT_FOUND
5. Comparison of collision‑resolution methods
| Method |
Space overhead |
Average successful search |
Average unsuccessful search |
Best‑fit scenarios (syllabus guidance) |
| Linear probing |
Low – single array of size m |
½ (1 + 1/(1‑α)) |
½ (1 + 1/(1‑α)²) |
Memory‑tight applications, α ≤ 0.7. |
| Quadratic probing |
Low |
Similar to linear probing but with less primary clustering. |
Similar to linear probing. |
When clustering must be limited but array size is fixed. |
| Double hashing |
Low |
≈ 1/(1‑α) (near‑optimal for open addressing). |
≈ 1/(1‑α) (near‑optimal). |
High‑performance needs, moderate load factor, good secondary hash. |
| Separate chaining |
Higher – pointer for each record. |
1 + α/2 |
1 + α |
Dynamic data sets where α may exceed 1, or when deletions are frequent. |
6. Performance considerations (AO2)
- Load factor α – keep α < 0.75 for open addressing; chaining tolerates α > 1 but performance degrades linearly.
- Uniform distribution – a good hash function spreads keys evenly; avoid patterns that cause many keys to share the same bucket.
- Re‑hashing – when α exceeds a preset threshold, allocate a larger table (usually ≈ 2 m or the next prime) and recompute all slots using the same or a new hash function.
- Clustering effects – primary clustering (linear probing) and secondary clustering (double hashing) increase the number of probes; choose probing method accordingly.
7. Advantages and disadvantages (AO1)
| Aspect | Advantages | Disadvantages |
| Average‑case complexity |
O(1) for search, insert, delete (when α is low). |
Worst‑case O(n) if many collisions occur. |
| Memory usage |
Open addressing uses only the primary array. |
Separate chaining needs extra pointers; open addressing suffers from clustering. |
| Order of records |
Not required – ideal for unordered data. |
Cannot perform efficient range queries or ordered traversals. |
| Implementation complexity |
Simple hash functions (division) are easy to code. |
Good hash functions and probing sequences require careful design. |
8. Decision‑making framework (choose the most appropriate file‑organisation method)
- Is the problem a range‑query or does it need ordered output?
→ Use sequential/ordered file (e.g., indexed sequential access).
- Do you know the exact address of each record (fixed‑length, known offset)?
→ Use random (direct) access.
- Do you need fast lookup by an arbitrary key and the data set is large/unordered?
→ Use hashing. Then decide:
- If memory is very limited and the load factor will stay < 0.7 → open addressing (linear or double hashing).
- If the file will grow beyond m or deletions are frequent → separate chaining.
- If clustering is a concern → prefer double hashing or quadratic probing.
9. Sample examination questions
9.1 Division‑method hashing with separate chaining
Question: A hash file uses m = 13 buckets and h(k) = k mod 13. Insert the keys 27, 40, 13, 55, 68, 81. Show the bucket contents and calculate the load factor.
Answer outline:
- Hash values:
- h(27) = 1, h(40) = 1, h(13) = 0, h(55) = 3, h(68) = 3, h(81) = 3.
- Bucket contents (chaining):
- 0: 13
- 1: 27 → 40
- 3: 55 → 68 → 81
- All other buckets empty.
- Load factor
α = 6 / 13 ≈ 0.46.
9.2 Multiplication‑method hashing with open addressing (double hashing)
Question: A hash table has m = 11 slots. Primary hash h₁(k) = k mod 11, secondary hash h₂(k) = 1 + (k mod 10). Insert the keys 22, 31, 44, 55 using double hashing. Show the probe sequence for each insertion.
Answer outline:
- Key 22:
h₁=0, h₂=1+(22 mod 10)=3 → slots probed: 0 (empty) → insert at 0.
- Key 31:
h₁=9, h₂=1+(31 mod 10)=2 → slot 9 empty → insert at 9.
- Key 44:
h₁=0, h₂=1+(44 mod 10)=5 → probe 0 (occupied), then 5 (empty) → insert at 5.
- Key 55:
h₁=0, h₂=1+(55 mod 10)=6 → probe 0 (occupied), 6 (empty) → insert at 6.
13 – Other AS & A‑Level Topics (quick reference)
13.1 Information representation
- Binary, octal, hexadecimal conversion; two’s complement for signed integers.
- ASCII (7‑bit) and Unicode (UTF‑8/UTF‑16) character sets.
- Bitmap vs. vector graphics – storage implications.
- Sound sampling: sample rate, bit depth, lossless (e.g., PCM) vs. lossy (e.g., MP3) compression.
- Simple data‑compression techniques: Run‑Length Encoding, Huffman coding, JPEG basics.
13.2 Communication
- Network devices: NIC, hub, switch, router, gateway.
- LAN vs. WAN characteristics; client‑server vs. peer‑to‑peer models.
- Topologies: bus, star, ring, mesh.
- Ethernet & CSMA‑CD basics; TCP/IP stack layers.
- IP addressing (IPv4/IPv6), subnet masks, public vs. private addresses, DHCP.
- DNS, URL structure, basics of cloud computing.
13.3 Hardware
- CPU registers (PC, IR, ACC, MAR, MDR) and their roles.
- Memory types: RAM (SRAM, DRAM), ROM (PROM, EPROM, EEPROM, Flash).
- Buffers and caches – purpose and typical sizes.
- Logic gates, truth tables, combinational circuits (adders, multiplexers).
- Simple embedded‑system examples (e.g., microcontroller‑controlled traffic light).
13.4 Processor fundamentals
- Von Neumann architecture diagram.
- Fetch‑decode‑execute cycle; register‑transfer notation.
- Interrupt handling – hardware vs. software interrupts.
- Basic assembly‑language instructions: data movement, arithmetic, logical, branch, and I/O.
- Addressing modes: immediate, direct, indirect, indexed.
- Bit‑wise operations: shift, rotate, mask.
13.5 System software
- Operating‑system services: process management, memory management, file management, I/O, security.
- Utility software examples (antivirus, backup, disk‑defragmenter).
- Program libraries (static vs. dynamic, DLLs).
- Compilers vs. interpreters – translation steps and trade‑offs.
- Integrated Development Environments (IDEs) – typical features.
13.6 Security & data integrity
- Definitions: security, privacy, integrity.
- Security measures: firewalls, anti‑virus, encryption (symmetric vs. asymmetric), authentication (passwords, biometrics), access control.
- Data‑validation techniques: range checks, format checks, checksums, parity bits.
13.7 Ethics & ownership
- Professional codes of conduct (e.g., BCS Code of Conduct).
- Intellectual‑property rights: copyright, patents, trademarks.
- Software licences: GPL, MIT, commercial licences – key differences.
- Emerging ethical issues: AI bias, data privacy, environmental impact of computing.
13.8 Databases
- File‑based vs. relational DBMS – advantages of DBMS.
- Entity‑Relationship (ER) diagram conventions.
- Normalization: 1NF, 2NF, 3NF with simple examples.
- SQL basics: DDL (CREATE, ALTER), DML (SELECT, INSERT, UPDATE, DELETE).
- Key concepts: primary key, foreign key, referential integrity.
13.9 Algorithm design & problem solving
- Abstraction, decomposition, and algorithmic thinking.
- Pseudocode conventions: indentation, control structures, comments.
- Flow‑chart symbols and their meanings.
- Complexity analysis: Big‑O notation, best/average/worst cases.
- Common algorithmic techniques: iteration, recursion, searching (linear, binary), sorting (bubble, selection, insertion, merge, quick).
13.10 File organisation & access – Summary
- Serial (sequential) – simple, good for batch processing.
- Random (direct) – fixed‑length records, known offsets.
- Hashing – fast key‑based lookup; choose appropriate hash function and collision‑resolution method.
13.11 Key exam tips (AO1–AO3)
- Memorise the three file‑access methods and when each is appropriate.
- Be able to write and trace the pseudocode for insertion/search in both open addressing and separate chaining.
- Know the formulas for load factor and average‑case probe counts.
- Practice converting keys using division, multiplication, and linear‑transformation methods.
- When answering design questions, justify your choice of hash function and collision‑resolution technique based on load factor, memory limits, and expected operations (insert/delete/search).