Cambridge Notes, Past Papers, Revision Questions

13.2 File Organisation & Access – Hashing Algorithms

1. File‑access methods (syllabus context)

The Cambridge AS & A‑Level syllabus expects you to know the three basic ways of accessing a file and to be able to decide when hashing is the most appropriate technique.

Serial (sequential) access – records are processed one after another from the start of the file. Used when the whole file must be read (e.g., printing a register).

Random (direct) access – a record can be retrieved by its address without reading preceding records. Typical for fixed‑length records stored in an array or on a disk block.

Hashing (a specialised form of random access) – the address is computed from a key using a hash function. Gives (average) constant‑time access, ideal for large, unordered data sets.

When to choose hashing

Consideration	Serial / Sequential	Random (fixed‑length)	Hashing
Fast lookup of a single record by key?	No – must scan whole file	Yes, if address is known	Yes – address is computed from the key
Large static data set?	Impractical	Possible but may waste space	Ideal – constant‑time average access
Range queries or ordered output required?	Yes	Yes (if ordered by key)	No – hashing is unordered
Very limited memory?	Low overhead	Low overhead	Open addressing uses low overhead; separate chaining needs extra pointers

2. What is hashing?

Hashing is a random‑access file organisation in which a hash function h converts a key k from a large universe U into an index (or bucket number) in the range [0, m‑1], where m is the number of slots in the hash file.

Key terminology (syllabus wording)

Key – the attribute used to locate a record (e.g., student number).

Bucket / slot – a position in the hash file; the syllabus calls this a “slot in a hash‑file”.

Load factor α – α = n / m where n is the number of stored records.

Collision – two different keys produce the same bucket.

Clustering – groups of occupied slots that increase the cost of probing (primary clustering for linear probing, secondary clustering for double hashing).

3. Families of hash functions

3.1 Division method

h(k) = k mod m

Simple and fast. Works well when m is a prime that does not share factors with typical key patterns.

3.2 Multiplication method

h(k) = ⌊ m × ((k × A) mod 1) ⌋

where 0 < A < 1 (commonly A ≈ (√5‑1)/2 ≈ 0.618). The fractional part of kA is multiplied by m and truncated.

3.3 Linear‑transformation (universal) method

h(k) = ((a·k + b) mod p) mod m

with p a prime > m, and constants a, b satisfying 0 < a < p and 0 ≤ b < p.

Worked examples

Method	Parameters	Key k	Result h(k)
Division	m = 101 (prime)	1 234 567	1 234 567 mod 101 = 84
Multiplication	m = 101, A = 0.618	1 234 567	k·A = 762 342.306 → frac = 0.306 → 0.306 × 101 ≈ 30.9 → h(k)=30
Linear‑transformation	p = 103, m = 101, a = 7, b = 3	1 234 567	(7·k + 3) mod 103 = 58 → 58 mod 101 = 58

4. Collision‑resolution strategies

4.1 Open addressing (all records stored in the primary array)

When a collision occurs the algorithm probes a sequence of alternative slots until an empty one is found.

Probing technique	Probe function h_i(k)	Typical use
Linear probing	`(h(k) + i) mod m`	Simple, good cache performance; avoid when α > 0.7.
Quadratic probing	`(h(k) + c₁i + c₂i²) mod m`	Reduces primary clustering; requires prime m and careful constants.
Double hashing	`(h₁(k) + i·h₂(k)) mod m`	Minimises both primary and secondary clustering; `h₂(k)` must never be 0.

Pseudocode – Insertion (open addressing)

function insertOpen(key, record):
i ← 0
while i < m:
slot ← probe(key, i)          // one of the three probe formulas
if table[slot] is empty:
table[slot] ← (key, record)
return slot
else if table[slot].key = key:
table[slot].record ← record   // update existing key
return slot
i ← i + 1
raise "Hash table full"

Pseudocode – Search (open addressing)

function searchOpen(key):
i ← 0
while i < m:
slot ← probe(key, i)
if table[slot] is empty:
return NOT_FOUND
else if table[slot].key = key:
return table[slot].record
i ← i + 1
return NOT_FOUND

4.2 Separate chaining (each bucket holds a secondary list)

Each slot contains a pointer to a linked list (or another dynamic structure) that stores all records hashing to that slot.

Pseudocode – Insertion (separate chaining)

function insertChain(key, record):
slot ← h(key)
node ← table[slot]                 // head of the list (may be null)
while node ≠ null:
if node.key = key:
node.record ← record      // update existing key
return
node ← node.next
// key not found – prepend new node
newNode ← Node(key, record, table[slot])
table[slot] ← newNode

Pseudocode – Search (separate chaining)

function searchChain(key):
slot ← h(key)
node ← table[slot]
while node ≠ null:
if node.key = key:
return node.record
node ← node.next
return NOT_FOUND

5. Comparison of collision‑resolution methods

Method	Space overhead	Average successful search	Average unsuccessful search	Best‑fit scenarios (syllabus guidance)
Linear probing	Low – single array of size m	`½ (1 + 1/(1‑α))`	`½ (1 + 1/(1‑α)²)`	Memory‑tight applications, α ≤ 0.7.
Quadratic probing	Low	Similar to linear probing but with less primary clustering.	Similar to linear probing.	When clustering must be limited but array size is fixed.
Double hashing	Low	≈ 1/(1‑α) (near‑optimal for open addressing).	≈ 1/(1‑α) (near‑optimal).	High‑performance needs, moderate load factor, good secondary hash.
Separate chaining	Higher – pointer for each record.	1 + α/2	1 + α	Dynamic data sets where α may exceed 1, or when deletions are frequent.

6. Performance considerations (AO2)

Load factor α – keep α < 0.75 for open addressing; chaining tolerates α > 1 but performance degrades linearly.

Uniform distribution – a good hash function spreads keys evenly; avoid patterns that cause many keys to share the same bucket.

Re‑hashing – when α exceeds a preset threshold, allocate a larger table (usually ≈ 2 m or the next prime) and recompute all slots using the same or a new hash function.

Clustering effects – primary clustering (linear probing) and secondary clustering (double hashing) increase the number of probes; choose probing method accordingly.

7. Advantages and disadvantages (AO1)

Aspect	Advantages	Disadvantages
Average‑case complexity	O(1) for search, insert, delete (when α is low).	Worst‑case O(n) if many collisions occur.
Memory usage	Open addressing uses only the primary array.	Separate chaining needs extra pointers; open addressing suffers from clustering.
Order of records	Not required – ideal for unordered data.	Cannot perform efficient range queries or ordered traversals.
Implementation complexity	Simple hash functions (division) are easy to code.	Good hash functions and probing sequences require careful design.

8. Decision‑making framework (choose the most appropriate file‑organisation method)

Is the problem a range‑query or does it need ordered output?
→ Use sequential/ordered file (e.g., indexed sequential access).

Do you know the exact address of each record (fixed‑length, known offset)?
→ Use random (direct) access.

Do you need fast lookup by an arbitrary key and the data set is large/unordered?
→ Use hashing. Then decide:
- If memory is very limited and the load factor will stay < 0.7 → open addressing (linear or double hashing).
- If the file will grow beyond m or deletions are frequent → separate chaining.
- If clustering is a concern → prefer double hashing or quadratic probing.

9. Sample examination questions

9.1 Division‑method hashing with separate chaining

Question: A hash file uses m = 13 buckets and h(k) = k mod 13. Insert the keys 27, 40, 13, 55, 68, 81. Show the bucket contents and calculate the load factor.

Answer outline:

Hash values:
- h(27) = 1, h(40) = 1, h(13) = 0, h(55) = 3, h(68) = 3, h(81) = 3.

Bucket contents (chaining):
- 0: 13
- 1: 27 → 40
- 3: 55 → 68 → 81
- All other buckets empty.

Load factor α = 6 / 13 ≈ 0.46.

9.2 Multiplication‑method hashing with open addressing (double hashing)

Question: A hash table has m = 11 slots. Primary hash h₁(k) = k mod 11, secondary hash h₂(k) = 1 + (k mod 10). Insert the keys 22, 31, 44, 55 using double hashing. Show the probe sequence for each insertion.

Answer outline:

Key 22: h₁=0, h₂=1+(22 mod 10)=3 → slots probed: 0 (empty) → insert at 0.

Key 31: h₁=9, h₂=1+(31 mod 10)=2 → slot 9 empty → insert at 9.

Key 44: h₁=0, h₂=1+(44 mod 10)=5 → probe 0 (occupied), then 5 (empty) → insert at 5.

Key 55: h₁=0, h₂=1+(55 mod 10)=6 → probe 0 (occupied), 6 (empty) → insert at 6.

13 – Other AS & A‑Level Topics (quick reference)

13.1 Information representation

Binary, octal, hexadecimal conversion; two’s complement for signed integers.

ASCII (7‑bit) and Unicode (UTF‑8/UTF‑16) character sets.

Bitmap vs. vector graphics – storage implications.

Sound sampling: sample rate, bit depth, lossless (e.g., PCM) vs. lossy (e.g., MP3) compression.

Simple data‑compression techniques: Run‑Length Encoding, Huffman coding, JPEG basics.

13.2 Communication

Network devices: NIC, hub, switch, router, gateway.

LAN vs. WAN characteristics; client‑server vs. peer‑to‑peer models.

Topologies: bus, star, ring, mesh.

Ethernet & CSMA‑CD basics; TCP/IP stack layers.

IP addressing (IPv4/IPv6), subnet masks, public vs. private addresses, DHCP.

DNS, URL structure, basics of cloud computing.

13.3 Hardware

CPU registers (PC, IR, ACC, MAR, MDR) and their roles.

Memory types: RAM (SRAM, DRAM), ROM (PROM, EPROM, EEPROM, Flash).

Buffers and caches – purpose and typical sizes.

Logic gates, truth tables, combinational circuits (adders, multiplexers).

Simple embedded‑system examples (e.g., microcontroller‑controlled traffic light).

13.4 Processor fundamentals

Von Neumann architecture diagram.

Fetch‑decode‑execute cycle; register‑transfer notation.

Interrupt handling – hardware vs. software interrupts.

Basic assembly‑language instructions: data movement, arithmetic, logical, branch, and I/O.

Addressing modes: immediate, direct, indirect, indexed.

Bit‑wise operations: shift, rotate, mask.

13.5 System software

Operating‑system services: process management, memory management, file management, I/O, security.

Utility software examples (antivirus, backup, disk‑defragmenter).

Program libraries (static vs. dynamic, DLLs).

Compilers vs. interpreters – translation steps and trade‑offs.

Integrated Development Environments (IDEs) – typical features.

13.6 Security & data integrity

Definitions: security, privacy, integrity.

Security measures: firewalls, anti‑virus, encryption (symmetric vs. asymmetric), authentication (passwords, biometrics), access control.

Data‑validation techniques: range checks, format checks, checksums, parity bits.

13.7 Ethics & ownership

Professional codes of conduct (e.g., BCS Code of Conduct).

Intellectual‑property rights: copyright, patents, trademarks.

Software licences: GPL, MIT, commercial licences – key differences.

Emerging ethical issues: AI bias, data privacy, environmental impact of computing.

13.8 Databases

File‑based vs. relational DBMS – advantages of DBMS.

Entity‑Relationship (ER) diagram conventions.

Normalization: 1NF, 2NF, 3NF with simple examples.

SQL basics: DDL (CREATE, ALTER), DML (SELECT, INSERT, UPDATE, DELETE).

Key concepts: primary key, foreign key, referential integrity.

13.9 Algorithm design & problem solving

Abstraction, decomposition, and algorithmic thinking.

Pseudocode conventions: indentation, control structures, comments.

Flow‑chart symbols and their meanings.

Complexity analysis: Big‑O notation, best/average/worst cases.

Common algorithmic techniques: iteration, recursion, searching (linear, binary), sorting (bubble, selection, insertion, merge, quick).

13.10 File organisation & access – Summary

Serial (sequential) – simple, good for batch processing.

Random (direct) – fixed‑length records, known offsets.

Hashing – fast key‑based lookup; choose appropriate hash function and collision‑resolution method.

13.11 Key exam tips (AO1–AO3)

Memorise the three file‑access methods and when each is appropriate.

Be able to write and trace the pseudocode for insertion/search in both open addressing and separate chaining.

Know the formulas for load factor and average‑case probe counts.

Practice converting keys using division, multiplication, and linear‑transformation methods.

When answering design questions, justify your choice of hash function and collision‑resolution technique based on load factor, memory limits, and expected operations (insert/delete/search).

Show understanding of hashing algorithms

13.2 File Organisation & Access – Hashing Algorithms