Show understanding of hashing algorithms

13.2 File Organisation & Access – Hashing Algorithms

1. File‑access methods (syllabus context)

The Cambridge AS & A‑Level syllabus expects you to know the three basic ways of accessing a file and to be able to decide when hashing is the most appropriate technique.

  • Serial (sequential) access – records are processed one after another from the start of the file. Used when the whole file must be read (e.g., printing a register).
  • Random (direct) access – a record can be retrieved by its address without reading preceding records. Typical for fixed‑length records stored in an array or on a disk block.
  • Hashing (a specialised form of random access) – the address is computed from a key using a hash function. Gives (average) constant‑time access, ideal for large, unordered data sets.

When to choose hashing

ConsiderationSerial / SequentialRandom (fixed‑length)Hashing
Fast lookup of a single record by key?No – must scan whole fileYes, if address is knownYes – address is computed from the key
Large static data set?ImpracticalPossible but may waste spaceIdeal – constant‑time average access
Range queries or ordered output required?YesYes (if ordered by key)No – hashing is unordered
Very limited memory?Low overheadLow overheadOpen addressing uses low overhead; separate chaining needs extra pointers

2. What is hashing?

Hashing is a random‑access file organisation in which a hash function h converts a key k from a large universe U into an index (or bucket number) in the range [0, m‑1], where m is the number of slots in the hash file.

Key terminology (syllabus wording)

  • Key – the attribute used to locate a record (e.g., student number).
  • Bucket / slot – a position in the hash file; the syllabus calls this a “slot in a hash‑file”.
  • Load factor αα = n / m where n is the number of stored records.
  • Collision – two different keys produce the same bucket.
  • Clustering – groups of occupied slots that increase the cost of probing (primary clustering for linear probing, secondary clustering for double hashing).

3. Families of hash functions

3.1 Division method

h(k) = k mod m

Simple and fast. Works well when m is a prime that does not share factors with typical key patterns.

3.2 Multiplication method

h(k) = ⌊ m × ((k × A) mod 1) ⌋

where 0 < A < 1 (commonly A ≈ (√5‑1)/2 ≈ 0.618). The fractional part of kA is multiplied by m and truncated.

3.3 Linear‑transformation (universal) method

h(k) = ((a·k + b) mod p) mod m

with p a prime > m, and constants a, b satisfying 0 < a < p and 0 ≤ b < p.

Worked examples

MethodParametersKey kResult h(k)
Divisionm = 101 (prime)1 234 5671 234 567 mod 101 = 84
Multiplicationm = 101, A = 0.6181 234 567k·A = 762 342.306 → frac = 0.306 → 0.306 × 101 ≈ 30.9 → h(k)=30
Linear‑transformationp = 103, m = 101, a = 7, b = 31 234 567(7·k + 3) mod 103 = 58 → 58 mod 101 = 58

4. Collision‑resolution strategies

4.1 Open addressing (all records stored in the primary array)

When a collision occurs the algorithm probes a sequence of alternative slots until an empty one is found.

Probing techniqueProbe function hi(k)Typical use
Linear probing(h(k) + i) mod mSimple, good cache performance; avoid when α > 0.7.
Quadratic probing(h(k) + c₁i + c₂i²) mod mReduces primary clustering; requires prime m and careful constants.
Double hashing(h₁(k) + i·h₂(k)) mod mMinimises both primary and secondary clustering; h₂(k) must never be 0.

Pseudocode – Insertion (open addressing)

function insertOpen(key, record):

i ← 0

while i < m:

slot ← probe(key, i) // one of the three probe formulas

if table[slot] is empty:

table[slot] ← (key, record)

return slot

else if table[slot].key = key:

table[slot].record ← record // update existing key

return slot

i ← i + 1

raise "Hash table full"

Pseudocode – Search (open addressing)

function searchOpen(key):

i ← 0

while i < m:

slot ← probe(key, i)

if table[slot] is empty:

return NOT_FOUND

else if table[slot].key = key:

return table[slot].record

i ← i + 1

return NOT_FOUND

4.2 Separate chaining (each bucket holds a secondary list)

Each slot contains a pointer to a linked list (or another dynamic structure) that stores all records hashing to that slot.

Pseudocode – Insertion (separate chaining)

function insertChain(key, record):

slot ← h(key)

node ← table[slot] // head of the list (may be null)

while node ≠ null:

if node.key = key:

node.record ← record // update existing key

return

node ← node.next

// key not found – prepend new node

newNode ← Node(key, record, table[slot])

table[slot] ← newNode

Pseudocode – Search (separate chaining)

function searchChain(key):

slot ← h(key)

node ← table[slot]

while node ≠ null:

if node.key = key:

return node.record

node ← node.next

return NOT_FOUND

5. Comparison of collision‑resolution methods

MethodSpace overheadAverage successful searchAverage unsuccessful searchBest‑fit scenarios (syllabus guidance)
Linear probingLow – single array of size m½ (1 + 1/(1‑α))½ (1 + 1/(1‑α)²)Memory‑tight applications, α ≤ 0.7.
Quadratic probingLowSimilar to linear probing but with less primary clustering.Similar to linear probing.When clustering must be limited but array size is fixed.
Double hashingLow≈ 1/(1‑α) (near‑optimal for open addressing).≈ 1/(1‑α) (near‑optimal).High‑performance needs, moderate load factor, good secondary hash.
Separate chainingHigher – pointer for each record.1 + α/21 + αDynamic data sets where α may exceed 1, or when deletions are frequent.

6. Performance considerations (AO2)

  1. Load factor α – keep α < 0.75 for open addressing; chaining tolerates α > 1 but performance degrades linearly.
  2. Uniform distribution – a good hash function spreads keys evenly; avoid patterns that cause many keys to share the same bucket.
  3. Re‑hashing – when α exceeds a preset threshold, allocate a larger table (usually ≈ 2 m or the next prime) and recompute all slots using the same or a new hash function.
  4. Clustering effects – primary clustering (linear probing) and secondary clustering (double hashing) increase the number of probes; choose probing method accordingly.

7. Advantages and disadvantages (AO1)

AspectAdvantagesDisadvantages
Average‑case complexityO(1) for search, insert, delete (when α is low).Worst‑case O(n) if many collisions occur.
Memory usageOpen addressing uses only the primary array.Separate chaining needs extra pointers; open addressing suffers from clustering.
Order of recordsNot required – ideal for unordered data.Cannot perform efficient range queries or ordered traversals.
Implementation complexitySimple hash functions (division) are easy to code.Good hash functions and probing sequences require careful design.

8. Decision‑making framework (choose the most appropriate file‑organisation method)

  1. Is the problem a range‑query or does it need ordered output?
    → Use sequential/ordered file (e.g., indexed sequential access).
  2. Do you know the exact address of each record (fixed‑length, known offset)?
    → Use random (direct) access.
  3. Do you need fast lookup by an arbitrary key and the data set is large/unordered?
    → Use hashing. Then decide:

    • If memory is very limited and the load factor will stay < 0.7 → open addressing (linear or double hashing).
    • If the file will grow beyond m or deletions are frequent → separate chaining.
    • If clustering is a concern → prefer double hashing or quadratic probing.

9. Sample examination questions

9.1 Division‑method hashing with separate chaining

Question: A hash file uses m = 13 buckets and h(k) = k mod 13. Insert the keys 27, 40, 13, 55, 68, 81. Show the bucket contents and calculate the load factor.

Answer outline:

  1. Hash values:

    • h(27) = 1, h(40) = 1, h(13) = 0, h(55) = 3, h(68) = 3, h(81) = 3.

  2. Bucket contents (chaining):

    • 0: 13
    • 1: 27 → 40
    • 3: 55 → 68 → 81
    • All other buckets empty.

  3. Load factor α = 6 / 13 ≈ 0.46.

9.2 Multiplication‑method hashing with open addressing (double hashing)

Question: A hash table has m = 11 slots. Primary hash h₁(k) = k mod 11, secondary hash h₂(k) = 1 + (k mod 10). Insert the keys 22, 31, 44, 55 using double hashing. Show the probe sequence for each insertion.

Answer outline:

  1. Key 22: h₁=0, h₂=1+(22 mod 10)=3 → slots probed: 0 (empty) → insert at 0.
  2. Key 31: h₁=9, h₂=1+(31 mod 10)=2 → slot 9 empty → insert at 9.
  3. Key 44: h₁=0, h₂=1+(44 mod 10)=5 → probe 0 (occupied), then 5 (empty) → insert at 5.
  4. Key 55: h₁=0, h₂=1+(55 mod 10)=6 → probe 0 (occupied), 6 (empty) → insert at 6.

13 – Other AS & A‑Level Topics (quick reference)

13.1 Information representation

  • Binary, octal, hexadecimal conversion; two’s complement for signed integers.
  • ASCII (7‑bit) and Unicode (UTF‑8/UTF‑16) character sets.
  • Bitmap vs. vector graphics – storage implications.
  • Sound sampling: sample rate, bit depth, lossless (e.g., PCM) vs. lossy (e.g., MP3) compression.
  • Simple data‑compression techniques: Run‑Length Encoding, Huffman coding, JPEG basics.

13.2 Communication

  • Network devices: NIC, hub, switch, router, gateway.
  • LAN vs. WAN characteristics; client‑server vs. peer‑to‑peer models.
  • Topologies: bus, star, ring, mesh.
  • Ethernet & CSMA‑CD basics; TCP/IP stack layers.
  • IP addressing (IPv4/IPv6), subnet masks, public vs. private addresses, DHCP.
  • DNS, URL structure, basics of cloud computing.

13.3 Hardware

  • CPU registers (PC, IR, ACC, MAR, MDR) and their roles.
  • Memory types: RAM (SRAM, DRAM), ROM (PROM, EPROM, EEPROM, Flash).
  • Buffers and caches – purpose and typical sizes.
  • Logic gates, truth tables, combinational circuits (adders, multiplexers).
  • Simple embedded‑system examples (e.g., microcontroller‑controlled traffic light).

13.4 Processor fundamentals

  • Von Neumann architecture diagram.
  • Fetch‑decode‑execute cycle; register‑transfer notation.
  • Interrupt handling – hardware vs. software interrupts.
  • Basic assembly‑language instructions: data movement, arithmetic, logical, branch, and I/O.
  • Addressing modes: immediate, direct, indirect, indexed.
  • Bit‑wise operations: shift, rotate, mask.

13.5 System software

  • Operating‑system services: process management, memory management, file management, I/O, security.
  • Utility software examples (antivirus, backup, disk‑defragmenter).
  • Program libraries (static vs. dynamic, DLLs).
  • Compilers vs. interpreters – translation steps and trade‑offs.
  • Integrated Development Environments (IDEs) – typical features.

13.6 Security & data integrity

  • Definitions: security, privacy, integrity.
  • Security measures: firewalls, anti‑virus, encryption (symmetric vs. asymmetric), authentication (passwords, biometrics), access control.
  • Data‑validation techniques: range checks, format checks, checksums, parity bits.

13.7 Ethics & ownership

  • Professional codes of conduct (e.g., BCS Code of Conduct).
  • Intellectual‑property rights: copyright, patents, trademarks.
  • Software licences: GPL, MIT, commercial licences – key differences.
  • Emerging ethical issues: AI bias, data privacy, environmental impact of computing.

13.8 Databases

  • File‑based vs. relational DBMS – advantages of DBMS.
  • Entity‑Relationship (ER) diagram conventions.
  • Normalization: 1NF, 2NF, 3NF with simple examples.
  • SQL basics: DDL (CREATE, ALTER), DML (SELECT, INSERT, UPDATE, DELETE).
  • Key concepts: primary key, foreign key, referential integrity.

13.9 Algorithm design & problem solving

  • Abstraction, decomposition, and algorithmic thinking.
  • Pseudocode conventions: indentation, control structures, comments.
  • Flow‑chart symbols and their meanings.
  • Complexity analysis: Big‑O notation, best/average/worst cases.
  • Common algorithmic techniques: iteration, recursion, searching (linear, binary), sorting (bubble, selection, insertion, merge, quick).

13.10 File organisation & access – Summary

  • Serial (sequential) – simple, good for batch processing.
  • Random (direct) – fixed‑length records, known offsets.
  • Hashing – fast key‑based lookup; choose appropriate hash function and collision‑resolution method.

13.11 Key exam tips (AO1–AO3)

  • Memorise the three file‑access methods and when each is appropriate.
  • Be able to write and trace the pseudocode for insertion/search in both open addressing and separate chaining.
  • Know the formulas for load factor and average‑case probe counts.
  • Practice converting keys using division, multiplication, and linear‑transformation methods.
  • When answering design questions, justify your choice of hash function and collision‑resolution technique based on load factor, memory limits, and expected operations (insert/delete/search).