Show understanding of hashing algorithms

Published by Patrick Mutisya · 8 days ago

Cambridge A-Level Computer Science 9618 – File Organisation and Access: Hashing Algorithms

13.2 File Organisation and Access – Hashing Algorithms

1. What is Hashing?

Hashing is a technique for locating a record directly by applying a hash function to a key value. The result of the function is an address (or bucket number) in a file, allowing constant‑time average access.

2. The Hash Function

A hash function \$h\$ maps a key \$k\$ from a large universe \$U\$ to an index in a smaller range \$[0, m-1]\$, where \$m\$ is the number of buckets (or slots) in the file.

Typical forms include:

  • Division method: \$h(k) = k \bmod m\$
  • Multiplication method: \$h(k) = \left\lfloor m \times \left( (k \times A) \bmod 1 \right) \right\rfloor\$ where \$0 < A < 1\$.
  • Linear transformation: \$h(k) = ((a k + b) \bmod p) \bmod m\$ with \$p\$ a prime larger than \$m\$, and \$a,b\$ constants.

3. Collision Resolution Strategies

Because \$m\$ is usually much smaller than \$|U|\$, different keys can produce the same hash value. Two main families of techniques handle these collisions.

3.1 Open Addressing

All records are stored directly in the hash table. When a collision occurs, the algorithm probes other slots according to a systematic sequence until an empty slot is found.

  • Linear probing: \$h_i(k) = (h(k) + i) \bmod m\$
  • Quadratic probing: \$hi(k) = (h(k) + c1 i + c_2 i^2) \bmod m\$
  • Double hashing: \$hi(k) = (h1(k) + i \, h_2(k)) \bmod m\$

3.2 Separate Chaining

Each bucket contains a linked list (or another secondary structure) of all records that hash to the same index.

4. Comparison of Collision Resolution Methods

MethodSpace OverheadAverage Search Cost (successful)Average Search Cost (unsuccessful)Typical Use Cases
Linear probingLow (single array)\$\frac{1}{2}\left(1+\frac{1}{1-\alpha}\right)\$\$\frac{1}{2}\left(1+\frac{1}{(1-\alpha)^2}\right)\$When memory is tight and load factor \$\alpha < 0.7\$
Quadratic probingLowSimilar to linear probing, but reduces primary clusteringSimilar to linear probingWhen clustering must be limited but array size is fixed
Double hashingLowClose to optimal for open addressingClose to optimal for open addressingHigh performance requirements with moderate load factor
Separate chainingHigher (pointers or secondary structures)\$1 + \frac{\alpha}{2}\$\$1 + \alpha\$When dynamic growth is needed or load factor may exceed 1

5. Example: Hashing Student Records

Suppose we store student records keyed by a 7‑digit student number. We choose \$m = 101\$ (a prime) and use the division method:

\$h(k) = k \bmod 101\$

For the key \$k = 1234567\$:

\$h(1234567) = 1234567 \bmod 101 = 1234567 - 101 \times \left\lfloor\frac{1234567}{101}\right\rfloor = 1234567 - 101 \times 12223 = 84\$

The record is placed in bucket 84. If another student has key \$k = 9876543\$ and also hashes to 84, a collision occurs and is resolved using the chosen strategy (e.g., separate chaining).

6. Performance Considerations

  1. Load factor \$\alpha\$: \$\alpha = \frac{n}{m}\$ where \$n\$ is the number of stored records. Keeping \$\alpha\$ low (typically \$<0.75\$) maintains near‑constant access time for open addressing.
  2. Uniform distribution: A good hash function spreads keys evenly across buckets, minimising clustering.
  3. Cost of rehashing: When \$\alpha\$ exceeds a threshold, the table is often rebuilt with a larger \$m\$ and a new hash function.

7. Advantages and Disadvantages of Hashing

  • Advantages

    • Average‑case \$O(1)\$ search, insert and delete.
    • Simple implementation for direct access.
    • Effective for large, static data sets.

  • Disadvantages

    • Requires a good hash function; poor choice leads to many collisions.
    • Not ordered – cannot perform range queries efficiently.
    • Open addressing suffers from clustering; separate chaining uses extra memory for pointers.

8. Sample Examination Question

Question: A file uses separate chaining with \$m = 13\$ buckets. The hash function is \$h(k) = k \bmod 13\$. Insert the keys 27, 40, 13, 55, 68, and 81. Show the resulting bucket contents and calculate the load factor.

Answer outline:

  1. Compute each hash value:

    • \$h(27)=1\$
    • \$h(40)=1\$
    • \$h(13)=0\$
    • \$h(55)=3\$
    • \$h(68)=3\$
    • \$h(81)=3\$

  2. Bucket contents:

    • 0: 13
    • 1: 27 → 40
    • 3: 55 → 68 → 81
    • All other buckets empty.

  3. Load factor \$\alpha = \frac{n}{m} = \frac{6}{13} \approx 0.46\$.

Suggested diagram: A hash table with 13 buckets illustrating the chaining for the keys above.