1.3 Compression

Learning objective

Show understanding of lossy and lossless compression and justify the choice of a method for a given situation.

Why compress data?

Storage limits – devices have finite capacity (e.g., a 64 GB SD card holds many more photos when they are compressed).

Bandwidth constraints – network links can transmit only a limited number of bits per second.

Cost of transmission – many services charge per megabyte of data sent.

Battery life – moving fewer bits consumes less power on mobile devices.

Quantitative example: a 100 MB uncompressed video (≈ 800 Mbit) sent over a 2 Mbps link needs 800 ÷ 2 = 400 s, about 7 minutes. If H.264 encoding reduces the file to 40 MB (≈ 320 Mbit, an illustrative figure), the same link delivers it in 160 s (≈ 2.7 minutes), saving both time and data.
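
The transfer-time arithmetic can be checked with a short Python sketch. The function name is invented for illustration, decimal units are assumed (1 MB = 8 Mbit), and the 40 MB compressed size is a hypothetical figure rather than a measured one.

```python
def transfer_seconds(size_megabytes, link_mbps):
    """Time to send a file over a link, using decimal units (1 MB = 8 Mbit)."""
    size_megabits = size_megabytes * 8
    return size_megabits / link_mbps

uncompressed = transfer_seconds(100, 2)  # 100 MB over a 2 Mbps link
compressed = transfer_seconds(40, 2)     # hypothetical 40 MB compressed copy

print(f"uncompressed: {uncompressed:.0f} s ({uncompressed / 60:.1f} min)")
print(f"compressed:   {compressed:.0f} s ({compressed / 60:.1f} min)")
```

Halving the number of bits halves the transfer time on the same link, which is why compression helps with both bandwidth and cost.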

What is compression?

Compression reduces the number of bits required to represent information by exploiting redundancy or the limits of human perception. The result is a smaller compressed representation that can be stored or transmitted more efficiently.

Types of compression

Lossless compression – the original data can be reconstructed exactly.

Lossy compression – some information is permanently discarded; the reconstructed data is an approximation.
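
The difference between the two types can be seen in a small Python sketch. It is only an illustration: the sample values are made up, zlib stands in for "any lossless method", and dropping the low 4 bits of each sample stands in for "any lossy method".

```python
import zlib

# Made-up 8-bit sample values (e.g., pixel brightness or audio amplitude).
samples = bytes([12, 13, 200, 201, 57, 58, 59, 130])

# Lossless: decompressing returns exactly the original bytes.
restored = zlib.decompress(zlib.compress(samples))

# Lossy (illustrative): keep only the top 4 bits of each sample, then
# rebuild each value as the midpoint of its 16-wide bucket.
STEP = 16
quantised = [s // STEP for s in samples]
approximate = bytes(q * STEP + STEP // 2 for q in quantised)

worst_error = max(abs(a - s) for a, s in zip(approximate, samples))
print("lossless copy identical:", restored == samples)
print("lossy copy identical:   ", approximate == samples)
print("worst-case lossy error: ", worst_error)
```

The lossy version needs half as many bits per sample, but the discarded low bits cannot be recovered from the quantised values.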

Lossless compression

Used when exact fidelity is essential (e.g., source code, legal documents, medical images, archival audio).

Common techniques

Run‑Length Encoding (RLE) – replaces each run of consecutive identical symbols with a count followed by the symbol.
Example: AAAAABBBCCDAA → 5A3B2C1D2A.

Huffman coding – variable‑length prefix codes based on symbol frequencies.
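Entropy formula: \(H = -\sum_{i=1}^{n} p_i \log_2 p_i\), where \(p_i\) is the probability of symbol \(i\); \(H\) is the lower bound on the average number of bits per symbol.
A Huffman tree gives the shortest average code length achievable by a prefix code that assigns a whole number of bits to each symbol; that average \(L\) satisfies \(H \le L < H + 1\).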

Lempel‑Ziv‑Welch (LZW) – builds a dictionary of repeated substrings during encoding.

DEFLATE (ZIP/GZIP) – combines LZ77 sliding‑window compression with Huffman coding.
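
The first three techniques in this list can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production codec: the function names are invented here, the RLE output uses the count-then-symbol form of the example above (so it is ambiguous if the input itself contains digits), the Huffman function returns only code lengths rather than bit patterns, and the LZW encoder assumes 8-bit characters and an unbounded dictionary.

```python
import heapq
from collections import Counter
from itertools import groupby
from math import log2

def rle_encode(text):
    """Run-length encode: each run becomes <count><symbol>."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))

def entropy(text):
    """Shannon entropy of the symbol frequencies, in bits per symbol."""
    total = len(text)
    return -sum((n / total) * log2(n / total) for n in Counter(text).values())

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    counts = Counter(text)
    if len(counts) == 1:                      # a lone symbol still needs 1 bit
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in subtree}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)       # two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def lzw_encode(text):
    """LZW: emit dictionary indices, adding new substrings as they appear."""
    dictionary = {chr(i): i for i in range(256)}
    current, output = "", []
    for ch in text:
        if current + ch in dictionary:
            current += ch                     # keep extending the match
        else:
            output.append(dictionary[current])
            dictionary[current + ch] = len(dictionary)
            current = ch
    if current:
        output.append(dictionary[current])
    return output

sample = "AAAAABBBCCDAA"
print(rle_encode(sample))                     # 5A3B2C1D2A
lengths = huffman_code_lengths(sample)
counts = Counter(sample)
average = sum(counts[s] * lengths[s] for s in counts) / len(sample)
print(f"entropy {entropy(sample):.3f}, Huffman average {average:.3f} bits/symbol")
print(lzw_encode("ABABABA"))                  # [65, 66, 256, 258]
```

In the LZW output, codes 256 and above refer to substrings ("AB", "ABA") that the encoder added to its dictionary while reading the input; a decoder rebuilds the same dictionary, so it never has to be transmitted.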

Media‑type mapping (lossless)

Technique	Typical media / use‑case
RLE	Simple bitmap images, fax transmission, monochrome icons
Huffman	DEFLATE streams (and therefore PNG and ZIP), the entropy‑coding stage of baseline JPEG
LZW	GIF images, early Unix compress, PDF internal streams
DEFLATE (ZIP/GZIP)	Text files, source code archives, generic file bundles (ZIP, GZIP)

Text compression example

Most text files are archived with ZIP/DEFLATE. Repeated words and phrases are replaced by short back‑references to an earlier occurrence, and the result is Huffman‑coded, giving typical ratios of 2 : 1 to 3 : 1.
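
A quick way to see this is Python's standard zlib module, which wraps a DEFLATE stream. The input below is deliberately repetitive sample text invented for the demonstration, so it compresses far better than ordinary prose would; the ratio it prints should not be read as typical.

```python
import zlib

# Illustrative input: one sentence repeated many times.
text = ("the quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")

compressed = zlib.compress(text, 9)      # 9 = highest compression level
restored = zlib.decompress(compressed)

ratio = len(text) / len(compressed)
print(f"{len(text)} bytes -> {len(compressed)} bytes (ratio {ratio:.1f} : 1)")
print("round trip exact:", restored == text)
```

Running the same code on a real document gives a more realistic ratio, and the round trip is always exact because DEFLATE is lossless.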

Vector‑graphic compression

Vector graphics (e.g., SVG) already describe images mathematically, so they are inherently lossless. Compression focuses on reducing file‑size overhead:

Remove unnecessary whitespace, comments, and metadata.

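Shorten values where the format allows it (e.g., #ffffff → #fff, shorter id names). Attribute names such as stroke-width are fixed by the SVG specification and cannot be abbreviated.

Apply a generic lossless compressor: the .svgz format is an SVG file compressed with gzip, which uses DEFLATE.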

Concrete example:

Original example.svg (plain XML) = 45 KB.

Run svgo to strip whitespace and unused definitions → 38 KB.

Compress with DEFLATE → example.svgz = 12 KB (≈ 73 % reduction).

Lossy techniques are rarely used because the visual fidelity of a vector image is defined by its mathematical description, not by pixel data.
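
The minify-then-compress workflow for SVG can be imitated with the Python standard library. This is a rough sketch: the SVG content is generated here purely for illustration, the regular-expression minify function is a crude stand-in for a real optimiser such as svgo, and reduction_percent is a hypothetical helper name.

```python
import gzip
import re

# Hypothetical SVG source: fifty similar circles, plus a comment and indentation.
circles = "\n".join(
    f'    <circle cx="{10 * i}" cy="50" r="4" stroke="#000000" fill="none" />'
    for i in range(1, 51)
)
svg = f"""<?xml version="1.0"?>
<!-- a comment an optimiser would strip -->
<svg xmlns="http://www.w3.org/2000/svg" width="520" height="100">
{circles}
</svg>
"""

def minify(source):
    """Drop comments and collapse whitespace (crude stand-in for an optimiser)."""
    no_comments = re.sub(r"<!--.*?-->", "", source, flags=re.DOTALL)
    collapsed = re.sub(r"\s+", " ", no_comments)
    return re.sub(r">\s+<", "><", collapsed).strip()

def reduction_percent(before, after):
    """Percentage by which the size has shrunk."""
    return 100 * (before - after) / before

minified = minify(svg)
svgz_bytes = gzip.compress(minified.encode("utf-8"))  # same container as .svgz

print(len(svg), "->", len(minified), "->", len(svgz_bytes), "bytes")
print(f"45 KB -> 12 KB is a {reduction_percent(45, 12):.0f} % reduction")
```

Decompressing svgz_bytes returns the minified markup byte for byte, so the drawing itself is unchanged.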

Lossy compression

Used when a perfect replica is not required and higher compression ratios are desirable (e.g., photographs, audio, video streaming).

Typical steps – JPEG (image)

Convert RGB to YCbCr (separates luminance from chrominance).

Divide the image into 8 × 8 blocks and apply the Discrete Cosine Transform (DCT).

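Quantise the DCT coefficients: each coefficient is divided by the matching entry of a quantisation matrix and rounded, so most high‑frequency coefficients become zero. This is the step where information is lost.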

Encode the remaining coefficients with run‑length and Huffman coding.
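
Steps 1 to 3 above can be sketched in plain Python. Several things here are assumptions made for illustration: the colour conversion uses the common JFIF-style coefficients, the quantisation table is a simple made-up ramp rather than a table from the JPEG standard, the 8 × 8 block is a synthetic smooth gradient, and the final run-length and Huffman stage is left out.

```python
from math import cos, pi, sqrt

N = 8

def rgb_to_ycbcr(r, g, b):
    """Step 1: JFIF-style conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def dct_2d(block):
    """Step 2: 2-D DCT-II of an 8x8 block (direct, unoptimised form)."""
    def alpha(k):
        return sqrt(1 / N) if k == 0 else sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * cos((2 * x + 1) * u * pi / (2 * N))
                              * cos((2 * y + 1) * v * pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out

def quantise(coeffs, table):
    """Step 3: divide by the table entry and round; this is where data is lost."""
    return [[round(coeffs[u][v] / table[u][v]) for v in range(N)]
            for u in range(N)]

# Illustrative table: coarser steps for higher frequencies (not the standard one).
TABLE = [[16 + 4 * (u + v) for v in range(N)] for u in range(N)]

# Synthetic smooth luminance block, shifted by 128 to centre values on zero.
block = [[(100 + 2 * x + y) - 128 for y in range(N)] for x in range(N)]

quantised = quantise(dct_2d(block), TABLE)
zeros = sum(row.count(0) for row in quantised)
print(f"{zeros} of 64 coefficients are zero after quantisation")
```

Because almost every coefficient of a smooth block quantises to zero, the run-length and Huffman stage that follows has very little left to encode.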

Audio compression examples

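MP3 – perceptual coding uses a psychoacoustic model to discard sounds that listeners are unlikely to notice, such as quiet tones masked by louder ones and frequencies beyond the range of human hearing.

FLAC – a lossless audio format, listed here for contrast; useful for archival copies where no quality loss is acceptable.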

Video compression examples

H.264 / MPEG‑4 AVC – intra‑frame prediction with a DCT‑like integer transform, inter‑frame motion compensation, and entropy coding.

HEVC (H.265) – roughly half the bit rate of H.264 at comparable visual quality.

Media‑type mapping (lossy, with lossless alternatives for contrast)

Technique	Typical media / use‑case
JPEG	Photographs, web images
PNG (lossless, for contrast)	Line art, screenshots, images requiring transparency
MP3 / AAC	Music streaming, podcasts
FLAC (lossless, for contrast)	Archival audio, high‑resolution music collections
H.264 / H.265	Online video, Blu‑ray, video conferencing

Comparison of lossless and lossy methods

Aspect	Lossless	Lossy
Data integrity	Exact reconstruction	Approximate reconstruction
Typical applications	Text, source code, archives, medical imaging, archival audio (FLAC)	Photographs, streaming audio/video, web images (JPEG, MP3, H.264)
Common algorithms	RLE, Huffman, LZW, DEFLATE, PNG, JPEG‑2000 (lossless mode)	JPEG, MP3, AAC, H.264, HEVC
Typical compression ratio	2 : 1 – 3 : 1 (up to ~5 : 1 for highly redundant data)	10 : 1 – 100 : 1 or higher
Impact on quality	No loss of quality	Quality degrades as compression increases; artefacts may become visible.

Decision‑matrix checklist (exam‑friendly)

When asked to justify a compression method, tick the criteria that apply and then choose the algorithm that best satisfies them.

Criterion	Points to lossless?	Points to lossy?
Exact data fidelity required (e.g., legal text, medical diagnosis)	✓	✗
Very limited storage or bandwidth	✗ (moderate ratios only)	✓ (higher ratios)
Processing power limited (e.g., embedded device)	✓ (simple RLE, Huffman)	✗ (complex transforms may be too heavy without hardware support)
Human perception can hide artefacts (photos, audio, video)	✗	✓

Case studies – justification in practice

Case study 1: Archiving legal documents

Data type: plain text.

Key criteria: exact fidelity, moderate storage saving, low processing overhead.

Chosen method: ZIP (DEFLATE) – lossless, 2 : 1 – 3 : 1 ratio, fast encode/decode.

Case study 2: Streaming a live sports event

Data type: high‑definition video.

Key criteria: very limited bandwidth, low latency, viewers accept minor artefacts.

Chosen method: H.264 (MPEG‑4 AVC) – lossy, 20 : 1 – 50 : 1 ratio, efficient hardware decoding.

Case study 3: Storing MRI scans for diagnosis

Data type: grayscale medical images.

Key criteria: diagnostic accuracy (pixel‑perfect), moderate storage reduction.

Chosen method: JPEG‑2000 in lossless mode – lossless, up to 3 : 1 ratio, supports very large images.

Key points to remember

Lossless = no data loss; essential for text, code, critical images, archival audio (FLAC).

Lossy = some data is permanently discarded; suitable for media where human perception can tolerate approximations.

Match the method to the purpose (fidelity vs. size), the environment (bandwidth, storage, processing), and the acceptable quality level.

Use the decision‑matrix checklist in exams to structure a clear justification.

Exam‑style prompt (for practice)

“Explain why lossless compression is required for archiving legal documents and suggest a suitable algorithm. Justify your choice with reference to data integrity, typical compression ratio, and processing requirements.”
