Cambridge A-Level Computer Science 9618 – 1.1 Data Representation
1.1 Data Representation – Character Data
Learning Objective
Show understanding of, and be able to represent, character data in its internal binary form, depending on the character set used.
Why Character Representation Matters
Computers store all information as binary digits (bits). To manipulate text, a mapping between characters and binary patterns is required. This mapping is defined by a character set (or code set) and an encoding scheme.
Key Terminology
Character set – The collection of symbols that can be represented (e.g., letters, digits, punctuation).
Code point – The numerical value assigned to a character in a character set.
Encoding – The method used to translate code points into binary patterns for storage or transmission.
ASCII – A 7‑bit character set originally designed for English.
Unicode – A universal character set that can represent over a million different symbols.
UTF‑8, UTF‑16, UTF‑32 – Common Unicode encodings with different byte‑length strategies.
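To make these terms concrete, here is a minimal sketch (assuming Python 3, where ord() returns a character's code point and str.encode() applies an encoding):

```python
# Character -> code point (the number assigned by the character set)
print(ord("A"))             # 65
print(hex(ord("Ω")))        # 0x3a9  (code point U+03A9)

# Code point -> binary pattern (what the encoding actually stores)
print("A".encode("utf-8"))  # b'A'        (1 byte)
print("Ω".encode("utf-8"))  # b'\xce\xa9' (2 bytes)
```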
1. ASCII – The Foundation
ASCII (American Standard Code for Information Interchange) defines 128 characters (0–127). Each character is represented by a 7‑bit binary number; in practice it is stored in an 8‑bit byte with the most‑significant bit set to 0.
| Decimal | Hex | Binary (7‑bit) | Character |
|---------|-----|----------------|-----------|
| 32      | 20  | 010 0000       | Space     |
| 48      | 30  | 011 0000       | 0         |
| 65      | 41  | 100 0001       | A         |
| 97      | 61  | 110 0001       | a         |
| 126     | 7E  | 111 1110       | ~         |
Example: The word Hi! in ASCII
H → 72₁₀ = 01001000₂
i → 105₁₀ = 01101001₂
! → 33₁₀ = 00100001₂
Stored as the byte sequence 01001000 01101001 00100001.
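This example can be checked with a short Python sketch (ord() and format() are built-ins):

```python
# Print each character of "Hi!" as an 8-bit ASCII byte
for ch in "Hi!":
    print(ch, "->", ord(ch), "=", format(ord(ch), "08b"))
# H -> 72 = 01001000
# i -> 105 = 01101001
# ! -> 33 = 00100001
```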
2. Extended ASCII (8‑bit)
Many systems needed more than 128 symbols (e.g., accented letters, graphical symbols). Various 8‑bit extensions added characters in the range 128–255. These extensions are not universal; they differ by region (e.g., ISO‑8859‑1 for Western Europe, Windows‑1252).
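A small Python sketch illustrating this (assuming Python 3; 'latin-1' and 'cp1252' are the standard-library names for ISO‑8859‑1 and Windows‑1252):

```python
# é happens to sit at the same position (233) in both Western European code pages
print("é".encode("latin-1"))  # b'\xe9'
print("é".encode("cp1252"))   # b'\xe9'

# € exists in Windows-1252 but not in ISO-8859-1
print("€".encode("cp1252"))   # b'\x80'
try:
    "€".encode("latin-1")
except UnicodeEncodeError as e:
    print("not in latin-1:", e)
```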
3. Unicode – A Global Standard
Unicode assigns a unique code point to every character in virtually all writing systems, emojis, and technical symbols. Code points are written as U+XXXX where XXXX is a hexadecimal number.
| Character | Unicode name | Code point | UTF‑8 (hex) | UTF‑16 (hex) |
|-----------|--------------|------------|-------------|--------------|
| A  | LATIN CAPITAL LETTER A     | U+0041  | 41          | 0041      |
| Ω  | GREEK CAPITAL LETTER OMEGA | U+03A9  | CE A9       | 03A9      |
| अ  | DEVANAGARI LETTER A        | U+0905  | E0 A4 85    | 0905      |
| 😀 | GRINNING FACE              | U+1F600 | F0 9F 98 80 | D83D DE00 |
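The table rows can be reproduced with Python's standard unicodedata module (bytes.hex() with a separator needs Python 3.8 or later):

```python
import unicodedata

for ch in ["A", "Ω", "अ", "😀"]:
    print(f"U+{ord(ch):04X}",
          unicodedata.name(ch),
          ch.encode("utf-8").hex(" ").upper())
# U+0041 LATIN CAPITAL LETTER A 41
# U+03A9 GREEK CAPITAL LETTER OMEGA CE A9
# U+0905 DEVANAGARI LETTER A E0 A4 85
# U+1F600 GRINNING FACE F0 9F 98 80
```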
4. Unicode Encoding Schemes
UTF‑8 – Variable‑length (1 to 4 bytes). Compatible with ASCII for code points 0–127.
UTF‑16 – Uses 2 bytes for most common characters; supplementary characters (U+10000 and above) use 4 bytes (surrogate pairs).
UTF‑32 – Fixed 4‑byte representation for every code point (simple but space‑inefficient).
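A quick way to compare the three schemes is to encode the same characters and count the bytes (a sketch assuming Python 3.8+; the '-be' codec variants are used so no byte-order mark is prepended):

```python
for ch in ["A", "Ω", "😀"]:
    u8  = ch.encode("utf-8")
    u16 = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    u32 = ch.encode("utf-32-be")
    print(ch, len(u8), len(u16), len(u32), u16.hex(" ").upper())
# A  1 2 4  00 41
# Ω  2 2 4  03 A9
# 😀 4 4 4  D8 3D DE 00   (surrogate pair)
```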
5. Converting a Character to Binary – Step‑by‑Step
Example: Encode the character é (Latin small letter e with acute) using UTF‑8.
Find the Unicode code point: U+00E9 (hex) = 233₁₀.
Since 233 is greater than 127, UTF‑8 uses a 2‑byte pattern: 110xxxxx 10xxxxxx.
Write 233 in binary, padded to the 11 bits a 2‑byte sequence can hold: 000 1110 1001.
Split the bits to fit the pattern:
First 5 bits → 00011 → fill the xxxxx of 110xxxxx, giving 11000011.
Last 6 bits → 101001 → fill the xxxxxx of 10xxxxxx, giving 10101001.
Result: é is encoded as 11000011 10101001 = C3 A9 (hex).
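Python confirms the result (assuming Python 3.8+ for bytes.hex() with a separator):

```python
b = "é".encode("utf-8")
print(b.hex(" ").upper())                     # C3 A9
print(" ".join(format(x, "08b") for x in b))  # 11000011 10101001
print(b.decode("utf-8"))                      # é  (round-trip back to the character)
```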