Show understanding of and be able to represent character data in its internal binary form, depending on the character set used


Cambridge A-Level Computer Science 9618 – 1.1 Data Representation

1.1 Data Representation – Character Data

Learning Objective

Show understanding of, and be able to represent, character data in its internal binary form, depending on the character set used.

Why Character Representation Matters

Computers store all information as binary digits (bits). To manipulate text, a mapping between characters and binary patterns is required. This mapping is defined by a character set (or code set) and an encoding scheme.

Key Terminology

  • Character set – The collection of symbols that can be represented (e.g., letters, digits, punctuation).
  • Code point – The numerical value assigned to a character in a character set.
  • Encoding – The method used to translate code points into binary patterns for storage or transmission.
  • ASCII – A 7‑bit character set originally designed for English.
  • Unicode – A universal character set that can represent over a million different symbols.
  • UTF‑8, UTF‑16, UTF‑32 – Common Unicode encodings with different byte‑length strategies.

1. ASCII – The Foundation

ASCII (American Standard Code for Information Interchange) defines 128 characters (0–127). Each character is represented by a 7‑bit binary number; in practice it is stored in an 8‑bit byte with the most‑significant bit set to 0.

| Decimal | Hex | Binary (7‑bit) | Character |
|---------|-----|----------------|-----------|
| 32      | 20  | 010 0000       | Space     |
| 48      | 30  | 011 0000       | 0         |
| 65      | 41  | 100 0001       | A         |
| 97      | 61  | 110 0001       | a         |
| 126     | 7E  | 111 1110       | ~         |

Example: The word Hi! in ASCII

  • H → 72₁₀ = 01001000₂
  • i → 105₁₀ = 01101001₂
  • ! → 33₁₀ = 00100001₂

Stored as the byte sequence 01001000 01101001 00100001.
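
A few lines of Python can be used to check a conversion like this (an illustrative sketch for revision: the built-in ord() gives the code point and format() pads it to 8 bits):

```python
# Print each character of "Hi!" with its decimal code and 8-bit binary form.
for ch in "Hi!":
    code = ord(ch)                         # decimal ASCII value, e.g. H -> 72
    print(ch, code, format(code, "08b"))   # e.g. H 72 01001000
```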

2. Extended ASCII (8‑bit)

Many systems needed more than 128 symbols (e.g., accented letters, graphical symbols). Various 8‑bit extensions added characters in the range 128–255. These extensions are not universal; they differ by region (e.g., ISO‑8859‑1 for Western Europe, Windows‑1252).
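
The practical consequence is that the same byte value can stand for different characters depending on which 8‑bit extension is assumed. A minimal Python sketch, using the standard latin-1 and iso8859-5 codecs, makes the point:

```python
# One byte, two meanings: extended ASCII depends on the code page in use.
raw = bytes([0xE9])               # the single byte 1110 1001 (233)
print(raw.decode("latin-1"))      # 'é'  in ISO-8859-1 (Western Europe)
print(raw.decode("iso8859-5"))    # 'щ'  in ISO-8859-5 (Cyrillic)
```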

3. Unicode – A Global Standard

Unicode assigns a unique code point to every character in virtually all writing systems, as well as to emoji and technical symbols. Code points are written as U+ followed by a hexadecimal number of at least four digits, e.g. U+0041 or U+1F600.

| Character | Unicode name               | Code point | UTF‑8 (hex) | UTF‑16 (hex) |
|-----------|----------------------------|------------|-------------|--------------|
| A         | LATIN CAPITAL LETTER A     | U+0041     | 41          | 0041         |
| Ω         | GREEK CAPITAL LETTER OMEGA | U+03A9     | CE A9       | 03A9         |
| अ         | DEVANAGARI LETTER A        | U+0905     | E0 A4 85    | 0905         |
| 😀        | GRINNING FACE              | U+1F600    | F0 9F 98 80 | D83D DE00    |
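
This table can be reproduced with Python's built-in encode() method, which is a convenient way to check code points and byte sequences while revising (a sketch, not required exam syntax):

```python
# Show each character's code point, UTF-8 bytes and UTF-16 (big-endian) code units.
def hex_bytes(data):
    return " ".join(f"{b:02X}" for b in data)

for ch in "AΩअ😀":
    print(ch, f"U+{ord(ch):04X}",
          hex_bytes(ch.encode("utf-8")),       # UTF-8 byte sequence
          hex_bytes(ch.encode("utf-16-be")))   # UTF-16 code units, big-endian
```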

4. Unicode Encoding Schemes

  1. UTF‑8 – Variable‑length (1 to 4 bytes). Compatible with ASCII for code points 0–127.
  2. UTF‑16 – Uses 2 bytes for most common characters; supplementary characters (U+10000 and above) use 4 bytes (surrogate pairs).
  3. UTF‑32 – Fixed 4‑byte representation for every code point (simple but space‑inefficient).
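
The difference between the three schemes is easiest to see by counting bytes. The sketch below uses the big-endian codec names (utf-16-be, utf-32-be) so that Python does not prepend a byte order mark to the output:

```python
# Bytes needed to store the same character under each Unicode encoding.
for ch in ("A", "Ω", "😀"):
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2 or 4 bytes (variable length)
          len(ch.encode("utf-16-be")),  # 2 bytes, or 4 for a surrogate pair
          len(ch.encode("utf-32-be")))  # always 4 bytes
```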

5. Converting a Character to Binary – Step‑by‑Step

Example: Encode the character é (Latin small letter e with acute) using UTF‑8.

  1. Find the Unicode code point: U+00E9 (hex) = 233₁₀.
  2. Since 233 is greater than 127, UTF‑8 uses a 2‑byte pattern: 110xxxxx 10xxxxxx. This pattern has room for 5 + 6 = 11 payload bits.
  3. Write the code point as 11 bits: 000 1110 1001 (233₁₀ = 11101001₂, padded with leading zeros to 11 bits).
  4. Split the bits to fit the pattern:

    • First 5 bits → 00011 → placed in the xxxxx of the first byte.
    • Remaining 6 bits → 101001 → placed in the xxxxxx of the second byte.

  5. Insert into the pattern:

    • First byte: 110 00011 → 11000011 = 0xC3.
    • Second byte: 10 101001 → 10101001 = 0xA9.

  6. Resulting UTF‑8 byte sequence: C3 A9 (hex) or 11000011 10101001 (binary).
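
The result can be confirmed with Python's built-in encoder; a quick sanity check rather than part of an exam answer:

```python
# é (U+00E9) should encode to the two bytes C3 A9 in UTF-8.
encoded = "é".encode("utf-8")
print([f"{b:02X}" for b in encoded])   # ['C3', 'A9']
print([f"{b:08b}" for b in encoded])   # ['11000011', '10101001']
```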

6. Practical Tips for Exams

  • Memorise that ASCII covers the range 0–127, along with the codes of key anchor characters such as '0' (48), 'A' (65) and 'a' (97); the remaining patterns can be derived from these.
  • Know the UTF‑8 leading‑byte patterns (a short code sketch after this list shows how a code point is packed into them):

    • 1‑byte: 0xxxxxxx
    • 2‑byte: 110xxxxx 10xxxxxx
    • 3‑byte: 1110xxxx 10xxxxxx 10xxxxxx
    • 4‑byte: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

  • When converting, always start from the Unicode code point, then apply the appropriate UTF‑8 pattern.
  • Remember that UTF‑16 data may be stored big‑endian or little‑endian (often signalled by a byte order mark); the code point itself is unchanged either way.
  • Practise converting between decimal, hexadecimal, and binary – the relationships 2⁸ = 256 and 2¹⁶ = 65536 are useful for checking ranges.
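
The leading-byte patterns above can be turned into a small routine that packs any code point into UTF-8 by hand, mirroring the manual method of section 5. This is only an illustrative sketch (the helper name utf8_bytes is invented here); real programs would simply call str.encode("utf-8"):

```python
# Pack a Unicode code point into UTF-8 bytes using the leading-byte patterns.
def utf8_bytes(cp):
    if cp <= 0x7F:                        # 1 byte:  0xxxxxxx
        return [cp]
    if cp <= 0x7FF:                       # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp <= 0xFFFF:                      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18),            # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]

for cp in (0x41, 0xE9, 0x3A9, 0x1F600):
    print(f"U+{cp:04X}", " ".join(f"{b:02X}" for b in utf8_bytes(cp)))
    # U+0041 41 / U+00E9 C3 A9 / U+03A9 CE A9 / U+1F600 F0 9F 98 80
```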

7. Sample Exam Questions

  1. Give the 8‑bit binary representation of the ASCII character #.
  2. Encode the Unicode character Ω (U+03A9) in UTF‑8 and show the binary result.
  3. Explain why the string “Café” requires more than 7 bits per character when stored using Unicode.
  4. Convert the UTF‑16 big‑endian byte sequence 0xD8 0x3D 0xDE 0x00 to the corresponding Unicode character (show the code point).
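
When practising questions like these, Python can be used to check a hand-worked answer (a revision aid only); for example, question 4 can be verified as follows:

```python
# Question 4: decode the UTF-16 big-endian byte sequence D8 3D DE 00.
raw = bytes([0xD8, 0x3D, 0xDE, 0x00])
ch = raw.decode("utf-16-be")
print(ch, f"U+{ord(ch):04X}")   # prints the character and its code point
```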

Suggested diagram: Flowchart showing the steps from character → Unicode code point → chosen encoding (ASCII/UTF‑8/UTF‑16) → binary representation.