Show understanding of and be able to represent character data in its internal binary form, depending on the character set used

1.1 Data Representation – Character Data

Learning objective (Cambridge AS & A‑Level)

Show understanding of, and be able to represent, character data in its internal binary form, depending on the character set used.

Why character representation matters

  • All information in a computer is stored as bits (0 or 1).
  • To read, write or transmit text we need a systematic mapping between the characters we see and the binary patterns the machine stores.
  • This mapping is defined by a character set (the collection of symbols) and an encoding scheme (the rule that turns each symbol into a binary pattern).

Key terminology

| Term | Definition |
| --- | --- |
| Character set (code set) | The complete list of symbols that may be represented (letters, digits, punctuation, emojis, etc.). |
| Code point | The numeric value assigned to a character within a character set. |
| Encoding | The method that translates a code point into a binary pattern for storage or transmission. |
| ASCII | 7‑bit character set for basic English text (128 code points, 0–127). |
| Unicode | Universal character set covering virtually every written language, symbols and emojis. |
| UTF‑8, UTF‑16, UTF‑32 | Common Unicode encodings that differ in the number of bytes used per code point. |

Binary prefixes vs. decimal prefixes (syllabus requirement)

Memory‑size specifications often use decimal prefixes (kilo, mega, giga) but the computer hardware works in binary multiples. The syllabus distinguishes them as follows:

| Prefix | Symbol | Decimal value | Binary value (2ⁿ) |
| --- | --- | --- | --- |
| kilo | k | 10³ = 1 000 | 2¹⁰ = 1 024 (Ki) |
| mega | M | 10⁶ = 1 000 000 | 2²⁰ = 1 048 576 (Mi) |
| giga | G | 10⁹ = 1 000 000 000 | 2³⁰ = 1 073 741 824 (Gi) |

Example: 2 MiB = 2 × 2²⁰ bytes = 2 097 152 bytes, whereas 2 MB = 2 × 10⁶ bytes = 2 000 000 bytes.
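
The gap between the two conventions can be checked numerically. The following is a minimal Python sketch; the constant names are illustrative and not part of the syllabus.

```python
# Decimal (SI) and binary (IEC) multipliers, expressed in bytes.
KB, MB, GB = 10**3, 10**6, 10**9
KIB, MIB, GIB = 2**10, 2**20, 2**30

two_mib = 2 * MIB   # 2 MiB in bytes
two_mb = 2 * MB     # 2 MB in bytes

print(two_mib)            # 2097152
print(two_mb)             # 2000000
print(two_mib - two_mb)   # 97152 bytes of difference
```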

Number systems used in the syllabus

| System | Base | Typical use | Example conversion (45) |
| --- | --- | --- | --- |
| Binary | 2 | Low‑level data, logic circuits | 45₁₀ = 0010 1101₂ |
| Octal | 8 | Legacy Unix permissions, some low‑level debugging | 45₁₀ = 55₈ |
| Decimal | 10 | Human‑readable numbers | 45₁₀ = 45₁₀ |
| Hexadecimal | 16 | Memory addresses, colour codes, binary‑hex shortcuts | 45₁₀ = 2D₁₆ |
| BCD (Binary‑Coded Decimal) | n/a | Financial & embedded systems | 45₁₀ = 0100 0101 (one 4‑bit group per decimal digit: 4 = 0100, 5 = 0101) |
| One’s complement | n/a | Historical signed‑number representation | –5 → invert 0000 0101 → 1111 1010 |
| Two’s complement | n/a | Standard signed‑number representation | –5 → invert 0000 0101 → 1111 1010 + 1 = 1111 1011 |
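
The conversions of 45 and –5 listed here can be reproduced with a short Python sketch. The helper names (to_bcd, ones_complement, twos_complement) are hypothetical and written only for this illustration; an 8‑bit width is assumed for the signed patterns.

```python
def to_bcd(value: int) -> str:
    """Encode each decimal digit of a non-negative integer as its own 4-bit group."""
    return " ".join(format(int(digit), "04b") for digit in str(value))


def ones_complement(magnitude: int, bits: int = 8) -> str:
    """Pattern for -magnitude in one's complement: invert every bit."""
    mask = (1 << bits) - 1
    return format(~magnitude & mask, f"0{bits}b")


def twos_complement(magnitude: int, bits: int = 8) -> str:
    """Pattern for -magnitude in two's complement: invert every bit, then add 1."""
    mask = (1 << bits) - 1
    return format((~magnitude + 1) & mask, f"0{bits}b")


n = 45
print(format(n, "08b"))    # 00101101  (binary)
print(format(n, "o"))      # 55        (octal)
print(format(n, "X"))      # 2D        (hexadecimal)
print(to_bcd(n))           # 0100 0101 (BCD)
print(ones_complement(5))  # 11111010  (-5, one's complement)
print(twos_complement(5))  # 11111011  (-5, two's complement)
```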

Binary arithmetic (addition, subtraction, overflow)

Addition example (4‑bit)

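  1011₂
+ 0110₂
--------
 10001₂ (5‑bit result)

If the result must fit in 4 bits, the left‑most carry is discarded, leaving 0001₂. For unsigned values (11 + 6 = 17) this carry out means the true result does not fit in 4 bits, and the processor sets its carry flag. Read as signed two’s‑complement values (–5 + 6 = 1) the 4‑bit result is correct, so no signed overflow has occurred.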
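
The effect of discarding the carry can be modelled in a few lines. This is a minimal Python sketch rather than a description of any particular processor; the function name add_fixed_width is illustrative.

```python
def add_fixed_width(a: int, b: int, bits: int = 4):
    """Add two unsigned values; return (result kept in `bits` bits, carry out)."""
    total = a + b
    mask = (1 << bits) - 1
    result = total & mask       # keep only the low `bits` bits
    carry_out = total >> bits   # 1 if a bit was lost off the left-hand end
    return result, carry_out


result, carry = add_fixed_width(0b1011, 0b0110)
print(format(result, "04b"), carry)   # 0001 1
```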

Subtraction example (4‑bit, using two’s complement)

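  0110₂ (6)
– 0011₂ (3)
--------
  0011₂ (3)

To subtract, invert the subtrahend and add 1 (two’s complement), then add the result to the first number: 0011 → 1100 → 1101, and 0110 + 1101 = 1 0011. Discarding the carry out of the fourth bit leaves 0011₂ (3). If a borrow is needed (the subtrahend is larger than the first number), the hardware sets the borrow flag.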
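
The invert‑and‑add‑1 method can be traced step by step. Below is a minimal Python sketch for unsigned 4‑bit operands; the function name is hypothetical, and how the carry out maps onto a borrow flag varies between processors.

```python
def subtract_twos_complement(a: int, b: int, bits: int = 4):
    """Compute a - b as a + (inverted b) + 1; return (result, carry out)."""
    mask = (1 << bits) - 1
    inverted_b = ~b & mask          # invert every bit of the subtrahend
    total = a + inverted_b + 1      # adding 1 completes the two's complement
    result = total & mask           # keep only the low `bits` bits
    carry_out = total >> bits       # 0 here means a borrow was needed
    return result, carry_out


print(subtract_twos_complement(0b0110, 0b0011))   # (3, 1): 6 - 3 = 3, no borrow
print(subtract_twos_complement(0b0011, 0b0110))   # (13, 0): 1101 is -3, borrow needed
```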

Signed number representations – quick checklist

  • Is the question about representing a negative integer? → use one’s or two’s complement.
  • Does the exam ask for the binary pattern that a processor would store? → use two’s complement (the standard in modern CPUs).
  • For a quick sanity check:

    • Range of an n‑bit two’s‑complement number: –2ⁿ⁻¹ … 2ⁿ⁻¹ – 1.
    • Positive numbers have the same binary as unsigned; negative numbers start with a leading 1.
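
Both sanity checks can be expressed in code. The sketch below is illustrative only (the function names are not from the syllabus): it computes the range of an n‑bit two’s‑complement number and decodes a bit pattern by looking at its leading bit.

```python
def twos_complement_range(bits: int) -> tuple:
    """Smallest and largest values an n-bit two's-complement number can hold."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def decode_twos_complement(pattern: str) -> int:
    """Interpret a string of 0s and 1s as a signed two's-complement integer."""
    raw = int(pattern, 2)
    if pattern[0] == "1":            # a leading 1 marks a negative number
        raw -= 1 << len(pattern)     # subtract 2^n to recover the signed value
    return raw


print(twos_complement_range(8))             # (-128, 127)
print(decode_twos_complement("11111011"))   # -5
print(decode_twos_complement("00000101"))   # 5
```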

ASCII – the foundation (7‑bit)

ASCII defines 128 characters (code points 0–127). Each character is stored in an 8‑bit byte; the most‑significant bit is always 0.

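| Decimal | Hex | 7‑bit binary | Character |
| --- | --- | --- | --- |
| 0 | 00 | 000 0000 | NUL (null) |
| 9 | 09 | 000 1001 | HT (tab) |
| 10 | 0A | 000 1010 | LF (line‑feed) |
| 13 | 0D | 000 1101 | CR (carriage‑return) |
| 48 | 30 | 011 0000 | 0 |
| 49 | 31 | 011 0001 | 1 |
| 57 | 39 | 011 1001 | 9 |
| 65 | 41 | 100 0001 | A |
| 66 | 42 | 100 0010 | B |
| 90 | 5A | 101 1010 | Z |
| 97 | 61 | 110 0001 | a |
| 98 | 62 | 110 0010 | b |
| 122 | 7A | 111 1010 | z |
| 32 | 20 | 010 0000 | Space |
| 33 | 21 | 010 0001 | ! |
| 126 | 7E | 111 1110 | ~ |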

Example – the word “Hi!” in ASCII

  • H → 72₁₀ = 01001000₂
  • i → 105₁₀ = 01101001₂
  • ! → 33₁₀ = 00100001₂

Stored as the three‑byte sequence 01001000 01101001 00100001.
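
The same byte sequence can be obtained with Python’s built‑in ASCII codec. This is a short sketch for checking hand‑worked answers; the variable names are illustrative.

```python
text = "Hi!"

ascii_bytes = text.encode("ascii")                       # one byte per character
bit_patterns = [format(b, "08b") for b in ascii_bytes]   # 8-bit binary strings

print(list(ascii_bytes))        # [72, 105, 33]
print(" ".join(bit_patterns))   # 01001000 01101001 00100001
```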

Extended ASCII (8‑bit)

Many languages need more than 128 symbols. Various 8‑bit extensions fill the range 128–255:

  • ISO‑8859‑1 (Latin‑1) – Western European languages.
  • Windows‑1252 – Similar to ISO‑8859‑1 but with extra printable characters.
  • ISO‑8859‑5 – Cyrillic script.

Because the same byte can represent different characters on different systems, these extensions are not universal – a key reason for the adoption of Unicode.
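
The ambiguity is easy to demonstrate by decoding one byte value with different 8‑bit code pages. The sketch below uses Python’s standard codecs ("latin-1", "cp1252", "iso8859_5"); the chosen byte values 0xE9 and 0x80 are just examples.

```python
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")       # ISO-8859-1: é
as_cp1252 = raw.decode("cp1252")        # Windows-1252: é
as_cyrillic = raw.decode("iso8859_5")   # ISO-8859-5: a Cyrillic letter (щ)

# Windows-1252 places printable characters where ISO-8859-1 has control codes.
euro_cp1252 = bytes([0x80]).decode("cp1252")    # € (printable)
ctrl_latin1 = bytes([0x80]).decode("latin-1")   # U+0080, a control character

print(as_latin1, as_cp1252, as_cyrillic)
print(euro_cp1252, f"U+{ord(ctrl_latin1):04X}")
```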

Unicode – a global standard

Unicode assigns a unique code point to every character, symbol, emoji and control code. Code points are written as U+ followed by four to six hexadecimal digits (for example U+0041 or U+1F600).

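| Character | Unicode name | Code point | UTF‑8 (hex) | UTF‑16 (hex) |
| --- | --- | --- | --- | --- |
| A | LATIN CAPITAL LETTER A | U+0041 | 41 | 0041 |
| Ω | GREEK CAPITAL LETTER OMEGA | U+03A9 | CE A9 | 03A9 |
| अ | DEVANAGARI LETTER A | U+0905 | E0 A4 85 | 0905 |
| 😀 | GRINNING FACE | U+1F600 | F0 9F 98 80 | D83D DE00 |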
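
The hex columns can be regenerated from the code points with Python’s built‑in codecs. This is a minimal sketch: table_row is a hypothetical helper, "utf-16-be" is used so that no byte‑order mark is added, and bytes.hex with a separator assumes Python 3.8 or later.

```python
def table_row(code_point: int) -> tuple:
    """Return (U+ label, UTF-8 bytes in hex, UTF-16 big-endian bytes in hex)."""
    ch = chr(code_point)
    label = f"U+{code_point:04X}"
    utf8_hex = ch.encode("utf-8").hex(" ").upper()
    utf16_hex = ch.encode("utf-16-be").hex(" ").upper()
    return label, utf8_hex, utf16_hex


for cp in (0x0041, 0x03A9, 0x0905, 0x1F600):
    print(*table_row(cp), sep=" | ")
```

The UTF‑16 output is printed byte by byte, so the surrogate pair D83D DE00 appears as D8 3D DE 00.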

Unicode encoding schemes

  1. UTF‑8 – Variable length, 1‑4 bytes.

    • 1 byte: 0xxxxxxx (identical to ASCII for 0–127).
    • 2 bytes: 110xxxxx 10xxxxxx
    • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    • 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

  2. UTF‑16 – Uses 2 bytes for the Basic Multilingual Plane (U+0000–U+FFFF). Code points above U+FFFF are encoded as a *surrogate pair* (4 bytes).
  3. UTF‑32 – Fixed 4‑byte representation for every code point (simple but memory‑inefficient).
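
The byte patterns and the surrogate‑pair rule can be written out with bit operations. The sketch below is a teaching aid, not a production encoder: the function names are hypothetical, and it does not reject the reserved surrogate range U+D800–U+DFFF as a real encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Pack a code point into the 1- to 4-byte UTF-8 patterns."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),      # 11110xxx + three 10xxxxxx
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


def utf16_code_units(cp: int) -> list:
    """Return one 16-bit unit for the BMP, or a surrogate pair above U+FFFF."""
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000               # 20-bit value to split in two
    high = 0xD800 | (offset >> 10)      # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)     # bottom 10 bits
    return [high, low]


print(utf8_encode(0x1F600).hex(" "))                  # f0 9f 98 80
print([hex(u) for u in utf16_code_units(0x1F600)])    # ['0xd83d', '0xde00']
```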

Step‑by‑step conversion – character to binary

Example 1: ASCII character ‘#’ (U+0023)

  1. Code point: 35₁₀ = 0x23.
  2. ASCII is 7‑bit, so binary = 0100011. Stored in an 8‑bit byte as 0010 0011.

Example 2: Unicode character ‘é’ (U+00E9) in UTF‑8

  1. Code point: U+00E9 = 233₁₀ = 1110 1001₂.
  2. 233 > 127 → use the 2‑byte UTF‑8 pattern 110xxxxx 10xxxxxx.
  3. The 2‑byte pattern carries 11 payload bits, so pad the value on the left to 11 bits: 000 1110 1001. The first 5 bits are 00011 and the remaining 6 bits are 101001.
  4. Insert:

    • First byte: 11000011 = 0xC3.
    • Second byte: 10101001 = 0xA9.

  5. Resulting UTF‑8 byte sequence: C3 A9 (binary 11000011 10101001).
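
A hand‑worked 2‑byte encoding can be checked mechanically. The sketch below mirrors the pad, split and insert steps using bit strings, then compares the result with Python’s built‑in UTF‑8 encoder; the variable names are illustrative.

```python
code_point = 0x00E9                     # é

bits11 = format(code_point, "011b")     # pad to 11 payload bits: 00011101001
first5, last6 = bits11[:5], bits11[5:]  # 00011 and 101001

byte1 = "110" + first5                  # 11000011
byte2 = "10" + last6                    # 10101001

encoded = bytes([int(byte1, 2), int(byte2, 2)])
print(byte1, byte2)                     # 11000011 10101001
print(encoded.hex(" ").upper())         # C3 A9
print(encoded == "é".encode("utf-8"))   # True
```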

Conversion tip – binary ↔ hexadecimal ↔ decimal

  • Group binary digits in fours (starting from the right) → each group = one hex digit.
  • Convert each hex digit to its decimal value (0–15) and multiply it by its power of 16. Example: 1010 1101₂ → AD₁₆ → 10 × 16 + 13 = 173₁₀.
  • Useful powers:

    • 2⁸ = 256 (one byte)
    • 2¹⁶ = 65 536 (two bytes)
    • 2²⁴ = 16 777 216 (three bytes)
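
The group‑by‑four rule and the place‑value sum translate directly into code. This is a minimal sketch with hypothetical function names; Python’s built‑in int(text, base) does the same job and is used in the last line as a cross‑check.

```python
HEX_DIGITS = "0123456789ABCDEF"


def binary_to_hex(bits: str) -> str:
    """Group binary digits in fours from the right; map each group to a hex digit."""
    bits = bits.replace(" ", "")
    padded_length = -(-len(bits) // 4) * 4     # round up to a multiple of 4
    bits = bits.zfill(padded_length)           # pad with zeros on the left
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(HEX_DIGITS[int(group, 2)] for group in groups)


def hex_to_decimal(hex_text: str) -> int:
    """Sum digit values weighted by powers of 16, working left to right."""
    total = 0
    for digit in hex_text.upper():
        total = total * 16 + HEX_DIGITS.index(digit)
    return total


print(binary_to_hex("1010 1101"))   # AD
print(hex_to_decimal("AD"))         # 173
print(int("10101101", 2))           # 173
```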

Practical tips for exams (Cambridge)

  • Memorise the ASCII range 32–126 (printable characters) and the binary patterns for the first 128 code points.
  • Know the leading‑byte patterns for UTF‑8 (1‑ to 4‑byte forms) – they are a frequent mark‑scheme point.
  • When converting to UTF‑8:

    1. Start from the Unicode code point (hex is easiest).
    2. Choose the correct byte‑length pattern (based on the value).
    3. Split the binary representation into the required groups and fill the pattern.

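  • For UTF‑16, remember that a code point in the Basic Multilingual Plane (U+0000–U+FFFF) is stored unchanged as one 16‑bit unit, while a code point above U+FFFF becomes a surrogate pair; the byte order of each unit (big‑endian vs. little‑endian) may also differ.
  • Practise binary addition and subtraction, and learn to recognise overflow and borrow – the exam often asks you to state whether the flag would be set.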
  • Be comfortable converting between binary, octal, decimal and hexadecimal; the “group‑by‑four” rule makes hex conversion rapid, and “group‑by‑three” for octal.
  • Use the quick checklist for one’s‑ and two’s‑complement questions to avoid sign errors.

Sample exam questions (aligned with syllabus)

  1. Give the 8‑bit binary representation of the ASCII character #.
  2. Encode the Unicode character Ω (U+03A9) in UTF‑8 and show the binary result.
  3. Explain why the string “Café” cannot be stored using only 7‑bit ASCII.
  4. Convert the UTF‑16 big‑endian byte sequence 0xD8 0x3D 0xDE 0x00 to the corresponding Unicode character (show the code point).
  5. Perform the 8‑bit two’s‑complement addition 0101 0011₂ + 1010 1101₂ and indicate whether overflow occurs.
  6. Express the decimal number 45 in BCD and in plain binary.
  7. Convert the octal number 73₈ to binary and hexadecimal.
  8. Subtract 0011 0101₂ from 1010 1100₂ using two’s complement and state whether the borrow flag is set.

Suggested diagram: a flowchart showing Character → Unicode code point → chosen encoding (ASCII / UTF‑8 / UTF‑16 / UTF‑32) → binary representation (grouped by bytes).