Describe the format of binary floating-point real numbers

Published by Patrick Mutisya · 14 days ago

Cambridge A-Level Computer Science 9618 – Floating‑point Numbers

13.3 Floating‑point Numbers – Representation and Manipulation

Objective

Describe the format of binary floating‑point real numbers as used in the IEEE 754 standard.

Why Floating‑point?

Floating‑point representation allows a very wide range of real numbers to be stored using a fixed number of bits. It does this by letting the binary point “float”, much as scientific notation moves the decimal point in base‑10.

General Form

A binary floating‑point number is expressed as

\$\text{value}=(-1)^{\text{sign}}\times 1.\text{fraction}\times 2^{\text{exponent}-\text{bias}}\$

where:

  • sign – 0 for positive, 1 for negative.
  • fraction – the mantissa (also called significand) stored without the leading 1 for normalised numbers.
  • exponent – an unsigned integer that is biased.
  • bias – a constant that allows both positive and negative exponents to be represented.
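
To make the formula concrete, here is a minimal Python sketch that evaluates it directly from its components. The function name decode_normalised is an illustrative choice, and the single‑precision field sizes (23 fraction bits, bias 127) from the next section are used purely as an example.

```python
# Illustrative sketch: evaluate value = (-1)^sign * 1.fraction * 2^(exponent - bias)
# using single-precision parameters (23 fraction bits, bias 127) as an example.

def decode_normalised(sign: int, fraction_bits: int, exponent: int,
                      bias: int = 127, fraction_width: int = 23) -> float:
    significand = 1 + fraction_bits / (1 << fraction_width)   # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)

# Example: sign 0, exponent 10000010 (130), fraction 10110100...0  ->  +13.625
print(decode_normalised(0, 0b10110100000000000000000, 0b10000010))
```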

IEEE 754 Single Precision (32‑bit)

Structure of a 32‑bit word:

  • Sign – 1 bit – indicates the sign of the number.
  • Exponent – 8 bits – encodes the exponent plus a bias of 127.
  • Fraction (Mantissa) – 23 bits – stores the fractional part; an implicit leading 1 is assumed for normalised values.
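
One way to see these fields in practice is to reinterpret the 32‑bit encoding of a value as an integer and mask out each part. The sketch below does this with Python's standard struct module; the helper name fields_single is just an illustrative choice.

```python
import struct

def fields_single(x: float) -> tuple[int, int, int]:
    """Return the (sign, exponent, fraction) fields of x encoded as IEEE 754 single precision."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]   # the 32-bit pattern as an integer
    sign     = (bits >> 31) & 0x1        # bit 31
    exponent = (bits >> 23) & 0xFF       # bits 30-23 (8 bits)
    fraction =  bits        & 0x7FFFFF   # bits 22-0  (23 bits)
    return sign, exponent, fraction

print(fields_single(-13.625))   # (1, 130, 5898240): exponent 130 = 3 + 127
```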

Normalised Numbers

When the exponent field is neither all zeros nor all ones, the number is normalised:

\$\text{value}=(-1)^{s}\times 1.f\times 2^{e-127}\$

Here \$s\$ is the sign bit, \$f\$ is the 23‑bit fraction interpreted as a binary fraction, and \$e\$ is the unsigned exponent value.
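
For example, the pattern \$0\;10000001\;01000000000000000000000\$ decodes as \$(-1)^{0}\times 1.01_2\times 2^{129-127}=1.25\times 4=5\$.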

Denormalised (Subnormal) Numbers

If the exponent field is all zeros (and the fraction is non‑zero), the implicit leading digit is 0 instead of 1 and the exponent is fixed at \$-126\$:

\$\text{value}=(-1)^{s}\times 0.f\times 2^{-126}\$

Denormalised numbers fill the gap between the smallest normalised value and zero, providing gradual underflow.
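
For example, the smallest positive subnormal value has exponent field 0 and fraction 000…001, giving \$2^{-23}\times 2^{-126}=2^{-149}\$. A quick Python check (a sketch using the standard struct module) confirms this:

```python
import struct

# Smallest positive subnormal single: sign 0, exponent 00000000, fraction 000...001
bits = 0x00000001
value = struct.unpack('>f', struct.pack('>I', bits))[0]
print(value)                   # 1.401298464324817e-45
print(value == 2.0 ** -149)    # True: 2^-23 * 2^-126 = 2^-149
```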

Special Values

  • Exponent = 255 (all ones) and fraction = 0 → \$+\infty\$ (if sign = 0) or \$-\infty\$ (if sign = 1).
  • Exponent = 255 and fraction ≠ 0 → Not‑a‑Number (NaN), used for undefined results such as \$0/0\$.
  • Exponent = 0 and fraction = 0 → Signed zero (\$+0\$ or \$-0\$).
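
These patterns can be checked directly in Python; the helper single_from_bits below is an illustrative sketch that reinterprets a 32‑bit pattern as a single‑precision value.

```python
import math
import struct

def single_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack('>f', struct.pack('>I', bits))[0]

print(single_from_bits(0x7F800000))              # inf   : sign 0, exponent 255, fraction 0
print(single_from_bits(0xFF800000))              # -inf  : sign 1, exponent 255, fraction 0
print(math.isnan(single_from_bits(0x7FC00000)))  # True  : exponent 255, fraction != 0 (NaN)
print(single_from_bits(0x80000000))              # -0.0  : sign 1, exponent 0, fraction 0
```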

IEEE 754 Double Precision (64‑bit)

Structure of a 64‑bit word:

  • Sign – 1 bit.
  • Exponent – 11 bits – bias of 1023.
  • Fraction (Mantissa) – 52 bits.

Normalised double‑precision numbers follow

\$\text{value}=(-1)^{s}\times 1.f\times 2^{e-1023}\$
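
Python's built‑in float is itself an IEEE 754 double, so the same masking idea exposes the 64‑bit fields directly; the helper fields_double below is an illustrative sketch, not a library function.

```python
import struct

def fields_double(x: float) -> tuple[int, int, int]:
    """Return the (sign, exponent, fraction) fields of a 64-bit IEEE 754 double."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign     = (bits >> 63) & 0x1
    exponent = (bits >> 52) & 0x7FF            # 11-bit exponent field
    fraction =  bits        & ((1 << 52) - 1)  # 52-bit fraction field
    return sign, exponent, fraction

s, e, f = fields_double(-13.625)
print(s, e - 1023, 1 + f / 2**52)   # 1 3 1.703125  ->  -(1.703125 * 2^3) = -13.625
```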

Converting a Decimal Number to IEEE 754 Single Precision

  1. Write the absolute value in binary.
  2. Normalise it so that there is exactly one non‑zero digit to the left of the binary point.
  3. Determine the biased exponent: \$e = \text{(power of 2 obtained in step 2)} + 127\$.
  4. Take the 23 bits to the right of the binary point as the fraction (pad with zeros if needed).
  5. Set the sign bit (0 for positive, 1 for negative).
  6. Combine the three fields into a 32‑bit pattern.
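
As a rough sketch of these six steps, the hypothetical helper to_single_bits below reproduces the manual method for non‑zero normalised values only, truncating rather than rounding the fraction so that it mirrors hand conversion.

```python
def to_single_bits(x: float) -> str:
    """Follow the six manual steps for a non-zero normalised value.
    The fraction is truncated rather than rounded, to mirror hand conversion."""
    sign = 1 if x < 0 else 0                     # step 5 (recorded now, used at the end)
    magnitude = abs(x)                           # step 1 works on the absolute value
    power = 0
    while magnitude >= 2:                        # step 2: move the binary point left
        magnitude /= 2
        power += 1
    while magnitude < 1:                         # step 2: move the binary point right
        magnitude *= 2
        power -= 1
    exponent = power + 127                       # step 3: add the bias
    fraction = int((magnitude - 1) * (1 << 23))  # step 4: 23 bits after the point
    return f"{sign:01b} {exponent:08b} {fraction:023b}"   # step 6: combine the fields

print(to_single_bits(5.0))   # 0 10000001 01000000000000000000000
```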

Example: Convert \$-13.625\$ to single precision

Step‑by‑step:

  1. Binary of \$13.625\$ is \$1101.101\$.
  2. Normalise: \$1.101101 \times 2^{3}\$.
  3. Exponent field: \$3 + 127 = 130 = 10000010_2\$.
  4. Fraction field: bits after the leading 1 → \$10110100000000000000000\$ (23 bits).
  5. Sign bit: \$1\$ (negative).
  6. Resulting 32‑bit pattern:

    \$1\;10000010\;10110100000000000000000\$.
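
The same pattern can be cross‑checked against Python's own single‑precision packing:

```python
import struct

bits = struct.unpack('>I', struct.pack('>f', -13.625))[0]
print(f"{bits:032b}")   # 11000001010110100000000000000000
#                          = 1 | 10000010 | 10110100000000000000000
```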

Common Pitfalls

  • For very large or very small numbers, the exponent may overflow (producing \$\infty\$) or underflow (producing a denormalised number or zero).
  • Rounding errors: the fraction field can hold only a limited number of bits, so many decimal fractions cannot be represented exactly.
  • Confusing the bias with the actual exponent; always subtract the bias when evaluating the value.
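
The rounding pitfall is easy to demonstrate: 0.1 has no exact binary representation, so repeated additions of the stored approximation drift away from the exact result.

```python
# 0.1 has no finite binary expansion, so the stored value is only an approximation.
total = sum(0.1 for _ in range(10))
print(total)          # 0.9999999999999999 rather than 1.0
print(total == 1.0)   # False
```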

Suggested diagram: layout of the 32‑bit single‑precision word showing sign, exponent, and fraction fields, with example bit pattern for –13.625.

Summary

The IEEE 754 floating‑point format stores a real number in three parts: a sign bit, a biased exponent, and a fraction (mantissa). Normalised numbers use an implicit leading 1, while denormalised numbers allow representation of values closer to zero. Special exponent patterns encode infinities, NaN, and signed zero. Understanding this format is essential for analysing precision, range, and rounding behaviour in computer programs.