13.3 Floating‑point Numbers – Format & Bias
Objective
To describe the binary floating‑point format defined by the IEEE 754 standard, to convert between decimal and binary representations (both ways), and to understand the special values and rounding rules that affect programming calculations.
Why use floating‑point?
- Provides a very wide range of real numbers with a fixed number of bits.
- Works like scientific notation in base‑10: the “point” can “float” left or right.
- Essential for most programming tasks that involve measurements, graphics, simulations, etc.
General binary floating‑point format
A floating‑point number is stored as three fields:
$$\text{value}=(-1)^{\text{sign}}\times 1.\text{fraction}\times 2^{\text{biased exponent}-\text{bias}}$$
- sign – 0 for positive, 1 for negative.
- fraction – the bits of the mantissa (or significand) after the binary point; for normalised numbers the leading 1 is implicit and is not stored.
- biased exponent – an unsigned integer that stores the true exponent plus a bias.
- bias – $2^{k-1}-1$, where $k$ is the number of exponent bits.
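As a rough illustration of the three fields, the sketch below uses Python's standard `struct` module to reinterpret a value as a 32‑bit pattern and slice out the sign, biased exponent and fraction. The helper name `single_fields` is an assumption made for this example, not part of any library.

```python
import struct

def single_fields(x: float):
    """Return (sign, biased_exponent, fraction) of x stored as an IEEE 754 single."""
    # Pack x as a big-endian 32-bit float, then read the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # top bit
    biased_exponent = (bits >> 23) & 0xFF  # next 8 bits
    fraction = bits & 0x7FFFFF             # low 23 bits
    return sign, biased_exponent, fraction

sign, e, f = single_fields(-0.75)          # -0.75 = -1.1 (binary) x 2^-1
print(sign, e, e - 127, f"{f:023b}")
```

Here the stored exponent is 126, so subtracting the bias of 127 recovers the true exponent of ‑1.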
Biases and exponent ranges (single‑ vs double‑precision)
| Precision | Bits (sign / exponent / fraction) | Bias | Exponent range (unbiased) |
|---|---|---|---|
| Single (32‑bit) | 1 / 8 / 23 | 127 | ‑126 → +127 (normalised) |
| Double (64‑bit) | 1 / 11 / 52 | 1023 | ‑1022 → +1023 (normalised) |
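The bias and the normalised exponent range both follow from the number of exponent bits $k$. A minimal sketch, with the function names `bias` and `normalised_exponent_range` chosen only for this example:

```python
def bias(k: int) -> int:
    """Bias for a k-bit exponent field: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1

def normalised_exponent_range(k: int):
    """Smallest and largest unbiased exponent of a normalised number."""
    b = bias(k)
    # Biased exponents 1 .. 2^k - 2 are normalised; 0 and 2^k - 1 (all 1s) are reserved.
    return 1 - b, (2 ** k - 2) - b

for name, k in (("single", 8), ("double", 11)):
    print(name, bias(k), normalised_exponent_range(k))
```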
Normalised numbers
When the exponent field is neither all 0s nor all 1s, the number is normalised:
$$\text{value}=(-1)^{s}\times 1.f\times 2^{e-\text{bias}}$$
- $s$ – sign bit.
- $f$ – binary fraction formed from the fraction field (e.g. $f=0.101\ldots$).
- $e$ – unsigned integer value of the exponent field (the biased exponent).
Denormalised (sub‑normal) numbers
If the exponent field is all 0s and the fraction is non‑zero, the leading digit is 0 instead of 1 and the exponent is fixed at $1-\text{bias}$:
$$\text{value}=(-1)^{s}\times 0.f\times 2^{1-\text{bias}}$$
- They fill the gap between the smallest normalised value and zero, allowing a smooth under‑flow.
- Because the implicit leading 1 is absent, they have fewer significant bits of precision.
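To see how sub‑normals fill the gap below the smallest normalised value, the sketch below builds three single‑precision patterns directly from their bits. The helper `single_from_bits` is a name assumed for this example; it relies only on the standard `struct` module.

```python
import struct

def single_from_bits(bits: int) -> float:
    """Interpret a 32-bit unsigned integer as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = single_from_bits(0x00800000)     # exponent field 1, fraction 0
largest_subnormal = single_from_bits(0x007FFFFF)   # exponent field 0, fraction all 1s
smallest_subnormal = single_from_bits(0x00000001)  # exponent field 0, fraction 0...01

print(smallest_normal)     # 2^-126
print(largest_subnormal)   # (1 - 2^-23) x 2^-126, just below the smallest normal
print(smallest_subnormal)  # 2^-23 x 2^-126 = 2^-149
```

The smallest sub‑normal has only one significant bit, which is the loss of precision described above.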
Special patterns
| Exponent / Fraction pattern | Meaning | Practical significance | Example (binary pattern) |
|---|---|---|---|
| 11111111… (all 1s) / 000…0 | ±∞ (infinity) | Result of overflow or of dividing a non‑zero number by zero. | Sign = 0 → 0 11111111 00000000000000000000000 (+∞) |
| 11111111… / ≠ 000…0 | NaN (Not‑a‑Number) | Undefined result such as 0/0 or √‑1; any ordered or equality comparison with NaN is false (only ≠ evaluates to true). | 0 11111111 10000000000000000000001 |
| 00000000 / 000…0 | Signed zero (+0 or –0) | Used to preserve sign information in limits and certain algorithms. | 1 00000000 00000000000000000000000 (‑0) |
| 00000000 / ≠ 000…0 | Denormalised (sub‑normal) numbers | Allow representation of values smaller than the smallest normalised number. | 0 00000000 00101000000000000000000 |
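The example patterns in the table can be checked by feeding them to Python as raw bits. This is a sketch only; `single_from_bits` is a helper name assumed for the example.

```python
import math
import struct

def single_from_bits(bits: int) -> float:
    """Interpret a 32-bit unsigned integer as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pos_inf = single_from_bits(0b0_11111111_00000000000000000000000)
nan = single_from_bits(0b0_11111111_10000000000000000000001)
neg_zero = single_from_bits(0b1_00000000_00000000000000000000000)
subnormal = single_from_bits(0b0_00000000_00101000000000000000000)

print(pos_inf, nan, neg_zero, subnormal)
print(nan == nan, nan < 1.0, nan != nan)    # equality and ordering fail; only != holds
print(neg_zero == 0.0, math.copysign(1.0, neg_zero))  # equal to +0 but carries a sign
```

The sub‑normal example decodes as $0.00101_2 \times 2^{-126}$, that is $0.15625 \times 2^{-126}$.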
Rounding rule used by IEEE 754
By default, arithmetic operations use the round‑to‑nearest, ties‑to‑even rule (the standard also defines directed rounding modes, which are not covered here).
- The result is rounded to the nearest representable value.
- If the value lies exactly halfway between two representable numbers, the one whose least‑significant bit is even is chosen.
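A short numeric illustration of the tie rule, assuming (as on essentially all current platforms) that a Python `float` is an IEEE 754 double with 53 significant bits. Above $2^{53}$ only even integers are representable, so odd integers fall exactly halfway between two neighbours:

```python
# 2^53 + 1 lies exactly halfway between 2^53 and 2^53 + 2.
# The significand of 2^53 ends in 0 (even), so the tie goes down.
a = float(2**53 + 1)

# 2^53 + 3 lies exactly halfway between 2^53 + 2 and 2^53 + 4.
# The significand of 2^53 + 4 ends in 0 (even), so the tie goes up.
b = float(2**53 + 3)

print(int(a) - 2**53)  # offset from 2^53 after rounding
print(int(b) - 2**53)
```

Ties are therefore not always rounded in the same direction, which avoids a systematic drift over many operations.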
Illustration (0.1 in single precision)
- Exact binary of 0.1 = 0.0001100110011… (repeating).
- Normalise: $1.1001100110011…_2 \times 2^{-4}$.
- Biased exponent = $-4+127 = 123 = 01111011_2$.
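- Truncate to 23 fraction bits →
10011001100110011001100.
- The discarded bits are 11001100…: the 24th bit is 1 and later bits are non‑zero, so the value lies more than halfway towards the next representable number. This is not a tie, so the fraction is rounded up to 10011001100110011001101.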
- Resulting 32‑bit pattern:
0 01111011 10011001100110011001101.
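The pattern for 0.1 can be reproduced with the standard `struct` module, and `fractions.Fraction` shows the exact value that ends up stored. This is a sketch; note that Python first holds 0.1 as a double and then rounds it to single precision, which gives the same bits in this case.

```python
import struct
from fractions import Fraction

packed = struct.pack(">f", 0.1)            # round 0.1 to single precision
(bits,) = struct.unpack(">I", packed)      # same 4 bytes as an unsigned integer
pattern = f"{bits:032b}"
print(pattern[0], pattern[1:9], pattern[9:])

stored = Fraction(struct.unpack(">f", packed)[0])  # exact value of the stored single
print(stored)                       # an exact ratio with a power-of-two denominator
print(stored > Fraction(1, 10))     # the stored value is slightly above 0.1
```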
Conversion: decimal → IEEE 754 (single precision)
- Write the absolute value in binary (separate integer and fractional parts).
- Normalise so that exactly one non‑zero digit is left of the binary point.
- Count how many positions the point was moved → unbiased exponent.
- Add the bias (127 for single) → biased exponent and convert to binary.
- Take the first 23 bits after the leading 1 as the fraction field; pad with zeros if fewer.
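- If any non‑zero bits were discarded, apply round‑to‑nearest, ties‑to‑even: round up when the discarded bits are more than half of the last stored place, round down when they are less, and on an exact tie choose the fraction whose last bit is 0.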
- Set the sign bit (0 = positive, 1 = negative).
- Combine sign, biased exponent and fraction to obtain the 32‑bit pattern.
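The steps above can be sketched as a small encoder. It is a teaching aid under stated assumptions, not a full implementation: the function name `encode_single` is invented for this example, exact arithmetic is done with `fractions.Fraction`, and zero, sub‑normals and overflow are deliberately rejected.

```python
from fractions import Fraction

def encode_single(value) -> str:
    """Encode a non-zero value in the normalised single-precision range."""
    x = Fraction(value)
    sign = 1 if x < 0 else 0
    x = abs(x)
    if x == 0:
        raise ValueError("zero is a special pattern; not handled in this sketch")
    # Normalise: find the exponent such that 1 <= x < 2.
    exponent = 0
    while x >= 2:
        x /= 2
        exponent += 1
    while x < 1:
        x *= 2
        exponent -= 1
    # Scale the part after the leading 1 to 23 bits and split off what is discarded.
    scaled = (x - 1) * 2**23
    fraction = scaled.numerator // scaled.denominator
    remainder = scaled - fraction
    # Round to nearest; on an exact tie, round so the last bit is even.
    half = Fraction(1, 2)
    if remainder > half or (remainder == half and fraction % 2 == 1):
        fraction += 1
    if fraction == 2**23:          # rounding carried into the leading digit
        fraction = 0
        exponent += 1
    if not -126 <= exponent <= 127:
        raise ValueError("outside the normalised range; not handled in this sketch")
    return f"{sign} {exponent + 127:08b} {fraction:023b}"

print(encode_single("-13.625"))
print(encode_single(Fraction(1, 10)))
```

Passing a string or a `Fraction` keeps the input exact; passing a Python `float` would encode the double that Python already stores.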
Worked example – single precision: –13.625
- Binary of 13.625: $1101.101_2$.
- Normalise: $1.101101_2 \times 2^{3}$.
- Biased exponent: $3+127=130 = 10000010_2$.
- Fraction field: bits after the leading 1 →
10110100000000000000000 (23 bits).
- Sign bit:
1 (negative).
- 32‑bit pattern:
1 10000010 10110100000000000000000.
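As a quick check of this result, the pattern can be turned back into a number with Python's standard `struct` module (a sketch; the variable names are chosen for this example):

```python
import struct

pattern = "1 10000010 10110100000000000000000"
bits = int(pattern.replace(" ", ""), 2)            # 32-bit pattern as an integer
value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]

print(hex(bits))   # the same pattern in hexadecimal
print(value)       # the decoded number
```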
Worked example – double precision: 0.15625
- Binary of 0.15625 = $0.00101_2$.
- Normalise: $1.01_2 \times 2^{-3}$.
- Biased exponent: $-3+1023=1020 = 01111111100_2$ (11 bits).
- Fraction field: after the leading 1 →
0100000000000000000000000000000000000000000000000000 (52 bits).
- Sign bit:
0 (positive).
- 64‑bit pattern:
0 01111111100 0100000000000000000000000000000000000000000000000000.
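The double‑precision fields can be checked the same way, using 64‑bit pack formats. The helper name `double_fields` is an assumption for this example.

```python
import struct

def double_fields(x: float):
    """Return (sign, biased_exponent, fraction) of x stored as an IEEE 754 double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    biased_exponent = (bits >> 52) & 0x7FF   # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)        # low 52 bits
    return sign, biased_exponent, fraction

sign, e, f = double_fields(0.15625)
print(sign, f"{e:011b}", f"{f:052b}")
print(e - 1023)   # unbiased exponent
```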
Conversion: binary → decimal (interpretation)
Given a 32‑bit pattern, recover the decimal value.
- Separate the three fields (sign, exponent, fraction).
- Convert the exponent field to an unsigned integer $e$.
- If $e=0$ → denormalised number (use $2^{1-\text{bias}}$ and no implicit leading 1).
- If $e=255$ → special value (±∞ or NaN) – see the table above.
- Otherwise the number is normalised; compute the unbiased exponent $E=e-\text{bias}$.
- Form the mantissa $M = 1.\text{fraction bits}$ (binary).
- Value = $(-1)^{\text{sign}} \times M \times 2^{E}$.
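The decoding steps can be sketched as follows. The function name `decode_single` and the choice to accept a spaced bit string are assumptions made for this example; exact arithmetic uses `fractions.Fraction` before the final conversion to a Python `float`.

```python
import math
from fractions import Fraction

def decode_single(pattern: str) -> float:
    """Decode a 32-bit pattern such as '0 10000001 0100...0' (spaces optional)."""
    bits = pattern.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    sign = int(bits[0])
    e = int(bits[1:9], 2)        # biased exponent as an unsigned integer
    frac = int(bits[9:], 2)      # 23-bit fraction field as an integer
    bias = 127
    if e == 255:                 # all 1s: infinity or NaN
        if frac != 0:
            return math.nan
        return -math.inf if sign else math.inf
    f = Fraction(frac, 2**23)    # the binary fraction 0.f
    if e == 0:                   # denormalised (or zero): no implicit leading 1
        magnitude = f * Fraction(2) ** (1 - bias)
    else:                        # normalised: implicit leading 1
        magnitude = (1 + f) * Fraction(2) ** (e - bias)
    # copysign keeps the sign even when the magnitude is zero (-0).
    return math.copysign(float(magnitude), -1.0 if sign else 1.0)

print(decode_single("0 10000001 01000000000000000000000"))
print(decode_single("1 10000010 10110100000000000000000"))
```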
Example – decode 0 10000001 01000000000000000000000
- Sign = 0 → positive.
- Exponent bits =
10000001 = 129.
- Unbiased exponent $E = 129-127 = 2$.
- Fraction bits =
01000000000000000000000 → $0.0100000_2 = 0.25$.
- Mantissa $M = 1.010_2 = 1.25$.
- Value = $+1.25 \times 2^{2} = 5.0$.
Overflow, under‑flow and their consequences
- Overflow: occurs when the unbiased exponent of the result would be larger than the maximum normalised exponent (+127 for single, +1023 for double). Under the default rounding rule the result becomes ±∞.
- Under‑flow: occurs when the unbiased exponent would be smaller than the minimum normalised exponent (‑126 for single, ‑1022 for double). The value is then represented as a denormalised number, with reduced precision; if it is still too small, it becomes signed zero.
- Both conditions are detected by the hardware, which produces the appropriate pattern (see the table).
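Both effects can be observed in double precision, assuming a Python `float` is an IEEE 754 double. This is a sketch: plain multiplication and division are used because some other Python operations (for example `2.0 ** 1024`) raise `OverflowError` instead of returning infinity.

```python
import math
import sys

largest = sys.float_info.max             # largest finite double
overflowed = largest * 2                 # exponent would exceed +1023

smallest_normal = sys.float_info.min     # 2^-1022
gradual = smallest_normal / 4            # 2^-1024: stored as a sub-normal

smallest_subnormal = math.ldexp(1.0, -1074)   # 2^-1074, the last value above zero
flushed = smallest_subnormal / 2         # too small even for a sub-normal
flushed_negative = -smallest_subnormal / 2

print(overflowed)
print(gradual, 0 < gradual < smallest_normal)
print(flushed, flushed_negative)
```

The last line shows that the sign survives under‑flow: the negative case yields ‑0 rather than +0.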
Relevance to programming (Cambridge AO1–AO3)
- AO1 – Knowledge: Identify the three fields, the bias, and the special patterns; read or write binary data correctly.
- AO2 – Application: Predict the effect of rounding, overflow or under‑flow on the result of an algorithm.
- AO3 – Analysis: Explain why a calculation yields +∞, NaN, or a denormalised value; compare precision of single vs double in a given context.
Exam‑style checklist
- Identify the bias for the given precision and subtract it when interpreting the exponent.
- State the exponent range (e.g., ‑126 → +127 for normalised single‑precision numbers).
- Convert a decimal number to IEEE 754 (show each step, include rounding decision).
- Decode a given binary pattern to decimal, noting whether it is normalised, denormalised, or a special value.
- Explain the practical meaning of ±∞, NaN, signed zero and sub‑normals.
- Predict the result of an operation that would cause overflow or under‑flow.
- Describe the round‑to‑nearest‑even rule and give a short numeric illustration.
Summary
The IEEE 754 floating‑point format stores a real number in three parts: a sign bit, a biased exponent, and a fraction (mantissa). Normalised numbers use an implicit leading 1, while denormalised numbers allow representation of values closer to zero. Special exponent patterns encode ±∞, NaN and signed zero. Understanding the bias, exponent range, rounding rule, and the conversion process is essential for analysing precision, range, and error‑propagation in A‑Level programming tasks.