13.3 Floating‑point Numbers: Representation and Manipulation

Syllabus statement: Show understanding that binary representations can give rise to rounding errors.

Learning Objectives

  • AO1 – Explain how binary floating‑point numbers are stored, why many decimal fractions cannot be represented exactly, and the effect of rounding errors on program behaviour.
  • AO2 – Perform conversions between decimal numbers and IEEE 754 single‑ and double‑precision formats.
  • AO3 – Analyse the impact of rounding errors and apply strategies to mitigate them in programs.

1. Why Floating‑point?

  • Integer types store only whole numbers.
  • Real‑world data (measurements, scientific constants, monetary values) often require fractions and a very wide range of magnitudes.
  • Floating‑point formats provide a normalised representation that can express both very large and very small numbers using a fixed number of bits.

2. Binary Normalised Form (32‑bit single, 64‑bit double)

A normalised floating‑point value is stored as

$(-1)^{\text{sign}} \times 1.\text{fraction} \times 2^{\text{exponent}-\text{bias}}$

  • Sign bit – 0 for positive, 1 for negative.
  • Implicit leading 1 – the bit left of the binary point is always 1 (except for sub‑normal numbers); therefore it is not stored.
  • Fraction (mantissa) – bits after the binary point.
  • Exponent – stored with a bias so that both positive and negative powers of two can be represented using unsigned bit patterns.
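To see how the three fields combine, here is a minimal Python sketch (the helper decodeNormal is ours, purely for illustration, and uses the single-precision parameters introduced in the next section) that rebuilds the value of a normal number from its stored fields:

def decodeNormal(sign, storedExp, fracValue, bias, fracWidth):
    fraction = fracValue / 2 ** fracWidth        # interpret the bits as 0.fraction
    return (-1) ** sign * (1 + fraction) * 2 ** (storedExp - bias)

# The single-precision pattern 1 10000010 10110100000000000000000
# (see the worked example in section 8.1) decodes to -13.625:
print(decodeNormal(1, 0b10000010, 0b10110100000000000000000, 127, 23))  # -13.625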

3. IEEE 754 Standard (the format most commonly examined at A‑Level)

| Field | Single Precision (32‑bit) | Double Precision (64‑bit) |
| --- | --- | --- |
| Sign bit | 1 | 1 |
| Exponent bits | 8 | 11 |
| Fraction bits | 23 | 52 |
| Bias | 127 | 1023 |
| Approximate decimal range | ±3.4 × 10³⁸ | ±1.8 × 10³⁰⁸ |
| Precision (significant decimal digits) | ≈ 7 | ≈ 15–16 |
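In Python, the stored bit pattern of a float can be inspected with the standard struct module; this is a convenient way to check hand conversions:

import struct

# Pack -13.625 as a 32-bit single, then reread the four bytes as an unsigned integer
bits = struct.unpack('>I', struct.pack('>f', -13.625))[0]
print(format(bits, '032b'))   # 11000001010110100000000000000000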

3.1 Bias and Exponent Calculation (AO2)

For a normalised number the stored exponent is

$\text{stored exponent} = \text{actual exponent} + \text{bias}$

Examples:

  • Single precision – 13.625 → normalised $1.101101_2 \times 2^{3}$

    Actual exponent = 3, stored exponent = $3+127=130=10000010_2$.

  • Double precision – −0.15625 → normalised $1.01_2 \times 2^{-3}$

    Actual exponent = −3, stored exponent = $-3+1023=1020=01111111100_2$.
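Both stored exponents can be verified quickly in Python; format() with a zero‑padded width prints the unsigned bit pattern:

print(format(3 + 127, '08b'))      # 10000010
print(format(-3 + 1023, '011b'))   # 01111111100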

4. Sub‑normal (Denormal) Numbers

  • When the exponent field is all zeros and the fraction is non‑zero, the implicit leading 1 is replaced by 0, and the value is $(-1)^{\text{sign}} \times 0.\text{fraction} \times 2^{1-\text{bias}}$.

  • They fill the gap between the smallest normalised positive number and zero, preventing a sudden “underflow jump” to zero.
  • This behaviour is known as gradual underflow: as values shrink towards zero, precision is lost gradually rather than all at once, as the short demonstration below shows.
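Gradual underflow is easy to observe in Python, whose float type is an IEEE 754 double:

import sys

print(sys.float_info.min)   # smallest normal double: 2.2250738585072014e-308
print(2 ** -1074)           # smallest sub-normal double: 5e-324
print(2 ** -1074 / 2)       # halving again underflows to 0.0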

5. Special Values (AO1)

| Exponent pattern | Fraction pattern | Value |
| --- | --- | --- |
| All 0s | All 0s | ±0 (signed zero) |
| All 0s | Non‑zero | Sub‑normal numbers (see above) |
| All 1s | All 0s | ±∞ (positive/negative infinity) |
| All 1s | Non‑zero | NaN (Not‑a‑Number; signalling or quiet) |
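A short Python session shows these special values behaving as the table describes:

import math

big = 1e308 * 10              # exponent too large for a double
print(big)                    # inf
print(big - big)              # nan (infinity minus infinity is undefined)
print(math.nan == math.nan)   # False – NaN compares unequal to everything, itself included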

6. Rounding Modes (AO1)

When a value cannot be represented exactly, IEEE 754 defines four rounding directions. The default in most languages is round‑to‑nearest, ties‑to‑even.

| Mode | Description |
| --- | --- |
| Round‑to‑nearest, ties‑to‑even | Choose the nearest representable value; if exactly halfway, pick the one whose least‑significant bit is 0 (even). |
| Round‑toward‑zero | Discard the extra bits (truncation, also called “chop”). |
| Round‑toward‑+∞ | Always round in the direction of +∞ (ceiling); for negative values this coincides with truncation. |
| Round‑toward‑−∞ | Always round in the direction of −∞ (floor); for positive values this coincides with truncation. |

Most processors expose rounding‑mode control bits that allow a program to select any of the four modes; for A‑Level it is enough to know the default behaviour.
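Python's built‑in round() applies an analogous ties‑to‑even rule at the decimal level, which often surprises students who expect halves always to round up (0.5, 1.5 and 2.5 are exactly representable in binary, so the rule is visible directly):

print(round(0.5), round(1.5), round(2.5))   # 0 2 2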

7. Converting a Decimal Number to IEEE 754 (AO2)

The following Python function works for both single and double precision. Key steps are numbered in comments; sub‑normal results and exponent overflow are beyond its scope.

def toIEEE754(value, precision):
    # precision = "single" or "double"
    if precision == "single":
        bias, expBits, fracBits = 127, 8, 23
    else:
        bias, expBits, fracBits = 1023, 11, 52

    sign = 0 if value >= 0 else 1
    if value == 0:
        return str(sign) + "0" * expBits + "0" * fracBits
    absVal = abs(value)

    # 1. Convert the integer part to binary
    intPart = int(absVal)
    intBin = format(intPart, "b") if intPart > 0 else "0"   # e.g. 13 -> "1101"

    # 2. Convert the fractional part to binary by repeated doubling
    frac = absVal - intPart
    fracBin = ""
    maxFractionBits = 2 * fracBits + 8   # guard bits for normalising and rounding
    while frac > 0 and len(fracBin) < maxFractionBits:
        frac *= 2
        bit = int(frac)
        fracBin += str(bit)
        frac -= bit

    # 3. Normalise to 1.mantissa × 2^shift
    if intBin != "0":                    # |value| >= 1, exponent >= 0
        shift = len(intBin) - 1
        mantissa = intBin[1:] + fracBin
    else:                                # |value| < 1, negative exponent
        firstOne = fracBin.find("1")
        shift = -(firstOne + 1)
        mantissa = fracBin[firstOne + 1:]

    # 4. Biased exponent field, written as an unsigned bit pattern
    storedExp = shift + bias
    expField = format(storedExp, f"0{expBits}b")

    # 5. Fraction field: truncate, then round to nearest, ties to even
    mantissa = mantissa.ljust(fracBits + 1, "0")   # pad so indexing is safe
    fracField = mantissa[:fracBits]
    nextBit = mantissa[fracBits]
    sticky = "1" in mantissa[fracBits + 1:]
    # Round up when the discarded part is more than 1/2 ULP, or exactly
    # 1/2 ULP and the retained least-significant bit is 1 (ties-to-even)
    if nextBit == "1" and (sticky or fracField[-1] == "1"):
        rounded = int(fracField, 2) + 1
        if rounded == 2 ** fracBits:     # carry out: 1.111... rounds to 10.000...
            rounded = 0
            expField = format(storedExp + 1, f"0{expBits}b")
        fracField = format(rounded, f"0{fracBits}b")

    return str(sign) + expField + fracField
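A quick sanity check against the worked examples in section 8 (output shown without the grouping spaces used there):

print(toIEEE754(-13.625, "single"))
# 11000001010110100000000000000000
print(toIEEE754(0.15625, "double"))
# 0011111111000100000000000000000000000000000000000000000000000000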

8. Worked Examples

8.1 Single‑Precision Conversion (–13.625)

  1. Binary form: $13 = 1101_2$, $0.625 = 0.101_2$ → $1101.101_2$.
  2. Normalise: $1.101101_2 \times 2^{3}$.
  3. Sign = 1.
  4. Exponent = $3+127=130=10000010_2$.
  5. Fraction = bits after the leading 1 → 10110100000000000000000 (23 bits).
  6. IEEE‑754 word: 1 10000010 10110100000000000000000.

8.2 Double‑Precision Conversion (+0.15625)

  1. Binary form: $0.00101_2$.
  2. Normalise: $1.01_2 \times 2^{-3}$.
  3. Sign = 0.
  4. Exponent = $-3+1023=1020=01111111100_2$.
  5. Fraction = 0100000000000000000000000000000000000000000000000000 (52 bits).
  6. IEEE‑754 word: 0 01111111100 0100000000000000000000000000000000000000000000000000.
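The same 64‑bit word can be confirmed with the struct module, packing the value as a double and rereading it as an unsigned integer:

import struct

bits = struct.unpack('>Q', struct.pack('>d', 0.15625))[0]
print(format(bits, '064b'))
# 0011111111000100000000000000000000000000000000000000000000000000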

8.3 Large Exponent Example (Overflow)

Convert $3.4 \times 10^{38}$ to single‑precision.

  • In binary this is approximately $1.11111111111111111111111_2 \times 2^{127}$.
  • Actual exponent = 127, stored exponent = $127+127=254=11111110_2$ (the maximum normal exponent).
  • Fraction is all 1s. The next exponent value (255) is reserved for infinities and NaN, so any larger magnitude would overflow to $+\infty$.
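The boundary can be probed in Python; note that struct refuses to pack anything beyond the largest finite single:

import struct

maxSingle = (2 - 2 ** -23) * 2 ** 127   # largest finite single-precision value
print(maxSingle)                        # 3.4028234663852886e+38
print(struct.unpack('>f', struct.pack('>f', maxSingle))[0])   # round-trips intact
# struct.pack('>f', maxSingle * 2) would raise OverflowError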

9. Rounding Errors in Binary Representation (AO1)

Many decimal fractions have infinite binary expansions. The classic example is 0.1:

$0.1_{10}=0.0001100110011\ldots_2$

Only a finite number of bits can be stored, so the value is rounded. In double‑precision the stored approximation is

$0.1_{10} \approx 0.10000000000000000555\ldots$
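Python's decimal module can display the exact value of the stored double, confirming the approximation above:

from decimal import Decimal

print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625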

9.1 Adding 0.1 Ten Times – Failure Case

total = 0.0
for i in range(10):
    total = total + 0.1
print(total)   # Python output

Mathematical expectation: 1.0

Actual output (IEEE 754 double): 0.9999999999999999

Each addition introduces a tiny error; after ten iterations the error becomes visible.

9.2 Corrected Version – Rounding After Accumulation

total = 0.0

for i in range(10):

total += 0.1

total = round(total, 2) # round to two decimal places

print(total) # → 1.0

Rounding eliminates the accumulated error when the required precision is known (e.g., monetary values).

10. Common Pitfalls Checklist (AO3)

  • Never test two floating‑point numbers for exact equality (==) unless both values were produced by exactly the same sequence of operations.
  • Beware of hidden overflow/underflow when the exponent exceeds its representable range.
  • Remember that sub‑normal numbers have reduced precision – they are not a cure for “very small” values.
  • When printing results, format or round to the required number of decimal places; otherwise the display may reveal the rounding error.
  • If a program changes the rounding mode (rare at A‑Level), the default “round‑to‑nearest, ties‑to‑even” no longer applies.

11. Consequences of Rounding (AO3)

  • Loss of precision – the stored value is only an approximation.
  • Accumulation – repeated operations can magnify the error.
  • Comparison pitfalls – direct equality may fail even when numbers are mathematically equal.
  • Algorithmic impact – numerical methods (e.g., solving equations, integration) may diverge or give inaccurate results if rounding is ignored.
  • Overflow / Underflow – exponent too large → ±∞; exponent too small → sub‑normal or zero.

12. Mitigation Strategies (AO3)

  1. Use integer arithmetic where possible (e.g., store money in cents).
  2. Round explicitly after each operation to the required number of decimal places (e.g., round(x, 2)).
  3. Choose algorithms that minimise the number of floating‑point operations; for large sums consider Kahan compensated summation (see the sketch after this list).
  4. When testing equality, compare the absolute difference with a tolerance ε:

    abs(a - b) < epsilon

  5. Check for special values using language‑provided functions (e.g., Python's math.isnan() and math.isinf()).
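A minimal sketch of Kahan compensated summation, referenced in item 3 above (the function name kahanSum is ours): it keeps a running correction term so that the accumulated error no longer grows with the number of additions.

def kahanSum(values):
    total = 0.0
    c = 0.0                      # compensation for lost low-order bits
    for v in values:
        y = v - c                # apply the correction carried from the last step
        t = total + y
        c = (t - total) - y      # recover what this addition just lost
        total = t
    return total

print(kahanSum([0.1] * 10) == 1.0)   # True here, where the naive sum gives False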

13. Assessment‑Focused Tasks (AO2 & AO3)

Task 1 – Summation with Rounding

Write a function sumRounded(values) that takes a list of floating‑point numbers, returns the sum rounded to two decimal places, and includes a comment explaining why the rounding step is required.

def sumRounded(values):
    total = 0.0
    for v in values:
        total += v          # accumulation may introduce tiny errors
    return round(total, 2)  # rounding removes the error for monetary output
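For example, with the list from section 9:

print(sumRounded([0.1] * 10))   # 1.0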

Task 2 – Demonstrate the Error and Fix It

Using the list [0.1] * 10, show the result of a naïve sum and then the corrected version that passes a tolerance test.

# Naïve sum – fails the equality test
s = sum([0.1] * 10)
print(s == 1.0)                 # False

# Corrected version – tolerance based
epsilon = 1e-12
print(abs(s - 1.0) < epsilon)   # True

14. Summary Box (Quick Revision)

  • Floating‑point numbers consist of sign, biased exponent, and fraction fields.
  • Bias (127 for single, 1023 for double) lets both positive and negative powers of two be stored as unsigned bits.
  • Special patterns represent ±0, sub‑normals, ±∞ and NaN.
  • Most decimal fractions (e.g., 0.1) have infinite binary expansions → they are rounded to the nearest representable value.
  • Rounding errors are tiny individually but can accumulate, cause equality failures, and lead to overflow/underflow.
  • Mitigation: use integers when possible, round explicitly, employ tolerant comparisons, and choose numerically stable algorithms.

15. Cross‑Reference to Other Syllabus Units (AO3)

  • Unit 10 – Data Types & Structures: floating‑point numbers implement the real data type.
  • Unit 12 – Programming: loops, functions and built‑in rounding functions are used in the examples.
  • Unit 9 – Algorithm Design: error accumulation must be considered when designing numerical algorithms such as integration or simulation.
  • Unit 12.3 – Testing & Evaluation: tolerance‑based comparisons form part of systematic testing of programs that use real numbers.