Show understanding that binary representations can give rise to rounding errors


Cambridge A-Level Computer Science 9618 – Floating‑point Numbers

13.3 Floating‑point Numbers: Representation and Manipulation

Learning Objective

Understand how binary floating‑point representations are constructed and why they can introduce rounding errors in calculations.

1. Why Floating‑point?

Integer representations can store only whole numbers. Real‑world data (e.g., measurements, scientific constants) often require fractional values, so computers use a floating‑point format that can represent a very wide range of magnitudes.

2. Binary Normalised Form

A binary floating‑point number is stored in the form

\$(-1)^{\text{sign}} \times 1.\text{fraction} \times 2^{\text{exponent}-\text{bias}}\$

Key points:

  • The leading digit before the binary point is always 1 (except for sub‑normal numbers); this is called the implicit leading 1.
  • The fraction (also called the mantissa or significand) stores the bits after the binary point.
  • The exponent is stored with a bias so that both positive and negative exponents can be represented using unsigned bit patterns (the sketch after this list unpacks these fields from a real value).
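
To make the layout concrete, here is a minimal sketch in Python (standard library only; the field widths assume the IEEE 754 double‑precision format described in the next section) that unpacks a float into its three fields:

```python
import struct

def decode_double(x: float) -> None:
    """Split a 64-bit IEEE 754 double into sign, biased exponent and fraction."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]   # raw 64-bit pattern
    sign = bits >> 63                     # 1 sign bit
    exponent = (bits >> 52) & 0x7FF       # 11 exponent bits, stored with bias 1023
    fraction = bits & ((1 << 52) - 1)     # 52 fraction bits (implicit leading 1 omitted)
    print(f"sign={sign}, exponent={exponent} (unbiased {exponent - 1023}), "
          f"fraction={fraction:052b}")

decode_double(-13.625)
# sign=1, exponent=1026 (unbiased 3), fraction=1011010000...0
```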

3. IEEE 754 Standard (most common in A‑Level)

The two formats normally encountered are:

| Field | Single Precision (32‑bit) | Double Precision (64‑bit) |
| --- | --- | --- |
| Sign bit | 1 | 1 |
| Exponent bits | 8 | 11 |
| Fraction (mantissa) bits | 23 | 52 |
| Bias | 127 | 1023 |
| Approximate decimal range | \$\pm 3.4\times10^{38}\$ | \$\pm 1.8\times10^{308}\$ |
| Precision (significant decimal digits) | ≈ 7 | ≈ 15–16 |
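
Python's built‑in floats use the double‑precision format, and the standard sys module exposes the corresponding parameters, so the table above can be checked directly:

```python
import sys

print(sys.float_info.mant_dig)  # 53 significand bits: 52 stored + 1 implicit
print(sys.float_info.max)       # about 1.7976931348623157e+308
print(sys.float_info.dig)       # 15 reliable significant decimal digits
```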

4. Example: Converting a Decimal to IEEE 754 Single Precision

Convert \$-13.625\$ to 32‑bit floating‑point.

  1. Write the absolute value in binary:

    • 13 = \$1101_2\$
    • 0.625 = \$0.101_2\$ (because \$0.5 + 0.125 = 0.625\$)
    • Thus \$13.625 = 1101.101_2\$

  2. Normalise: \$1101.101_2 = 1.101101_2 \times 2^{3}\$
  3. Sign bit = 1 (negative).
  4. Exponent = \$3 + 127 = 130 = 10000010_2\$.
  5. Fraction = bits after the leading 1: 10110100000000000000000 (padded to 23 bits).
  6. Combine: 1 10000010 10110100000000000000000 (verified by the sketch below).
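
The hand calculation can be checked in Python (standard library only) by packing the value as a single‑precision float and printing the resulting 32‑bit pattern:

```python
import struct

# Pack -13.625 into 4 bytes of single precision, then reinterpret
# those bytes as an unsigned 32-bit integer to expose the raw bits.
bits = struct.unpack('>I', struct.pack('>f', -13.625))[0]
print(f"{bits:032b}")
# 11000001010110100000000000000000
# = 1 | 10000010 | 10110100000000000000000  (sign | exponent | fraction)
```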

5. Rounding Errors in Binary Representation

Many decimal fractions cannot be expressed exactly in binary. The most common example is \$0.1\$.

In binary:

\$0.1_{10} = 0.0001100110011\ldots_2\$

The pattern \$0011\$ repeats indefinitely, so a finite number of bits must be truncated or rounded, introducing a small error.
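
The stored approximation can be inspected directly: converting the float 0.1 to a Decimal reveals the exact value of the nearest representable double (standard library only):

```python
from decimal import Decimal

# Decimal(float) preserves the exact binary value the float stores,
# rather than the rounded decimal string normally displayed.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```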

6. Consequences of Rounding

  • Loss of precision: The stored value is only an approximation of the true decimal value.
  • Accumulation: Repeated arithmetic (e.g., summing many numbers) can magnify the tiny errors into a noticeable discrepancy.
  • Comparison pitfalls: Direct equality tests on floating‑point numbers may fail even when the numbers appear mathematically equal (demonstrated in the snippet below).
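
The classic demonstration of the comparison pitfall in Python:

```python
# 0.1, 0.2 and 0.3 are each stored as approximations, and the
# rounding errors do not cancel exactly.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```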

7. Illustrative Example – Adding 0.1 Ten Times

Consider the following Python code:

```python
total = 0.0
for i in range(10):
    total = total + 0.1
print(total)
```

Expected result: \$1.0\$.

Actual result (using IEEE 754 double precision): \$0.9999999999999999\$.

The tiny error from each \$0.1\$ addition accumulates, producing a result that is just shy of \$1\$.
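
As an aside, Python's standard math.fsum performs error‑compensated summation and recovers the exact expected value here (a preview of strategy 3 in the next section):

```python
import math

print(sum([0.1] * 10))        # 0.9999999999999999  (naive repeated addition)
print(math.fsum([0.1] * 10))  # 1.0  (tracks and corrects each rounding error)
```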

8. Strategies to Mitigate Rounding Errors

  1. Use integer arithmetic where possible (e.g., store monetary values in cents).
  2. Apply rounding explicitly after each operation, using a defined precision (e.g., round(x, 2)).
  3. Prefer algorithms that minimise the number of operations on floating‑point values.
  4. When comparing, test whether the absolute difference is smaller than a tolerance \$\epsilon\$ rather than using == (see the sketch after this list).
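
A short sketch of strategies 1 and 4 in Python; the tolerance of \$10^{-9}\$ is an arbitrary illustrative choice, and the standard math.isclose provides a ready‑made alternative:

```python
import math

# Strategy 1: keep money as integer cents, so arithmetic is exact.
price_cents = 1999               # $19.99 stored as 1999 cents
total_cents = 3 * price_cents
print(total_cents / 100)         # 59.97 -- converted only for display

# Strategy 4: compare with a tolerance instead of ==.
a = 0.1 + 0.2
print(abs(a - 0.3) < 1e-9)       # True
print(math.isclose(a, 0.3))      # True (default relative tolerance 1e-9)
```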

9. Summary

  • Floating‑point numbers store a sign, a biased exponent, and a fraction.
  • Binary cannot represent many decimal fractions exactly, leading to rounding.
  • Rounding errors are small individually but can accumulate, affecting equality tests and numerical results.
  • Understanding the representation helps programmers write more reliable numerical code.

Suggested diagram: layout of the IEEE 754 single‑precision word showing sign, exponent, and fraction fields.