Published by Patrick Mutisya · 14 days ago

Cambridge A-Level Computer Science 9618 – Floating‑point Numbers

13.3 Floating‑point Numbers – Representation and Manipulation

Learning Objective

Show understanding of the consequences of a binary representation only being an approximation to the real number it represents (in certain cases).

1. Why Binary Floating‑point is Needed

Integer types can store only whole numbers. Most scientific and engineering calculations require numbers with fractional parts and a very large dynamic range. Binary floating‑point provides a compact way to represent a wide range of real numbers using a fixed number of bits.

2. IEEE 754 Standard Overview

The most widely used binary floating‑point format is defined by the IEEE 754 standard. It specifies three basic components:

  • Sign bit (s) – 0 for positive, 1 for negative.
  • Exponent field (e) – stores a biased exponent.
  • Fraction (mantissa) field (f) – stores the significant digits of the number.

The value represented (ignoring special cases such as NaN and infinities) is

\$(-1)^{s}\times 1.f \times 2^{\,e-B}\$

where B is the bias (e.g., 127 for single precision, 1023 for double precision).
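The three fields can be inspected directly in Java. The sketch below (illustrative only; the value −6.25 and the class name are arbitrary choices) extracts the sign, biased exponent, and fraction bits of a `float` and reassembles the value from the formula above:

```java
public class FloatFields {
    public static void main(String[] args) {
        float x = -6.25f;                      // example value
        int bits = Float.floatToIntBits(x);    // raw 32-bit IEEE 754 pattern

        int sign     = (bits >>> 31) & 0x1;    // 1 sign bit
        int exponent = (bits >>> 23) & 0xFF;   // 8 biased exponent bits
        int fraction = bits & 0x7FFFFF;        // 23 fraction bits

        // Reconstruct (-1)^s * 1.f * 2^(e - 127); valid only for
        // normalised numbers (exponent not 0 and not 255).
        double value = Math.pow(-1, sign)
                     * (1.0 + fraction / (double) (1 << 23))
                     * Math.pow(2, exponent - 127);

        System.out.println(sign + " " + exponent + " " + fraction);
        System.out.println(value);             // prints -6.25
    }
}
```

For −6.25 = −1.1001₂ × 2², the biased exponent is 2 + 127 = 129 and the fraction field holds the bits after the hidden leading 1.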

3. Single vs Double Precision

| Precision | Total bits | Sign bits | Exponent bits | Fraction bits | Bias | Approx. decimal precision |
|---|---|---|---|---|---|---|
| Single (`float`) | 32 | 1 | 8 | 23 | 127 | ≈ 7 decimal digits |
| Double (`double`) | 64 | 1 | 11 | 52 | 1023 | ≈ 15 decimal digits |

4. Converting a Decimal Fraction to Binary

Conversion is performed by repeatedly multiplying the fractional part by 2 and recording the integer part of each product.

Example: Convert \$0.1_{10}\$ to binary.

  1. 0.1 × 2 = 0.2 → integer 0
  2. 0.2 × 2 = 0.4 → integer 0
  3. 0.4 × 2 = 0.8 → integer 0
  4. 0.8 × 2 = 1.6 → integer 1 (remainder 0.6)
  5. 0.6 × 2 = 1.2 → integer 1 (remainder 0.2)
  6. … and the pattern 0011 repeats forever.

Thus \$0.1_{10}=0.0001100110011\ldots_{2}\$ – a non‑terminating binary fraction.
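The repeated multiply-by-2 procedure can be sketched in Java (the helper name `toBinaryFraction` is my own; the loop is capped at a fixed number of bits because the expansion never terminates):

```java
public class FractionToBinary {
    // Convert the fractional part of a decimal number to binary by
    // repeatedly doubling and recording the integer part of each product.
    static String toBinaryFraction(double frac, int maxBits) {
        StringBuilder sb = new StringBuilder("0.");
        for (int i = 0; i < maxBits && frac > 0; i++) {
            frac *= 2;
            if (frac >= 1) { sb.append('1'); frac -= 1; }
            else           { sb.append('0'); }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinaryFraction(0.1, 20));
        // prints 0.00011001100110011001 — the 0011 pattern repeats
        System.out.println(toBinaryFraction(0.5, 20));
        // prints 0.1 — a terminating fraction stops early
    }
}
```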

5. Approximation Errors

Because the mantissa has a limited number of bits, the binary fraction must be rounded to fit.

For a 32‑bit single‑precision float, \$0.1\$ is stored as

\$0.100000001490116119384765625_{10}\$

which differs from the true value by about \$1.49\times10^{-9}\$.
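This stored value can be displayed exactly in Java: the `new BigDecimal(double)` constructor converts the binary value without any decimal rounding (unlike `println`, which prints a shortened decimal string):

```java
import java.math.BigDecimal;

public class StoredValue {
    public static void main(String[] args) {
        // The float 0.1f, widened exactly to double, then printed in full:
        System.out.println(new BigDecimal(0.1f));
        // prints 0.100000001490116119384765625

        // The double 0.1 is closer to 0.1, but still not exact:
        System.out.println(new BigDecimal(0.1));
        // prints 0.1000000000000000055511151231257827021181583404541015625
    }
}
```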

6. Consequences in Computations

6.1 Loss of Equality

Two mathematically equal expressions may produce different binary results. (The example uses double; in Java, 0.1f + 0.2f happens to round to exactly 0.3f, so float would not show the effect here.)

```java
double a = 0.1 + 0.2;   // stored as 0.30000000000000004…
double b = 0.3;         // stored as 0.2999999999999999889…
if (a == b) { }         // the condition is false
```

Therefore direct comparison of floating‑point numbers using == is unsafe.

6.2 Accumulation of Rounding Error

When many operations are performed, the small rounding errors can add up.

Example: summing \$0.1\$ ten times.

```java
double sum = 0;
for (int i = 0; i < 10; i++) sum += 0.1;
System.out.println(sum);   // prints 0.9999999999999999
```

The result is slightly less than the expected \$1.0\$.

6.3 Overflow and Underflow

  • Overflow: result magnitude exceeds the largest representable finite value → \$+\infty\$ or \$-\infty\$.
  • Underflow: result magnitude is smaller than the smallest normalised value → subnormal numbers or \$0\$.

Both cases can silently change program behaviour if not checked.
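Both effects can be provoked deliberately in Java (a small sketch; the variable names are arbitrary):

```java
public class OverflowUnderflow {
    public static void main(String[] args) {
        float big = Float.MAX_VALUE;        // ≈ 3.4028235e38, largest finite float
        System.out.println(big * 2);        // prints Infinity (overflow)

        float tiny = Float.MIN_NORMAL;      // ≈ 1.17549435e-38, smallest normalised float
        System.out.println(tiny / 2);       // a subnormal number, still > 0
        System.out.println(tiny / 1e30f);   // prints 0.0 — underflow all the way to zero

        // Overflow can be detected explicitly:
        System.out.println(Float.isInfinite(big * 2));   // prints true
    }
}
```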

6.4 Loss of Significance (Catastrophic Cancellation)

When subtracting two nearly equal numbers, the leading digits cancel, leaving only the less accurate low‑order bits.

Example:

\$x = 1.234567 \times 10^{8},\quad y = 1.234566 \times 10^{8}\$

The exact difference is \$100\$, but in single precision the stored values are \$123456704\$ and \$123456600\$, so the computed difference is \$104\$ – a 4 % relative error carried entirely by the inaccurate low‑order bits. With one more significant digit, both values can round to the same mantissa, giving a difference of \$0\$.
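The subtraction can be tried directly in Java. For these particular values the stored single-precision floats differ by 104 rather than 100, because the low-order decimal digits are already inexact before the subtraction happens (a minimal sketch):

```java
public class Cancellation {
    public static void main(String[] args) {
        float x = 1.234567e8f;    // stored as 123456704
        float y = 1.234566e8f;    // stored as 123456600 (exactly representable)
        System.out.println(x - y);    // prints 104.0, not 100.0

        // Double keeps enough digits for these inputs:
        double xd = 1.234567e8, yd = 1.234566e8;
        System.out.println(xd - yd);  // prints 100.0
    }
}
```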

7. Mitigation Strategies

  • Use a tolerance when comparing floating‑point values, e.g. Math.abs(a - b) < EPS.
  • Prefer double precision for calculations that require higher accuracy.
  • Re‑order arithmetic to minimise subtraction of nearly equal numbers.
  • When exact decimal representation is required (e.g., monetary values), use integer arithmetic (cents) or arbitrary‑precision decimal classes.
  • Check for overflow/underflow using language‑specific functions (e.g., Double.isFinite() and Double.isInfinite() in Java).

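The first and fourth strategies can be sketched together (the helper name `nearlyEqual` and the tolerance value are my own illustrative choices; note that monetary values use the String constructor of `BigDecimal`, which is exact, not the double constructor):

```java
import java.math.BigDecimal;

public class SafeCompare {
    static final double EPS = 1e-9;   // tolerance chosen for this example

    // Compare with a tolerance instead of ==.
    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) < EPS;
    }

    public static void main(String[] args) {
        double a = 0.1 + 0.2;
        System.out.println(a == 0.3);            // prints false
        System.out.println(nearlyEqual(a, 0.3)); // prints true

        // Monetary values: exact decimal arithmetic with BigDecimal.
        BigDecimal price = new BigDecimal("0.10").add(new BigDecimal("0.20"));
        System.out.println(price);               // prints 0.30 — exact
    }
}
```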
8. Summary Table – Common Pitfalls

| Pitfall | Cause | Typical symptom | Remedy |
|---|---|---|---|
| Equality test fails | Exact binary representations differ | if (a == b) evaluates false | Use epsilon tolerance |
| Unexpected large error after many additions | Accumulated rounding error | Sum deviates from mathematical expectation | Accumulate in higher precision; use Kahan summation |
| Result becomes infinity | Overflow of exponent field | Printed value is “Infinity” | Check magnitude before operation; scale inputs |
| Zero result from subtraction of close numbers | Catastrophic cancellation | Loss of significant digits | Re‑arrange formula; use higher precision |
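Kahan (compensated) summation, mentioned as a remedy above, tracks the rounding error of each addition in a separate variable and feeds it back into the next step. A sketch (the helper name `kahanSum` is my own):

```java
public class KahanSum {
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double c = 0.0;                  // accumulates the lost low-order bits
        for (double v : values) {
            double y = v - c;            // apply the correction from last step
            double t = sum + y;          // low-order bits of y may be lost here
            c = (t - sum) - y;           // algebraically 0; in FP it recovers the loss
            sum = t;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] tenths = new double[10];
        java.util.Arrays.fill(tenths, 0.1);

        double naive = 0;
        for (double v : tenths) naive += v;

        System.out.println(naive);            // prints 0.9999999999999999
        System.out.println(kahanSum(tenths)); // prints 1.0
    }
}
```

The compensated loop must not be "simplified" algebraically: an optimiser that assumed exact arithmetic would reduce `c` to zero and destroy the correction.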

Suggested diagram: layout of sign, exponent and mantissa bits for single‑ and double‑precision IEEE 754 numbers.

9. Key Take‑aways

  1. Binary floating‑point stores an approximation; many decimal fractions cannot be represented exactly.
  2. Rounding errors are inevitable and can affect equality tests, accumulation, and subtraction of similar values.
  3. Understanding the format (sign, exponent, mantissa) helps predict when overflow, underflow, or loss of significance may occur.
  4. Adopt safe programming practices: use tolerances, higher precision, and algorithmic techniques that minimise error propagation.