
13.3 Floating‑point Numbers – Representation and Manipulation

Learning outcomes (as stated in the Cambridge AS & A‑Level syllabus)

  • Describe the format of binary floating‑point real numbers (sign, exponent, mantissa, bias).
  • Convert between binary floating‑point patterns and decimal values (including normalised and sub‑normal forms).
  • Normalise a binary floating‑point number and explain the role of the implicit leading 1.
  • Explain the consequences of the fact that many real numbers can only be stored as approximations.
  • Identify and avoid common pitfalls such as loss of equality, accumulation of rounding error, overflow/underflow and catastrophic cancellation.

1. Why binary floating‑point is needed

  • Integer types store only whole numbers.
  • Scientific, engineering and graphics calculations often require:

    • Fractional values (e.g. 0.001, π).
    • A very large dynamic range (from ≈10⁻³⁸ to ≈10³⁸ in single precision).

  • Binary floating‑point provides a compact, hardware‑friendly way to represent a wide range of real numbers using a fixed number of bits.

2. IEEE 754 binary floating‑point format

The IEEE 754 standard defines a floating‑point word as three fields:

| Field | Purpose | Typical size (single) |
|---|---|---|
| Sign (s) | 0 = positive, 1 = negative | 1 bit |
| Exponent (e) | Stores a biased exponent | 8 bits |
| Fraction / mantissa (f) | Bits after the binary point (the leading 1 is implicit for normalised numbers) | 23 bits |

For a normalised number (ignoring the special cases NaN, ±∞, subnormals, signed zero) the value is

\$(-1)^{s}\times 1.f \times 2^{\,e-B}\$

where B is the bias:

\$B = 2^{k-1}-1\qquad(k = \text{number of exponent bits})\$

Thus:

  • single precision (k = 8) → B = 127
  • double precision (k = 11) → B = 1023
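
These fields can be inspected directly: Java's Float.floatToIntBits returns the raw IEEE 754 bit pattern of a float. A minimal sketch (the value 6.5f anticipates the worked example in section 4):

int bits = Float.floatToIntBits(6.5f);    // raw 32-bit pattern
int sign = (bits >>> 31) & 0x1;           // 1-bit sign
int exponent = (bits >>> 23) & 0xFF;      // 8-bit biased exponent field
int fraction = bits & 0x7FFFFF;           // 23-bit fraction field
System.out.println(sign);                             // 0
System.out.println(exponent - 127);                   // 2 (unbiased exponent)
System.out.println(Integer.toBinaryString(fraction)); // 10100000000000000000000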

2.1 Normalised numbers

  • Form: 1.xxx…₂ × 2^e.
  • The leading 1 is not stored (implicit), giving one extra bit of precision.
  • Exponent field ≠ 0 and ≠ all 1’s.

2.2 Sub‑normal (denormal) numbers

  • Exponent field = 0, fraction ≠ 0.
  • Value: \$(-1)^{s}\times 0.f \times 2^{1-B}\$ (no implicit leading 1).
  • Used to fill the gap between the smallest normalised value and zero, preserving gradual underflow.
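
Java exposes these boundary values as named constants, so gradual underflow can be observed directly; a quick sketch:

System.out.println(Float.MIN_NORMAL);    // 1.17549435E-38 – smallest normalised float
System.out.println(Float.MIN_VALUE);     // 1.4E-45 – smallest sub-normal float
System.out.println(Float.MIN_VALUE / 2); // 0.0 – rounds past the sub-normal range to zero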

2.3 Special values

  • Signed zero: exponent = 0, fraction = 0, sign bit distinguishes +0 and –0.
  • Infinity: exponent = all 1’s, fraction = 0. Sign bit gives +∞ or –∞.
  • NaN (Not‑a‑Number): exponent = all 1’s, fraction ≠ 0. Used for undefined results (e.g. 0/0).
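
All three kinds of special value arise from ordinary double arithmetic, as this short sketch shows:

System.out.println(1.0 / 0.0);    // Infinity
System.out.println(-1.0 / 0.0);   // -Infinity
System.out.println(0.0 / 0.0);    // NaN
System.out.println(-0.0 == 0.0);  // true – == treats the two zeros as equal
System.out.println(1.0 / -0.0);   // -Infinity – yet the sign of zero is observable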

2.4 Bit‑layout diagrams

Single precision (32 bits)

| s | e (8) | f (23) |
|---|-------|--------|
| 0 | 10000001 (129) | 10100000000000000000000 |

Implicit leading 1 → mantissa = 1.101000…₂.

Double precision (64 bits)

| s | e (11) | f (52) |
|---|--------|--------|
| 0 | 10000000001 (1025) | 1010000000000000000000000000000000000000000000000000 |

3. Single vs double precision – quick reference

| Property | Single (float) | Double (double) |
|---|---|---|
| Total bits | 32 | 64 |
| Sign bits | 1 | 1 |
| Exponent bits | 8 | 11 |
| Fraction bits | 23 | 52 |
| Bias | 127 | 1023 |
| Approx. decimal precision | ≈ 7 significant digits | ≈ 15 significant digits |
| Largest finite value | ≈ 3.4 × 10³⁸ | ≈ 1.8 × 10³⁰⁸ |
| Smallest normalised positive | ≈ 1.2 × 10⁻³⁸ | ≈ 2.2 × 10⁻³⁰⁸ |
| Smallest positive (sub‑normal) | ≈ 1.4 × 10⁻⁴⁵ | ≈ 5.0 × 10⁻³²⁴ |

4. Converting between binary floating‑point and decimal

4.1 Example – decode the 32‑bit pattern 0 10000001 10100000000000000000000

  1. Sign: s = 0 → positive.
  2. Exponent field: e = 10000001₂ = 129.
  3. Actual exponent: e − B = 129 − 127 = 2.
  4. Fraction: f = 101000…₂ → mantissa = 1.101000…₂.
  5. Convert mantissa: 1 + ½ + 0 + ⅛ = 1.625.
  6. Apply exponent: 1.625 × 2² = 6.5.

Result: 6.5 (decimal).
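
The decoding can be verified programmatically: Float.intBitsToFloat interprets a 32-bit int as an IEEE 754 pattern. A sketch using Java 7 binary literals, with underscores marking the field boundaries:

int pattern = 0b0_10000001_10100000000000000000000; // s | e | f
System.out.println(Float.intBitsToFloat(pattern));  // 6.5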

4.2 Example – encode the decimal value 6.5

  1. Binary form: 6.5₁₀ = 110.1₂.
  2. Normalise: 1.101₂ × 2².
  3. Sign = 0.
  4. Exponent = 2 + bias = 2 + 127 = 129 → 10000001₂.
  5. Fraction = bits after the leading 1 → 101000… (pad with zeros to 23 bits).
  6. 32‑bit word: 0 10000001 10100000000000000000000.
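
Going the other way, Float.floatToIntBits confirms the hand-worked encoding (sketch):

int bits = Float.floatToIntBits(6.5f);
System.out.println(Integer.toBinaryString(bits));
// 1000000110100000000000000000000 – 31 digits; the leading sign bit 0 is suppressed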

4.3 Converting a non‑terminating decimal fraction – 0.1

Repeatedly multiply the fractional part by 2:

0.1 × 2 = 0.2 → 0
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1 (remainder 0.6)
0.6 × 2 = 1.2 → 1 (remainder 0.2)
0.2 × 2 = 0.4 → 0
… pattern 0011 repeats forever …

Thus 0.1₁₀ = 0.0001100110011…₂ – an infinite binary fraction. It must be rounded to fit the mantissa.
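
The multiply-by-2 procedure is easy to automate. A sketch using BigDecimal (java.math.BigDecimal) so that the fractional part stays exact, since a plain double would itself round 0.1:

BigDecimal frac = new BigDecimal("0.1");
StringBuilder bits = new StringBuilder("0.");
for (int i = 0; i < 16; i++) {             // first 16 binary digits
    frac = frac.multiply(BigDecimal.valueOf(2));
    if (frac.compareTo(BigDecimal.ONE) >= 0) {
        bits.append('1');                  // integer part 1: emit bit, subtract it
        frac = frac.subtract(BigDecimal.ONE);
    } else {
        bits.append('0');                  // integer part 0: emit bit
    }
}
System.out.println(bits);                  // 0.0001100110011001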

4.4 Approximation of 0.1 in single precision

The nearest representable 32‑bit value is

\$0.100000001490116119384765625_{10}\$

Absolute error ≈ 1.49 × 10⁻⁹.
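
One way to see this exact value is to construct a BigDecimal from the float; the BigDecimal(double) constructor preserves the stored binary value digit for digit (sketch; requires java.math.BigDecimal):

System.out.println(new BigDecimal((double) 0.1f));
// prints 0.100000001490116119384765625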

5. Consequences of approximation in computations

5.1 Loss of equality

double a = 0.1 + 0.2;   // stored as ≈ 0.30000000000000004
double b = 0.3;         // stored as ≈ 0.29999999999999998
if (a == b) …           // false – the binary values differ

Direct comparison with == is unsafe. (Double literals are used here deliberately: in single precision, 0.1f + 0.2f happens to round to exactly the same value as 0.3f, which only underlines how unpredictable exact equality is.)

5.2 Accumulation of rounding error

double sum = 0;
for (int i = 0; i < 10; i++) sum += 0.1;
System.out.println(sum); // 0.9999999999999999

Each addition introduces a tiny error; the total drifts from the mathematically exact result.

5.3 Overflow and underflow

  • Overflow: the required exponent exceeds the largest the field can hold → the result becomes +Infinity or -Infinity.
  • Underflow: the magnitude falls below the smallest normalised value → the result becomes a sub‑normal number or zero. Both effects are demonstrated in the sketch below.
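
A minimal sketch of both effects in Java:

System.out.println(Float.MAX_VALUE * 2f);     // Infinity – overflow
System.out.println(Float.MIN_NORMAL / 2f);    // 5.877472E-39 – a sub-normal result
System.out.println(Float.MIN_NORMAL / 1e10f); // 0.0 – complete underflow to zero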

5.4 Catastrophic cancellation (loss of significance)

When two nearly equal numbers are subtracted, the leading digits cancel and only the less‑accurate low‑order bits remain.

\$x = 1.234567\times10^{8},\; y = 1.234566\times10^{8}\$

The exact difference is 100, but single precision keeps only about 7 significant decimal digits, so the stored values of x and y already carry rounding errors comparable to the difference itself. Subtraction cancels the agreeing leading digits and leaves only the error‑laden low‑order bits, so the relative error of the computed difference can be enormous; in extreme cases two close values round to the same bit pattern and the computed difference is exactly 0.
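
Under the single-precision assumption above, the damage is easy to reproduce (the exact output depends on how each literal rounds):

float x = 1.234567e8f;     // stored as 1.23456704E8
float y = 1.234566e8f;     // stored exactly as 1.234566E8
System.out.println(x - y); // 104.0 – a 4 % error on the true difference of 100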

6. Mitigation strategies

  • Use a tolerance (epsilon) when testing equality: Math.abs(a - b) < EPS.
  • Prefer double precision for calculations that require higher accuracy.
  • Re‑order arithmetic to avoid subtracting nearly equal values.
  • For exact decimal quantities (e.g., money) use integer arithmetic (cents) or arbitrary‑precision decimal classes such as BigDecimal.
  • Detect overflow/underflow with language‑specific checks (e.g., Double.isFinite() in Java).
  • When summing many numbers, apply compensated summation (the Kahan algorithm, sketched below) or pairwise summation to reduce error accumulation.
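
A minimal sketch of the Kahan algorithm mentioned in the last point:

// Kahan (compensated) summation: tracks the low-order bits lost at each step
static double kahanSum(double[] values) {
    double sum = 0.0;
    double c = 0.0;                // running compensation
    for (double v : values) {
        double y = v - c;          // re-inject previously lost low-order bits
        double t = sum + y;        // low-order bits of y may be lost here
        c = (t - sum) - y;         // recover exactly what was lost
        sum = t;
    }
    return sum;
}

Summing ten copies of 0.1 with kahanSum typically returns exactly 1.0, where the naive loop in section 5.2 printed 0.9999999999999999.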

7. Connection to user‑defined data types (Syllabus 13.1)

Floating‑point values can be stored inside structures or classes, reinforcing the idea of a *user‑defined* type that groups related data.

// C‑style struct
struct Measurement {
    int id;        // identifier
    float value;   // single‑precision reading
};

// Java class
class Measurement {
    private final int id;
    private final double value;   // double‑precision reading

    public Measurement(int id, double value) {
        this.id = id;
        this.value = value;
    }
}

8. Storing floating‑point numbers in files (Syllabus 13.2)

Two common approaches:

  1. Text (human‑readable) format – numbers are written as decimal strings.

    • Portable across platforms.
    • May introduce additional rounding when the string is parsed.

  2. Binary format – the raw IEEE 754 bits are written directly.

    • Compact and loss‑less (provided the same format is used on read).
    • Endianness matters: the byte order must be the same on writer and reader, or the program must convert explicitly.

// Java – write a double in binary (big‑endian); requires import java.io.*
try (DataOutputStream out = new DataOutputStream(
        new FileOutputStream("data.bin"))) {
    out.writeDouble(3.141592653589793);
}

// Java – read it back
try (DataInputStream in = new DataInputStream(
        new FileInputStream("data.bin"))) {
    double pi = in.readDouble(); // pi == 3.141592653589793 – bit‑exact round trip
}
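
For comparison, a sketch of the text approach from option 1 (the file name is illustrative; Java guarantees that Double.toString produces a string that Double.parseDouble maps back to the same value):

// Java – write a double as a decimal string
try (PrintWriter out = new PrintWriter("data.txt")) {
    out.println(3.141592653589793);
}

// Java – read it back by parsing the string
try (BufferedReader in = new BufferedReader(new FileReader("data.txt"))) {
    double pi = Double.parseDouble(in.readLine());
}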

9. Binary‑coded decimal (BCD) – a contrast

BCD stores each decimal digit in a separate group of bits (usually 4). It can represent decimal fractions exactly (e.g., the fraction digits of 0.100 encode as 0001 0000 0000) but it uses many more bits and is slower for arithmetic. BCD is therefore used only where exact decimal representation is essential (e.g., financial calculators), not for general scientific computation.
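
A sketch of packed BCD encoding, one 4-bit nibble per digit (the helper name is illustrative):

// Pack a string of decimal digits into BCD nibbles
static byte[] toPackedBcd(String digits) {
    byte[] out = new byte[(digits.length() + 1) / 2];
    for (int i = 0; i < digits.length(); i++) {
        int d = digits.charAt(i) - '0';                  // digit value 0–9
        if ((i & 1) == 0) out[i / 2] = (byte) (d << 4);  // high nibble
        else out[i / 2] |= (byte) d;                     // low nibble
    }
    return out;
}
// toPackedBcd("100") → nibbles 0001 0000 0000 … – the digits of 0.100, stored exactly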

10. Common pitfalls – summary table

| Pitfall | Cause | Typical symptom | Remedy |
|---|---|---|---|
| Equality test fails | Exact binary representations differ | if (a == b) evaluates false even though mathematically a = b | Compare using a tolerance (epsilon) |
| Large error after many additions | Accumulated rounding error | Sum deviates from expected value | Accumulate in higher precision; use Kahan or pairwise summation |
| Result becomes Infinity or NaN | Exponent overflow, division by zero, or invalid operation | Program prints “Infinity”, “-Infinity” or “NaN” | Check operands before the operation; handle special values explicitly |
| Zero or wildly inaccurate result from subtracting close numbers | Catastrophic cancellation | Loss of significant digits; sometimes exact zero | Re‑arrange the formula; use higher precision or analytical alternatives |
| Unexpected sign of zero | Signed zero produced by underflow or subtraction | Algorithms that test the sign of a result behave oddly | Use Math.copySign or treat +0 and –0 as equal where appropriate |

11. Key take‑aways

  1. Binary floating‑point stores an approximation; many decimal fractions (e.g., 0.1) cannot be represented exactly.
  2. Rounding errors arise from the limited mantissa and can affect equality tests, accumulation of many operations, overflow/underflow, and subtraction of nearly equal numbers.
  3. Understanding the IEEE 754 layout (sign, exponent, mantissa, bias) lets you predict when and why these errors occur.
  4. Safe programming practices:

    • Use an epsilon when testing equality.
    • Prefer double precision or specialised libraries for high‑accuracy work.
    • Re‑order calculations to minimise cancellation.
    • Handle special values (NaN, ±∞, signed zero) explicitly.
    • When persisting data, choose the appropriate file format and be aware of endianness.

  5. For exact decimal arithmetic (e.g., monetary values) consider integer representations or arbitrary‑precision decimal types rather than binary floating‑point.