
13.3 Floating‑point Numbers – Representation and Manipulation

Learning outcomes (as stated in the Cambridge AS & A‑Level syllabus)

  • Describe the format of binary floating‑point real numbers (sign, exponent, mantissa, bias).
  • Convert between binary floating‑point patterns and decimal values (including normalised and sub‑normal forms).
  • Normalise a binary floating‑point number and explain the role of the implicit leading 1.
  • Explain the consequences of the fact that many real numbers can only be stored as approximations.
  • Identify and avoid common pitfalls such as loss of equality, accumulation of rounding error, overflow/underflow and catastrophic cancellation.

1. Why binary floating‑point is needed

  • Integer types store only whole numbers.
  • Scientific, engineering and graphics calculations often require:

    • Fractional values (e.g. 0.001, π).
    • A very large dynamic range (from ≈10⁻³⁸ to ≈10³⁸ in single precision).

  • Binary floating‑point provides a compact, hardware‑friendly way to represent a wide range of real numbers using a fixed number of bits.

2. IEEE 754 binary floating‑point format

The IEEE 754 standard defines a floating‑point word as three fields:

| Field | Purpose | Typical size (single) |
|---|---|---|
| Sign (s) | 0 = positive, 1 = negative | 1 bit |
| Exponent (e) | Stores a biased exponent | 8 bits |
| Fraction / mantissa (f) | Bits after the binary point (the leading 1 is implicit for normalised numbers) | 23 bits |

For a normalised number (ignoring the special cases NaN, ±∞, subnormals, signed zero) the value is

\$(-1)^{s}\times 1.f \times 2^{\,e-B}\$

where B is the bias:

\$B = 2^{k-1}-1\qquad(k = \text{number of exponent bits})\$

Thus:

  • single precision (k = 8) → B = 127
  • double precision (k = 11) → B = 1023
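
These fields can be inspected directly: Java's Float.floatToIntBits returns the raw IEEE 754 bit pattern of a float. A minimal sketch (the value 6.5f anticipates the worked example in section 4):

int bits = Float.floatToIntBits(6.5f);    // raw 32-bit pattern
int sign = (bits >>> 31) & 0x1;           // 1-bit sign
int exponent = (bits >>> 23) & 0xFF;      // 8-bit biased exponent field
int fraction = bits & 0x7FFFFF;           // 23-bit fraction field
System.out.println(sign);                             // 0
System.out.println(exponent - 127);                   // 2 (unbiased exponent)
System.out.println(Integer.toBinaryString(fraction)); // 10100000000000000000000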

2.1 Normalised numbers

  • Form: 1.xxx…₂ × 2^e.
  • The leading 1 is not stored (implicit), giving one extra bit of precision.
  • Exponent field ≠ 0 and ≠ all 1’s.

2.2 Sub‑normal (denormal) numbers

  • Exponent field = 0, fraction ≠ 0.
  • Value: \$(-1)^{s}\times 0.f \times 2^{1-B}\$ (no implicit leading 1).
  • Used to fill the gap between the smallest normalised value and zero, preserving gradual underflow.
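
Java exposes these boundary values as named constants, so gradual underflow can be observed directly; a quick sketch:

System.out.println(Float.MIN_NORMAL);    // 1.17549435E-38 – smallest normalised float
System.out.println(Float.MIN_VALUE);     // 1.4E-45 – smallest sub-normal float
System.out.println(Float.MIN_VALUE / 2); // 0.0 – rounds past the sub-normal range to zero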

2.3 Special values

  • Signed zero: exponent = 0, fraction = 0, sign bit distinguishes +0 and –0.
  • Infinity: exponent = all 1’s, fraction = 0. Sign bit gives +∞ or –∞.
  • NaN (Not‑a‑Number): exponent = all 1’s, fraction ≠ 0. Used for undefined results (e.g. 0/0).
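
All three kinds of special value arise from ordinary double arithmetic, as this short sketch shows:

System.out.println(1.0 / 0.0);    // Infinity
System.out.println(-1.0 / 0.0);   // -Infinity
System.out.println(0.0 / 0.0);    // NaN
System.out.println(-0.0 == 0.0);  // true – == treats the two zeros as equal
System.out.println(1.0 / -0.0);   // -Infinity – yet the sign of zero is observable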

2.4 Bit‑layout diagrams

Single precision (32 bits)

| s | e (8) | f (23) |
|---|-------|--------|
| 0 | 10000001 (129) | 10100000000000000000000 |

Implicit leading 1 → mantissa = 1.101000…₂.

Double precision (64 bits)

| s | e (11) | f (52) |
|---|--------|--------|
| 0 | 10000000001 (1025) | 1010000000000000000000000000000000000000000000000000 |

3. Single vs double precision – quick reference

| Property | Single (float) | Double (double) |
|---|---|---|
| Total bits | 32 | 64 |
| Sign bits | 1 | 1 |
| Exponent bits | 8 | 11 |
| Fraction bits | 23 | 52 |
| Bias | 127 | 1023 |
| Approx. decimal precision | ≈ 7 significant digits | ≈ 15 significant digits |
| Largest finite value | ≈ 3.4 × 10³⁸ | ≈ 1.8 × 10³⁰⁸ |
| Smallest normalised positive | ≈ 1.2 × 10⁻³⁸ | ≈ 2.2 × 10⁻³⁰⁸ |
| Smallest positive (sub‑normal) | ≈ 1.4 × 10⁻⁴⁵ | ≈ 5.0 × 10⁻³²⁴ |

4. Converting between binary floating‑point and decimal

4.1 Example – decode the 32‑bit pattern 0 10000001 10100000000000000000000

  1. Sign: s = 0 → positive.
  2. Exponent field: e = 10000001₂ = 129.
  3. Actual exponent: e − B = 129 − 127 = 2.
  4. Fraction: f = 101000…₂ → mantissa = 1.101000…₂.
  5. Convert mantissa: 1 + ½ + 0 + ⅛ = 1.625.
  6. Apply exponent: 1.625 × 2² = 6.5.

Result: 6.5 (decimal).
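
The decoding can be verified programmatically: Float.intBitsToFloat interprets a 32-bit int as an IEEE 754 pattern. A sketch using Java 7 binary literals, with underscores marking the field boundaries:

int pattern = 0b0_10000001_10100000000000000000000; // s | e | f
System.out.println(Float.intBitsToFloat(pattern));  // 6.5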

4.2 Example – encode the decimal value 6.5

  1. Binary form: 6.5₁₀ = 110.1₂.
  2. Normalise: 1.101₂ × 2².
  3. Sign = 0.
  4. Exponent = 2 + bias = 2 + 127 = 129 → 10000001₂.
  5. Fraction = bits after the leading 1 → 101000… (pad with zeros to 23 bits).
  6. 32‑bit word: 0 10000001 10100000000000000000000.
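
Going the other way, Float.floatToIntBits confirms the hand-worked encoding (sketch):

int bits = Float.floatToIntBits(6.5f);
System.out.println(Integer.toBinaryString(bits));
// 1000000110100000000000000000000 – 31 digits; the leading sign bit 0 is suppressed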

4.3 Converting a non‑terminating decimal fraction – 0.1

Repeatedly multiply the fractional part by 2:

0.1 × 2 = 0.2 → 0
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1 (remainder 0.6)
0.6 × 2 = 1.2 → 1 (remainder 0.2)
0.2 × 2 = 0.4 → 0
… pattern 0011 repeats forever …

Thus 0.1₁₀ = 0.0001100110011…₂ – an infinite binary fraction. It must be rounded to fit the mantissa.
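
The multiply-by-2 procedure is easy to automate. A sketch using BigDecimal (java.math.BigDecimal) so that the fractional part stays exact, since a plain double would itself round 0.1:

BigDecimal frac = new BigDecimal("0.1");
StringBuilder bits = new StringBuilder("0.");
for (int i = 0; i < 16; i++) {             // first 16 binary digits
    frac = frac.multiply(BigDecimal.valueOf(2));
    if (frac.compareTo(BigDecimal.ONE) >= 0) {
        bits.append('1');                  // integer part 1: emit bit, subtract it
        frac = frac.subtract(BigDecimal.ONE);
    } else {
        bits.append('0');                  // integer part 0: emit bit
    }
}
System.out.println(bits);                  // 0.0001100110011001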

4.4 Approximation of 0.1 in single precision

The nearest representable 32‑bit value is

\$0.100000001490116119384765625_{10}\$

Absolute error ≈ 1.49 × 10⁻⁹.
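
One way to see this exact value is to construct a BigDecimal from the float; the BigDecimal(double) constructor preserves the stored binary value digit for digit (sketch; requires java.math.BigDecimal):

System.out.println(new BigDecimal((double) 0.1f));
// prints 0.100000001490116119384765625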

5. Consequences of approximation in computations

5.1 Loss of equality

double a = 0.1 + 0.2;   // stored as ≈ 0.30000000000000004
double b = 0.3;         // stored as ≈ 0.29999999999999998
if (a == b) …           // false – the binary values differ

Direct comparison with == is unsafe. (Double literals are used here deliberately: in single precision, 0.1f + 0.2f happens to round to exactly the same value as 0.3f, which only underlines how unpredictable exact equality is.)

5.2 Accumulation of rounding error

double sum = 0;
for (int i = 0; i < 10; i++) sum += 0.1;
System.out.println(sum); // 0.9999999999999999

Each addition introduces a tiny error; the total drifts from the mathematically exact result.

5.3 Overflow and underflow

  • Overflow: the required exponent exceeds the largest the field can hold → the result becomes +Infinity or -Infinity.
  • Underflow: the magnitude falls below the smallest normalised value → the result becomes a sub‑normal number or zero. Both effects are demonstrated in the sketch below.
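
A minimal sketch of both effects in Java:

System.out.println(Float.MAX_VALUE * 2f);     // Infinity – overflow
System.out.println(Float.MIN_NORMAL / 2f);    // 5.877472E-39 – a sub-normal result
System.out.println(Float.MIN_NORMAL / 1e10f); // 0.0 – complete underflow to zero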

5.4 Catastrophic cancellation (loss of significance)

When two nearly equal numbers are subtracted, the leading digits cancel and only the less‑accurate low‑order bits remain.

\$x = 1.234567\times10^{8},\; y = 1.234566\times10^{8}\$

The exact difference is 100, but single precision keeps only about 7 significant decimal digits, so the stored values of x and y already carry rounding errors comparable to the difference itself. Subtraction cancels the agreeing leading digits and leaves only the error‑laden low‑order bits, so the relative error of the computed difference can be enormous; in extreme cases two close values round to the same bit pattern and the computed difference is exactly 0.
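
Under the single-precision assumption above, the damage is easy to reproduce (the exact output depends on how each literal rounds):

float x = 1.234567e8f;     // stored as 1.23456704E8
float y = 1.234566e8f;     // stored exactly as 1.234566E8
System.out.println(x - y); // 104.0 – a 4 % error on the true difference of 100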

6. Mitigation strategies

  • Use a tolerance (epsilon) when testing equality: Math.abs(a - b) < EPS.
  • Prefer double precision for calculations that require higher accuracy.
  • Re‑order arithmetic to avoid subtracting nearly equal values.
  • For exact decimal quantities (e.g., money) use integer arithmetic (cents) or arbitrary‑precision decimal classes such as BigDecimal.
  • Detect overflow/underflow with language‑specific checks (e.g., Double.isFinite() in Java).
  • When summing many numbers, apply compensated summation (the Kahan algorithm, sketched below) or pairwise summation to reduce error accumulation.
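
A minimal sketch of the Kahan algorithm mentioned in the last point:

// Kahan (compensated) summation: tracks the low-order bits lost at each step
static double kahanSum(double[] values) {
    double sum = 0.0;
    double c = 0.0;                // running compensation
    for (double v : values) {
        double y = v - c;          // re-inject previously lost low-order bits
        double t = sum + y;        // low-order bits of y may be lost here
        c = (t - sum) - y;         // recover exactly what was lost
        sum = t;
    }
    return sum;
}

Summing ten copies of 0.1 with kahanSum typically returns exactly 1.0, where the naive loop in section 5.2 printed 0.9999999999999999.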

7. Connection to user‑defined data types (Syllabus 13.1)

Floating‑point values can be stored inside structures or classes, reinforcing the idea of a *user‑defined* type that groups related data.

// C‑style struct
struct Measurement {
    int id;        // identifier
    float value;   // single‑precision reading
};

// Java class
class Measurement {
    private final int id;
    private final double value;   // double‑precision reading

    public Measurement(int id, double value) {
        this.id = id;
        this.value = value;
    }
}

8. Storing floating‑point numbers in files (Syllabus 13.2)

Two common approaches:

  1. Text (human‑readable) format – numbers are written as decimal strings.

    • Portable across platforms.
    • May introduce additional rounding when the string is parsed.

  2. Binary format – the raw IEEE 754 bits are written directly.

    • Compact and loss‑less (provided the same format is used on read).
    • Endianness matters: the byte order must be the same on writer and reader, or the program must convert explicitly.

// Java – write a double in binary (big‑endian); requires import java.io.*
try (DataOutputStream out = new DataOutputStream(
        new FileOutputStream("data.bin"))) {
    out.writeDouble(3.141592653589793);
}

// Java – read it back
try (DataInputStream in = new DataInputStream(
        new FileInputStream("data.bin"))) {
    double pi = in.readDouble(); // pi == 3.141592653589793 – bit‑exact round trip
}
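
For comparison, a sketch of the text approach from option 1 (the file name is illustrative; Java guarantees that Double.toString produces a string that Double.parseDouble maps back to the same value):

// Java – write a double as a decimal string
try (PrintWriter out = new PrintWriter("data.txt")) {
    out.println(3.141592653589793);
}

// Java – read it back by parsing the string
try (BufferedReader in = new BufferedReader(new FileReader("data.txt"))) {
    double pi = Double.parseDouble(in.readLine());
}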

9. Binary‑coded decimal (BCD) – a contrast

BCD stores each decimal digit in a separate group of bits (usually 4). It can represent decimal fractions exactly (e.g., the fraction digits of 0.100 encode as 0001 0000 0000) but it uses many more bits and is slower for arithmetic. BCD is therefore used only where exact decimal representation is essential (e.g., financial calculators), not for general scientific computation.
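
A sketch of packed BCD encoding, one 4-bit nibble per digit (the helper name is illustrative):

// Pack a string of decimal digits into BCD nibbles
static byte[] toPackedBcd(String digits) {
    byte[] out = new byte[(digits.length() + 1) / 2];
    for (int i = 0; i < digits.length(); i++) {
        int d = digits.charAt(i) - '0';                  // digit value 0–9
        if ((i & 1) == 0) out[i / 2] = (byte) (d << 4);  // high nibble
        else out[i / 2] |= (byte) d;                     // low nibble
    }
    return out;
}
// toPackedBcd("100") → nibbles 0001 0000 0000 … – the digits of 0.100, stored exactly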

10. Common pitfalls – summary table

| Pitfall | Cause | Typical symptom | Remedy |
|---|---|---|---|
| Equality test fails | Exact binary representations differ | if (a == b) evaluates false even though mathematically a = b | Compare using a tolerance (epsilon) |
| Large error after many additions | Accumulated rounding error | Sum deviates from expected value | Accumulate in higher precision; use Kahan or pairwise summation |
| Result becomes Infinity or NaN | Exponent overflow, division by zero, or invalid operation | Program prints “Infinity”, “-Infinity” or “NaN” | Check operands before the operation; handle special values explicitly |
| Zero or wildly inaccurate result from subtracting close numbers | Catastrophic cancellation | Loss of significant digits; sometimes exact zero | Re‑arrange the formula; use higher precision or analytical alternatives |
| Unexpected sign of zero | Signed zero produced by underflow or subtraction | Algorithms that test the sign of a result behave oddly | Use Math.copySign or treat +0 and –0 as equal where appropriate |

11. Key take‑aways

  1. Binary floating‑point stores an approximation; many decimal fractions (e.g., 0.1) cannot be represented exactly.
  2. Rounding errors arise from the limited mantissa and can affect equality tests, accumulation of many operations, overflow/underflow, and subtraction of nearly equal numbers.
  3. Understanding the IEEE 754 layout (sign, exponent, mantissa, bias) lets you predict when and why these errors occur.
  4. Safe programming practices:

    • Use an epsilon when testing equality.
    • Prefer double precision or specialised libraries for high‑accuracy work.
    • Re‑order calculations to minimise cancellation.
    • Handle special values (NaN, ±∞, signed zero) explicitly.
    • When persisting data, choose the appropriate file format and be aware of endianness.

  5. For exact decimal arithmetic (e.g., monetary values) consider integer representations or arbitrary‑precision decimal types rather than binary floating‑point.