The IEEE 754 standard defines a floating‑point word as three fields:
| Field | Purpose | Typical size (single) |
|---|---|---|
| Sign (s) | 0 = positive, 1 = negative | 1 bit |
| Exponent (e) | Stores a biased exponent | 8 bits |
| Fraction / mantissa (f) | Bits after the binary point (the leading 1 is implicit for normalised numbers) | 23 bits |
For a normalised number (ignoring the special cases NaN, ±∞, subnormals, signed zero) the value is
\$(-1)^{s}\times 1.f \times 2^{\,e-B}\$
where B is the bias:
\$B = 2^{k-1}-1\qquad(k = \text{number of exponent bits})\$
Thus every normalised value has the form 1.xxx…₂ × 2^(e−B).

Single precision (32 bits)
| s | e (8) | f (23) |
|---|-----------------|------------------------------|
| 0 | 10000001 (129) | 10100000000000000000000 |
Implicit leading 1 → mantissa = 1.101000…₂.
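The three fields of this example can be pulled out of the raw bit pattern in Java with the standard `Float.floatToRawIntBits` method (the class name here is just illustrative):

```java
public class FieldDemo {
    public static void main(String[] args) {
        int bits = Float.floatToRawIntBits(6.5f);
        int sign = bits >>> 31;              // 1 sign bit
        int exponent = (bits >>> 23) & 0xFF; // 8 biased exponent bits
        int fraction = bits & 0x7FFFFF;      // 23 fraction bits
        System.out.println(sign);            // 0
        System.out.println(exponent);        // 129  (unbiased: 129 - 127 = 2)
        System.out.println(Integer.toBinaryString(fraction));
        // 10100000000000000000000
    }
}
```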
Double precision (64 bits)
| s | e (11) | f (52) |
|---|--------------------------|-------------------------------------|
| 0 | 10000000001 (1025) | 1010000000000000000000000000000000000000000000000000 |
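Both bit patterns can be checked by assembling the bits and converting back with the standard `Float`/`Double` bit-conversion methods (a quick sketch; 0x401A000000000000 is the 64-bit pattern above written in hexadecimal):

```java
public class BitPatternCheck {
    public static void main(String[] args) {
        // 0 | 10000001 | 10100000000000000000000  ->  6.5f
        float f = Float.intBitsToFloat(0b0_10000001_10100000000000000000000);
        // 0 | 10000000001 | 1010…0 (52 bits)      ->  6.5
        double d = Double.longBitsToDouble(0x401A000000000000L);
        System.out.println(f); // 6.5
        System.out.println(d); // 6.5
    }
}
```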
| Property | Single (float) | Double (double) |
|---|---|---|
| Total bits | 32 | 64 |
| Sign bits | 1 | 1 |
| Exponent bits | 8 | 11 |
| Fraction bits | 23 | 52 |
| Bias | 127 | 1023 |
| Approx. decimal precision | ≈ 7 significant digits | ≈ 15 significant digits |
| Largest finite value | ≈ 3.4 × 10³⁸ | ≈ 1.8 × 10³⁰⁸ |
| Smallest normalised positive | ≈ 1.2 × 10⁻³⁸ | ≈ 2.2 × 10⁻³⁰⁸ |
| Smallest positive (sub‑normal) | ≈ 1.4 × 10⁻⁴⁵ | ≈ 5.0 × 10⁻³²⁴ |
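The limits in the table are exposed as named constants in Java:

```java
public class Limits {
    public static void main(String[] args) {
        System.out.println(Float.MAX_VALUE);   // 3.4028235E38
        System.out.println(Float.MIN_NORMAL);  // 1.17549435E-38
        System.out.println(Float.MIN_VALUE);   // 1.4E-45   (smallest subnormal)
        System.out.println(Double.MAX_VALUE);  // 1.7976931348623157E308
        System.out.println(Double.MIN_NORMAL); // 2.2250738585072014E-308
        System.out.println(Double.MIN_VALUE);  // 4.9E-324  (smallest subnormal)
    }
}
```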
0 10000001 10100000000000000000000
Result: 6.5 (decimal), since 1.101₂ × 2^(129−127) = 110.1₂ = 6.5.
To see why 0.1₁₀ cannot be represented exactly, repeatedly multiply the fractional part by 2:
0.1 × 2 = 0.2 → 0
0.2 × 2 = 0.4 → 0
0.4 × 2 = 0.8 → 0
0.8 × 2 = 1.6 → 1 (remainder 0.6)
0.6 × 2 = 1.2 → 1 (remainder 0.2)
0.2 × 2 = 0.4 → 0
… pattern 0011 repeats forever …
Thus 0.1₁₀ = 0.0001100110011…₂ – an infinite binary fraction. It must be rounded to fit the mantissa.
The nearest representable 32‑bit value is
\$0.100000001490116119384765625_{10}\$
Absolute error ≈ 1.5 × 10⁻⁹.
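The exact stored value can be printed with `BigDecimal`, whose `double` constructor takes the binary value as-is (widening 0.1f to double does not change the stored bits' value):

```java
import java.math.BigDecimal;

public class ExactValue {
    public static void main(String[] args) {
        // Print the exact decimal expansion of the float nearest to 0.1
        System.out.println(new BigDecimal((double) 0.1f));
        // prints 0.100000001490116119384765625
    }
}
```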
double a = 0.1 + 0.2; // 0.30000000000000004
double b = 0.3;       // ≈ 0.29999999999999999
if (a == b) …         // false – the binary values differ
Direct comparison with == is unsafe.
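A tolerance-based comparison avoids the problem. A minimal sketch (the names `EPS` and `nearlyEqual` are illustrative, and a suitable tolerance depends on the magnitudes involved):

```java
public class EpsilonCompare {
    static final double EPS = 1e-9; // illustrative tolerance

    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) < EPS; // compare within a tolerance, not with ==
    }

    public static void main(String[] args) {
        System.out.println(nearlyEqual(0.1 + 0.2, 0.3)); // true
    }
}
```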
double sum = 0;
for (int i = 0; i < 10; i++) sum += 0.1;
System.out.println(sum); // 0.9999999999999999
Each addition introduces a tiny error; the total drifts from the mathematically exact result.
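Kahan (compensated) summation reduces this drift by carrying the rounding error of each addition forward into the next one; a minimal sketch:

```java
public class KahanSum {
    public static void main(String[] args) {
        double naive = 0, sum = 0, c = 0; // c holds the lost low-order bits
        for (int i = 0; i < 10; i++) {
            naive += 0.1;       // plain accumulation drifts
            double y = 0.1 - c; // correct the next addend
            double t = sum + y; // big + small: low bits of y are lost here…
            c = (t - sum) - y;  // …recover them into the compensation term
            sum = t;
        }
        System.out.println(naive); // 0.9999999999999999
        System.out.println(sum);   // 1.0
    }
}
```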
An overflowing result becomes +Infinity or -Infinity. A different hazard is catastrophic cancellation: when two nearly equal numbers are subtracted, the leading digits cancel and only the less-accurate low-order bits remain.
\$x = 1.234567\times10^{8},\; y = 1.234566\times10^{8}\$
Exact difference = 100, but single precision keeps only ≈ 7 significant digits, so the digits that survive the subtraction are mostly rounding noise: the computed difference can be badly wrong, or even 0 if both values round to the same mantissa.
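The effect can be observed directly with the values above: near 1.2 × 10⁸ the spacing between consecutive floats is 8, so each input is rounded to a multiple of 8 before the subtraction ever happens.

```java
public class Cancellation {
    public static void main(String[] args) {
        float x = 1.234567e8f; // stored as 1.23456704E8 (nearest float)
        float y = 1.234566e8f; // 1.234566E8 happens to be exactly representable
        System.out.println(x - y); // 104.0 – the exact answer is 100
    }
}
```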
Remedies: compare with a tolerance (Math.abs(a - b) < EPS), use BigDecimal when exact decimal arithmetic is required, and test for special values (Double.isFinite() in Java).

Floating-point values can be stored inside structures or classes, reinforcing the idea of a *user-defined* type that groups related data.
// C-style struct
struct Measurement {
    int id;      // identifier
    float value; // single-precision reading
};

// Java class
class Measurement {
    private final int id;
    private final double value; // double-precision reading

    public Measurement(int id, double value) {
        this.id = id;
        this.value = value;
    }
}
Two common approaches are writing the value as human-readable text or as raw binary; the binary form below preserves the exact bit pattern:
// Java – write a double in binary (big-endian)
try (DataOutputStream out = new DataOutputStream(
        new FileOutputStream("data.bin"))) {
    out.writeDouble(3.141592653589793);
}

// Java – read it back
try (DataInputStream in = new DataInputStream(
        new FileInputStream("data.bin"))) {
    double pi = in.readDouble(); // pi == 3.141592653589793
}
BCD stores each decimal digit in a separate group of bits (usually 4). It can represent decimal fractions exactly (e.g., the digits of 0.100 encode as 0001 0000 0000) but uses many more bits and is slower for arithmetic. BCD is therefore used only where exact decimal representation is essential (e.g., financial calculators), not for general scientific computation.
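A sketch of packing an integer's decimal digits into 4-bit groups (`toBcd` is an illustrative helper, not a standard library method):

```java
public class Bcd {
    // Pack the decimal digits of n into 4-bit nibbles, least significant first
    static long toBcd(int n) {
        long bcd = 0;
        int shift = 0;
        while (n > 0) {
            bcd |= (long) (n % 10) << shift; // one decimal digit per nibble
            n /= 10;
            shift += 4;
        }
        return bcd;
    }

    public static void main(String[] args) {
        // Digits 6 and 5 become nibbles 0110 and 0101, i.e. hex 65
        System.out.println(Long.toHexString(toBcd(65))); // 65
    }
}
```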
| Pitfall | Cause | Typical symptom | Remedy |
|---|---|---|---|
| Equality test fails | Exact binary representations differ | if (a == b) evaluates false even though mathematically a = b | Compare using a tolerance (epsilon) |
| Large error after many additions | Accumulated rounding error | Sum deviates from expected value | Accumulate in higher precision; use Kahan or pairwise summation |
| Result becomes Infinity or NaN | Exponent overflow, division by zero, or invalid operation | Program prints “Infinity”, “-Infinity” or “NaN” | Check operands before the operation; handle special values explicitly |
| Zero or wildly inaccurate result from subtraction of close numbers | Catastrophic cancellation | Loss of significant digits; sometimes exact zero | Re‑arrange the formula; use higher precision or analytical alternatives |
| Unexpected sign of zero | Signed zero produced by underflow or subtraction | Algorithms that test the sign of a result behave oddly | Use Math.copySign or treat +0 and –0 as equal where appropriate |
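The special-value behaviour in the table can be demonstrated directly:

```java
public class SpecialValues {
    public static void main(String[] args) {
        System.out.println(1.0 / 0.0);   // Infinity (division by zero)
        System.out.println(0.0 / 0.0);   // NaN (invalid operation)
        System.out.println(0.0 == -0.0); // true  – == treats the zeros as equal
        System.out.println(1.0 / -0.0);  // -Infinity – but the sign bit still matters
        System.out.println(Math.copySign(5.0, -0.0));   // -5.0 – copySign sees the sign bit
        System.out.println(Double.isFinite(1.0 / 0.0)); // false
    }
}
```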