Published by Patrick Mutisya · 14 days ago
Show understanding of the consequences of a binary representation only being an approximation to the real number it represents (in certain cases).
Integer types can store only whole numbers. Most scientific and engineering calculations require numbers with fractional parts and a very large dynamic range. Binary floating‑point provides a compact way to represent a wide range of real numbers using a fixed number of bits.
The most widely used binary floating‑point format is defined by the IEEE 754 standard. It specifies three basic components: a sign bit \$s\$, a biased exponent \$e\$, and a fraction (mantissa) \$f\$.
The value represented (ignoring special cases such as NaN and infinities) is
\$(-1)^{s}\times 1.f \times 2^{\,e-B}\$
where B is the bias (e.g., 127 for single precision, 1023 for double precision).
| Precision | Total bits | Sign bits | Exponent bits | Fraction bits | Bias | Approx. decimal precision |
|---|---|---|---|---|---|---|
| Single (float) | 32 | 1 | 8 | 23 | 127 | ≈ 7 decimal digits |
| Double (double) | 64 | 1 | 11 | 52 | 1023 | ≈ 15 decimal digits |
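
To make the field layout concrete, the following Java sketch pulls the sign, biased exponent and fraction out of a single‑precision value and rebuilds the number from the formula above. The class and method names are illustrative assumptions, and the rebuild step only covers normal numbers (not zero, subnormals, infinities or NaN).

```java
class FloatFields {
    // Sign bit s: the top bit of the 32-bit pattern.
    static int sign(float x) {
        return Float.floatToIntBits(x) >>> 31;
    }

    // Biased exponent e: the next 8 bits.
    static int biasedExponent(float x) {
        return (Float.floatToIntBits(x) >>> 23) & 0xFF;
    }

    // Fraction f: the low 23 bits.
    static int fraction(float x) {
        return Float.floatToIntBits(x) & 0x7FFFFF;
    }

    // Rebuilds (-1)^s * 1.f * 2^(e - 127); valid for normal numbers only.
    static double rebuild(float x) {
        double significand = 1.0 + fraction(x) / (double) (1 << 23);
        double magnitude = Math.scalb(significand, biasedExponent(x) - 127);
        return sign(x) == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        float x = -6.25f; // -1.5625 * 2^2
        System.out.println(sign(x));           // 1
        System.out.println(biasedExponent(x)); // 129 (= 2 + 127)
        System.out.println(Integer.toBinaryString(fraction(x)));
        System.out.println(rebuild(x));        // -6.25
    }
}
```

The value -6.25 is used because it is exactly representable, so the rebuilt number matches the original with no rounding.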
A decimal fraction is converted to binary by repeatedly multiplying the fractional part by 2 and recording the integer part of each product as the next binary digit.
Example: Convert \$0.1_{10}\$ to binary.
Thus \$0.1_{10}=0.0001100110011\ldots_{2}\$ – a non‑terminating binary fraction.
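
The multiply‑by‑2 procedure can be sketched in Java as follows. To keep the demonstration itself free of rounding, the fraction is held as an integer numerator and denominator (1/10 for \$0.1\$); the class and method names are assumptions made for illustration.

```java
class DecimalFractionToBinary {
    // Returns the first `bits` binary digits of num/den, where 0 <= num < den.
    static String toBinaryFraction(long num, long den, int bits) {
        StringBuilder sb = new StringBuilder("0.");
        for (int i = 0; i < bits; i++) {
            num *= 2;              // multiply the fractional part by 2
            sb.append(num / den);  // integer part of the product is the next bit
            num %= den;            // keep only the new fractional part
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0.1*2=0.2 -> 0, 0.2*2=0.4 -> 0, 0.4*2=0.8 -> 0,
        // 0.8*2=1.6 -> 1, 0.6*2=1.2 -> 1, then 0.2 recurs and the pattern repeats.
        System.out.println(toBinaryFraction(1, 10, 13)); // 0.0001100110011
    }
}
```

Because the fractional part returns to 0.2 after four steps, the digits 0011 repeat forever and the expansion never terminates.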
Because the mantissa has a limited number of bits, the binary fraction must be rounded to fit.
For a 32‑bit single‑precision float, \$0.1\$ is stored as
\$0.100000001490116119384765625_{10}\$
which differs from the true value by about \$1.49\times10^{-9}\$.
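
The stored value can be inspected directly. In this Java sketch, `new BigDecimal(double)` is used because it expands the binary value exactly rather than printing a shortened form; the class and method names are illustrative.

```java
import java.math.BigDecimal;

class StoredValue {
    // Exact decimal expansion of a float (widening float -> double loses nothing).
    static BigDecimal exactValue(float x) {
        return new BigDecimal((double) x);
    }

    public static void main(String[] args) {
        BigDecimal stored = exactValue(0.1f);
        System.out.println(stored); // 0.100000001490116119384765625

        // Representation error relative to the decimal value 0.1
        BigDecimal error = stored.subtract(new BigDecimal("0.1"));
        System.out.println(error);
    }
}
```

Note that `new BigDecimal("0.1")` is built from a string so that it holds the decimal value exactly; `new BigDecimal(0.1)` would instead capture the double‑precision approximation.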
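Two mathematically equal expressions may produce different binary results.
double a = 0.1 + 0.2; // 0.30000000000000004
double b = 0.3; // 0.3
if (a == b) … // false
Whether a particular pair matches depends on the format: in single precision, 0.1f + 0.2f happens to round to the same value as 0.3f. Therefore direct comparison of floating‑point numbers using == is unsafe.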
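
One common alternative is to compare with a tolerance. The Java sketch below combines a relative and an absolute tolerance; the helper name `nearlyEqual` and the tolerance values are assumptions chosen for illustration and would need tuning for a real application.

```java
class ApproxEquals {
    // True if a and b differ by no more than absTol, or by no more than
    // relTol times the larger magnitude.
    static boolean nearlyEqual(double a, double b, double relTol, double absTol) {
        if (a == b) {
            return true; // exact match, including equal infinities
        }
        double diff = Math.abs(a - b);
        if (Double.isNaN(diff) || Double.isInfinite(diff)) {
            return false; // NaN or an infinity is involved
        }
        double scale = Math.max(Math.abs(a), Math.abs(b));
        return diff <= Math.max(absTol, relTol * scale);
    }

    public static void main(String[] args) {
        double a = 0.1 + 0.2;
        double b = 0.3;
        System.out.println(a == b);                          // false
        System.out.println(nearlyEqual(a, b, 1e-9, 1e-12));  // true
    }
}
```

The relative term handles large values, where a fixed epsilon would be too strict, and the absolute term handles values close to zero, where a relative test would be too strict.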
When many operations are performed, the small rounding errors can add up.
Example: summing \$0.1\$ ten times.
double sum = 0;
for (int i = 0; i < 10; i++) sum += 0.1;
System.out.println(sum); // prints 0.9999999999999999
The result is slightly less than the expected \$1.0\$.
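
Kahan (compensated) summation is one way to limit this accumulation. The sketch below is a minimal Java version under illustrative names; it carries a running correction term that recovers the low‑order bits lost in each addition.

```java
import java.util.Arrays;

class KahanSum {
    // Plain left-to-right summation.
    static double naiveSum(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    // Kahan summation: `compensation` holds the rounding error of the last addition.
    static double kahanSum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0;
        for (double v : values) {
            double y = v - compensation;   // apply the correction to the next term
            double t = sum + y;            // low-order bits of y may be lost here
            compensation = (t - sum) - y;  // recover what was lost (with opposite sign)
            sum = t;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] values = new double[1_000_000];
        Arrays.fill(values, 0.1);
        System.out.println(naiveSum(values));
        System.out.println(kahanSum(values));
    }
}
```

With a million terms the naive total drifts visibly away from 100000, while the compensated total stays within a few units in the last place.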
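Overflow (a result too large for the exponent field becomes infinity) and underflow (a result too small to represent becomes zero or a subnormal number) can both silently change program behaviour if not checked.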
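
A short Java sketch of how overflow and underflow pass without any exception, and one way to check for them. `Double.isFinite` is a standard library method; the helper `checkedMultiply` is a hypothetical name introduced here.

```java
class RangeChecks {
    // Multiplies and rejects results that overflowed to infinity or became NaN.
    static double checkedMultiply(double a, double b) {
        double result = a * b;
        if (!Double.isFinite(result)) {
            throw new ArithmeticException("non-finite result: " + result);
        }
        return result;
    }

    public static void main(String[] args) {
        double tooBig = Double.MAX_VALUE * 2;    // overflow: no exception is thrown
        double tooSmall = Double.MIN_VALUE / 2;  // underflow: rounds to zero
        System.out.println(tooBig);   // Infinity
        System.out.println(tooSmall); // 0.0

        try {
            checkedMultiply(1e200, 1e200);
        } catch (ArithmeticException ex) {
            System.out.println("caught: " + ex.getMessage());
        }
    }
}
```

Underflow to zero is not caught by `Double.isFinite`, since zero is a finite value; detecting it requires comparing the result with the inputs or scaling the computation.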
When subtracting two nearly equal numbers, the leading digits cancel, leaving only the less accurate low‑order bits.
Example:
\$x = 1.234567 \times 10^{8},\quad y = 1.234566 \times 10^{8}\$
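Exact difference is \$100\$, but near \$10^{8}\$ single precision can only represent multiples of 8, so \$x\$ is stored as 123456704 while \$y\$ (123456600) is stored exactly. The computed difference is \$104\$: a rounding error of about 3 parts in \$10^{8}\$ in \$x\$ has become a 4% error in the result.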
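
A short Java sketch to check what single precision actually does with these two values, compared with double precision. The class and method names are illustrative.

```java
class Cancellation {
    static float differenceInFloat(float x, float y) {
        return x - y;
    }

    static double differenceInDouble(double x, double y) {
        return x - y;
    }

    public static void main(String[] args) {
        // Gap between adjacent floats at this magnitude
        System.out.println(Math.ulp(1.234567e8f)); // 8.0

        // Single precision: x is rounded to a multiple of 8 before the subtraction
        System.out.println(differenceInFloat(1.234567e8f, 1.234566e8f)); // 104.0

        // Double precision: both values are stored exactly
        System.out.println(differenceInDouble(1.234567e8, 1.234566e8)); // 100.0
    }
}
```

The subtraction itself is exact in both cases; the error comes entirely from rounding the inputs, which cancellation then exposes.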
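Compare floating‑point values with a tolerance, for example Math.abs(a - b) < EPS, and check results for overflow before using them (for example with Double.isFinite()).

| Pitfall | Cause | Typical Symptom | Remedy |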
|---|---|---|---|
| Equality test fails | Exact binary representation differs | if (a == b) evaluates false | Use epsilon tolerance |
| Unexpected large error after many additions | Accumulated rounding error | Sum deviates from mathematical expectation | Accumulate in higher precision; use Kahan summation |
| Result becomes infinity | Overflow of exponent field | Printed value is “Infinity” | Check magnitude before operation; scale inputs |
| Zero result from subtraction of close numbers | Catastrophic cancellation | Loss of significant digits | Re‑arrange formula; use higher precision |