A normalised floating‑point value is stored as
$(-1)^{\text{sign}} \times 1.\text{fraction} \times 2^{\text{exponent}-\text{bias}}$
| Field | Single Precision (32‑bit) | Double Precision (64‑bit) |
|---|---|---|
| Sign bit | 1 | 1 |
| Exponent bits | 8 | 11 |
| Fraction bits | 23 | 52 |
| Bias | 127 | 1023 |
| Approximate decimal range | ± 3.4 × 10³⁸ | ± 1.8 × 10³⁰⁸ |
| Precision (significant decimal digits) | ≈ 7 | ≈ 15–16 |
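The field layout can be inspected directly. A minimal sketch using Python's standard `struct` module (13.625 is chosen because it is exactly representable):

```python
import struct

# Reinterpret the 4 bytes of a single-precision float as an unsigned integer
bits = struct.unpack(">I", struct.pack(">f", 13.625))[0]
pattern = format(bits, "032b")

# Split into the three fields: 1 sign bit, 8 exponent bits, 23 fraction bits
print(pattern[0], pattern[1:9], pattern[9:])
# 0 10000010 10110100000000000000000
```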
For a normalised number the stored exponent is
$\text{stored exponent} = \text{actual exponent} + \text{bias}$
Examples:
- Actual exponent = 3 (single precision): stored exponent = $3+127=130=10000010_2$.
- Actual exponent = $-3$ (double precision): stored exponent = $-3+1023=1020=01111111100_2$.
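A one-line sketch of this calculation (the helper name `storedExponent` is ours, purely for illustration):

```python
def storedExponent(actualExp, bias, expBits):
    # Adding the bias lets the exponent be stored as an unsigned field
    return format(actualExp + bias, "0{}b".format(expBits))

print(storedExponent(3, 127, 8))     # 10000010    (single precision)
print(storedExponent(-3, 1023, 11))  # 01111111100 (double precision)
```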
Sub-normal (denormalised) values, stored with an all-zero exponent field, drop the implicit leading 1 and are interpreted as
$(-1)^{\text{sign}} \times 0.\text{fraction} \times 2^{1-\text{bias}}$
| Exponent pattern | Fraction pattern | Value |
|---|---|---|
| All 0s | All 0s | ±0 (signed zero) |
| All 0s | Non‑zero | Sub‑normal numbers (see above) |
| All 1s | All 0s | ±∞ (positive/negative infinity) |
| All 1s | Non‑zero | NaN (Not‑a‑Number – signalling or quiet) |
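These special values behave as the standard requires in ordinary Python; a short illustration:

```python
import math
import sys

inf = float("inf")
nan = float("nan")
print(nan == nan)                         # False: NaN never compares equal, even to itself
print(math.isnan(nan), math.isinf(inf))  # True True
print(0.0 == -0.0)                        # True: the two signed zeros compare equal
print(sys.float_info.min)                 # 2.2250738585072014e-308, smallest normal double
print(5e-324)                             # smallest positive sub-normal double
```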
When a value cannot be represented exactly, IEEE 754 defines four rounding directions. The default in most languages is round‑to‑nearest, ties‑to‑even.
| Mode | Description |
|---|---|
| Round‑to‑nearest, ties‑to‑even | Choose the nearest representable value; if exactly halfway, pick the one with an even least‑significant bit. |
| Round‑toward‑zero | Truncate the extra bits (also called “chop”). |
| Round‑toward‑+∞ | Round upward (ceiling); for negative values this has the same effect as truncation. |
| Round‑toward‑−∞ | Round downward (floor); for positive values this has the same effect as truncation. |
Most processors expose rounding-mode control bits that allow a program to select any of the four modes; for A-Level it is enough to know the default behaviour.
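The default mode is easy to observe in Python, whose built-in `round()` applies the same ties-to-even rule:

```python
# Halfway cases go to the even neighbour rather than always upward
print(round(0.5), round(1.5), round(2.5))  # 0 2 2

# 0.25 is stored exactly, so it is a true tie and rounds to the even digit
print(round(0.25, 1))                      # 0.2
```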
The following function works for both single and double precision; it is written as runnable Python but reads as pseudocode, with the key steps highlighted in comments.

```python
def toIEEE754(value, precision):
    # precision = "single" or "double"
    # 1. Choose the bias and field sizes first: they are needed throughout
    if precision == "single":
        bias, expBits, fracBits = 127, 8, 23
    else:
        bias, expBits, fracBits = 1023, 11, 52

    if value == 0:
        return "0" * (1 + expBits + fracBits)  # +0 is all zero bits

    sign = "0" if value >= 0 else "1"
    absVal = abs(value)

    # 2. Convert the integer part to binary
    intPart = int(absVal)
    intBin = bin(intPart)[2:] if intPart > 0 else "0"  # e.g. 13 -> "1101"

    # 3. Convert the fractional part to binary by repeated doubling
    frac = absVal - intPart
    fracBin = ""
    maxFractionBits = 64  # enough for both precisions
    while len(fracBin) < maxFractionBits:
        frac *= 2
        bit = int(frac)
        fracBin += str(bit)
        frac -= bit
        if frac == 0:
            break

    # 4. Normalise to 1.mantissa x 2^shift
    if intBin != "0":  # |value| >= 1, exponent >= 0
        shift = len(intBin) - 1
        mantissa = intBin[1:] + fracBin
    else:              # |value| < 1: skip the leading zeros
        firstOne = fracBin.find("1")
        shift = -(firstOne + 1)
        mantissa = fracBin[firstOne + 1:]

    storedExp = shift + bias
    expField = format(storedExp, "0{}b".format(expBits))

    # 5. Fraction field: truncate, then apply round-to-nearest, ties-to-even
    mantissa = mantissa.ljust(fracBits + 1, "0")  # pad so indexing is safe
    fracField = mantissa[:fracBits]               # truncate
    nextBit = mantissa[fracBits]
    # Round up when the discarded part exceeds half a ULP, or equals exactly
    # half a ULP and the last kept bit is 1 (ties-to-even)
    roundUp = nextBit == "1" and ("1" in mantissa[fracBits + 1:] or fracField[-1] == "1")
    if roundUp:
        # Increment the fraction field; a carry out of the field (which would
        # bump the exponent) is ignored here for simplicity
        fracField = format(int(fracField, 2) + 1, "0{}b".format(fracBits))[-fracBits:]

    return sign + expField + fracField
```
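As a quick check, the sketch can be run against the worked examples that follow (the calls assume the Python version above):

```python
print(toIEEE754(-13.625, "single"))
# 11000001010110100000000000000000
print(toIEEE754(0.15625, "double"))
# 0011111111000100000000000000000000000000000000000000000000000000
```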
Worked examples (sign, exponent, and fraction fields separated for readability):

- $-13.625$ in single precision: `1 10000010 10110100000000000000000` (23-bit fraction field)
- $0.15625$ in double precision: `0 01111111100 0100000000000000000000000000000000000000000000000000` (52-bit fraction field)

Exercise: Convert $3.4 \times 10^{38}$ to single precision.
Many decimal fractions have infinite binary expansions. The classic example is 0.1:
$0.1_{10} = 0.0001100110011\ldots_2$
Only a finite number of bits can be stored, so the value is rounded. In double‑precision the stored approximation is
$0.1_{10} \approx 0.10000000000000000555\ldots$
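The exact stored value can be displayed with the standard `decimal` module, since `Decimal(float)` converts the binary value without further rounding:

```python
from decimal import Decimal

# The exact double-precision value nearest to 0.1
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```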
```python
total = 0.0
for i in range(10):
    total = total + 0.1
print(total)
```

Mathematical expectation: `1.0`
Actual output (IEEE 754 double): `0.9999999999999999`
Each addition introduces a tiny error; after ten iterations the error becomes visible.
```python
total = 0.0
for i in range(10):
    total += 0.1
total = round(total, 2)  # round to two decimal places
print(total)             # -> 1.0
```
Rounding eliminates the accumulated error when the required precision is known (e.g., monetary values).
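Where rounding to a fixed number of places is not appropriate, Python's `math.fsum` performs compensated summation and avoids the drift altogether:

```python
import math

print(sum([0.1] * 10))        # 0.9999999999999999
print(math.fsum([0.1] * 10))  # 1.0
```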
Practical guidelines:

- Never compare floating-point values for exact equality (`==`) unless you know they were produced by the same operation.
- Round to the precision actually required before output or storage (e.g. `round(x, 2)`).
- Use a tolerance test for comparisons: `abs(a - b) < epsilon`.
- Test for special values with the standard helpers (`isnan()`, `isinf()`).

Exercise: Write a function `sumRounded(values)` that takes a list of floating-point numbers, returns the sum rounded to two decimal places, and includes a comment explaining why the rounding step is required.
```python
def sumRounded(values):
    total = 0.0
    for v in values:
        total += v         # accumulation may introduce tiny errors
    return round(total, 2) # rounding removes the error for monetary output
```
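For example:

```python
print(sumRounded([0.1] * 10))  # 1.0
```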
Exercise: Using the list `[0.1] * 10`, show the result of a naïve sum and then the corrected version that passes a tolerance test.
```python
# Naïve sum fails the equality test
s = sum([0.1] * 10)
print(s == 1.0)                # False

# Corrected version: tolerance-based comparison
epsilon = 1e-12
print(abs(s - 1.0) < epsilon)  # True
```
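Since Python 3.5 the standard library also offers `math.isclose`, which applies a relative tolerance by default and saves choosing an epsilon by hand:

```python
import math

print(math.isclose(sum([0.1] * 10), 1.0))  # True
```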