13.3 Floating‑point Numbers – Representation and Manipulation
Learning Objective
Convert binary floating‑point real numbers into denary (decimal) and convert denary real numbers into binary floating‑point form.
Why Floating‑point?
Integer representations cannot store fractional values, and covering very large or very small magnitudes with them would need impractically many bits. Floating‑point notation represents a wide range of real numbers, to limited precision, using a fixed number of bits.
IEEE 754 Single‑Precision Format (32 bits)
The most common format examined at A‑Level is the 32‑bit single‑precision representation, which consists of three fields:
| Field | Bits | Purpose |
|---|---|---|
| Sign (s) | 1 | 0 for positive, 1 for negative |
| Exponent (e) | 8 | Stores the exponent with a bias of 127 |
| Mantissa (fraction) (f) | 23 | Stores the fractional part of the significand (the leading 1 is implicit) |
The value represented is given by the formula
\$V = (-1)^{s}\times 1.f \times 2^{\,e-127}\$
where \$1.f\$ denotes the binary number formed by an implicit leading 1 followed by the bits of the mantissa.
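As a rough illustration of the formula, the Python sketch below evaluates the value from the three fields given as integers. The function name `ieee754_value` is an illustrative assumption, and the sketch covers normalised numbers only: it rejects the special cases where the exponent field is all 0s or all 1s.

```python
def ieee754_value(s: int, e: int, f: int) -> float:
    """Evaluate (-1)^s * 1.f * 2^(e-127) for a normalised single-precision number.

    s: sign bit (0 or 1)
    e: biased exponent field as an integer (1..254)
    f: 23-bit mantissa field as an integer (0..2^23 - 1)
    """
    if s not in (0, 1) or not 0 <= f < 2**23:
        raise ValueError("field value out of range")
    if not 1 <= e <= 254:
        raise ValueError("this sketch handles normalised numbers only")
    significand = 1 + f / 2**23          # implicit leading 1 plus the fraction bits
    return (-1) ** s * significand * 2.0 ** (e - 127)


print(ieee754_value(0, 127, 0))          # 1.0
print(ieee754_value(1, 128, 1 << 22))    # -3.0  (1.1 in binary, times 2^1, negative)
```

Dividing `f` by 2^23 works because the 23 mantissa bits sit immediately after the binary point.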
Step‑by‑Step Conversion: Binary → Decimal
Identify the three fields (sign, exponent, mantissa) from the 32‑bit pattern.
Convert the exponent field from binary to decimal, then subtract the bias (127) to obtain the actual exponent \$E\$.
Form the binary significand by prefixing the mantissa with an implicit leading 1: \$1.f\$.
Convert \$1.f\$ to a decimal fraction.
Apply the formula \$V = (-1)^{s}\times (1.f) \times 2^{E}\$.
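The steps above can be sketched in Python as follows. This is a minimal illustration, not a full decoder: the function name `decode_single` and the sample bit pattern are assumptions chosen for the example, and zero, denormals, infinities and NaN are deliberately left out.

```python
import struct


def decode_single(bits: str) -> float:
    """Decode a 32-character string of 0s and 1s (normalised numbers only)."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    # Split the pattern into the sign, exponent and mantissa fields
    s, e_bits, f_bits = int(bits[0]), bits[1:9], bits[9:]
    e = int(e_bits, 2)
    if e in (0, 255):
        raise ValueError("special cases are outside this sketch")
    # Remove the bias to get the actual exponent
    E = e - 127
    # Build 1.f as a decimal value: mantissa bit i has weight 2^-(i+1)
    significand = 1 + sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(f_bits))
    # Apply the formula
    return (-1) ** s * significand * 2.0 ** E


pattern = "0" + "10000001" + "101" + "0" * 20    # sign | exponent | mantissa
value = decode_single(pattern)
print(value)                                     # 6.5

# Cross-check against Python's own reading of the same 32 bits
assert value == struct.unpack(">f", int(pattern, 2).to_bytes(4, "big"))[0]
```

Here the exponent field 10000001 is 129, so the actual exponent is 2, and the significand 1.101 in binary is 1.625, giving 1.625 × 4 = 6.5.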
Step‑by‑Step Conversion: Decimal → Binary
Determine the sign bit \$s\$ (0 for positive, 1 for negative).
Express the absolute value in binary scientific notation: \$N = 1.f \times 2^{E}\$ where \$1 \le 1.f < 2\$.
Calculate the biased exponent: \$e = E + 127\$ and write it as an 8‑bit binary number.
Take the fractional part \$f\$ (the bits after the binary point) and fill the mantissa field with its first 23 bits; if there are fewer than 23 bits, pad with zeros on the right.
Combine the three fields: s | e | f.
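A possible Python sketch of these encoding steps is shown below. The function name `encode_single` is an illustrative assumption. The sketch keeps the first 23 fraction bits exactly as the steps describe (truncation), whereas real hardware normally rounds to nearest, so results can differ in the last bit for values that are not exactly representable. Zero and values outside the normal exponent range are rejected.

```python
import math
from fractions import Fraction


def encode_single(x: float) -> str:
    """Return the fields 's | e | f' as bit strings for a non-zero, normal-range number."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("zero, infinities and NaN are outside this sketch")
    s = 1 if x < 0 else 0                # sign bit
    n = Fraction(abs(x))                 # exact arithmetic, no rounding surprises
    E = 0
    while n >= 2:                        # normalise so that 1 <= n < 2
        n /= 2
        E += 1
    while n < 1:
        n *= 2
        E -= 1
    if not -126 <= E <= 127:
        raise ValueError("exponent outside the normalised range")
    e = E + 127                          # biased exponent
    f = int((n - 1) * 2**23)             # first 23 fraction bits, truncated
    return f"{s} | {e:08b} | {f:023b}"


print(encode_single(6.5))   # 0 | 10000001 | 10100000000000000000000
```

For 6.5 the normalised form is 1.101 (binary) × 2^2, so the biased exponent is 129 and the mantissa field starts with 101, padded on the right with zeros.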
Example 2 – Decimal to Binary
Convert \$-0.15625\$ to IEEE 754 single‑precision.
Sign bit \$s\$: 1 (negative).
Convert \$0.15625\$ to binary:
0.15625 × 2 = 0.3125 → 0
0.3125 × 2 = 0.625 → 0
0.625 × 2 = 1.25 → 1 (subtract 1 → 0.25)
0.25 × 2 = 0.5 → 0
0.5 × 2 = 1.0 → 1 (terminates)
Result: \$0.00101_{2}\$.
Normalise: \$0.00101_{2}=1.01_{2}\times 2^{-3}\$, so \$E = -3\$.
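The repeated-doubling conversion and the normalisation used in this example can be checked with a short Python sketch. The function name `fraction_to_binary` and the `max_bits` cut-off are assumptions made for illustration; the cut-off is there because many decimal fractions (0.1, for instance) never terminate in binary.

```python
from fractions import Fraction


def fraction_to_binary(x, max_bits: int = 32) -> str:
    """Return the binary digits after the point for 0 <= x < 1, by repeated doubling."""
    x = Fraction(x)
    if not 0 <= x < 1:
        raise ValueError("expected a value in the range 0 <= x < 1")
    digits = []
    while x != 0 and len(digits) < max_bits:
        x *= 2
        if x >= 1:                       # integer part is 1: record it and subtract
            digits.append("1")
            x -= 1
        else:                            # integer part is 0
            digits.append("0")
    return "".join(digits)


digits = fraction_to_binary(Fraction("0.15625"))
print("0." + digits)                     # 0.00101

first_one = digits.index("1")            # position of the leading 1 after the point
E = -(first_one + 1)                     # the point moves right past that 1
print("1." + digits[first_one + 1:], E)  # 1.01 -3
```

Using `Fraction` keeps every doubling exact, so the digits match a hand calculation.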