13.3 Floating‑point Numbers – Representation and Manipulation
Objective
Describe the format of binary floating‑point real numbers as used in the IEEE 754 standard.
Why Floating‑point?
Floating‑point representation allows a very wide range of real numbers to be stored using a fixed number of bits. It does this by “floating” the decimal (binary) point, similar to scientific notation in base‑10.
Converting a Decimal Number to IEEE 754 Single Precision
Write the absolute value in binary.
Normalise it so that there is exactly one non‑zero digit to the left of the binary point.
Determine the exponent \$e = \text{(position of binary point after normalisation)} + 127\$.
Take the 23 bits to the right of the binary point as the fraction (pad with zeros if needed).
Set the sign bit (0 for positive, 1 for negative).
Combine the three fields into a 32‑bit pattern.
Example: Convert \$-13.625\$ to single precision
Step‑by‑step:
Binary of \$13.625\$ is \$1101.101\$.
Normalise: \$1.101101 \times 2^{3}\$.
Exponent field: \$3 + 127 = 130 = 10000010_2\$.
Fraction field: bits after the leading 1 → \$10110100000000000000000\$ (23 bits).
Sign bit: \$1\$ (negative).
Resulting 32‑bit pattern:
\$1\;10000010\;10110100000000000000000\$.
Common Pitfalls
For very large or very small numbers, the exponent may overflow (producing \$\infty\$) or underflow (producing a denormalised number or zero).
Rounding errors: the fraction field can hold only a limited number of bits, so many decimal fractions cannot be represented exactly.
Confusing the bias with the actual exponent; always subtract the bias when evaluating the value.
Suggested diagram: layout of the 32‑bit single‑precision word showing sign, exponent, and fraction fields, with example bit pattern for –13.625.
Summary
The IEEE 754 floating‑point format stores a real number in three parts: a sign bit, a biased exponent, and a fraction (mantissa). Normalised numbers use an implicit leading 1, while denormalised numbers allow representation of values closer to zero. Special exponent patterns encode infinities, NaN, and signed zero. Understanding this format is essential for analysing precision, range, and rounding behaviour in computer programs.