Floating Point Numbers

“Floats” or “Floating Point Numbers”, are approximations of decimal numbers, using binary notation.

In mathematics, we commonly write long or short decimal numbers in the form:

\[\pm x {\cdot }10^{\pm y}\]

But in all computers, this is automatically converted into in binary. Specifically, it is stored in the form:

\[\pm x {\cdot } 2^{\pm y}\]

Where:

𝒙 : The Significant (a.k.a. the “mantissa” pre-1950’s) is limited to 52 bits.

𝒚 : The Exponent is limited to 11 bits.

The bit representation is stored in the the following format called “IEEE_745” a.k.a. “float64”, “Binary64” or, “double-precision” (as it is double the length of the 32 bit representations used on older computers).

Float Representation in memory in "float64"

Note there is a further complication:

The Exponent is encoded using the “offset-binary” scheme. In “offset-binary”, the sign is implicit, and does not require an extra bit to be expressed.
The Significant is encoded literally, hence the need for the sign at the beginning of the expression.

“float64” (provides between 15 and 17 significant decimal digits of precision and is the standard used in python.

“float32” provides from 6.92 (for values 1 to 2) up to 7.22 decimal digits of precision (for 2²⁴, \(\log_{10}\left(2^{24} \right ) = 7.22\). Above which there is a fixed interval error of 1)

The limitations of binary float representation are :

When checking equality between seemingly equivalent values and finding False
When more that 15 significant decimal digits are needed.

Resources

Tom Scott