Floating-Point Numbers
Floating-point numbers are a system for representing real numbers in which the position of the radix point (the decimal point, in base 10) is encoded in the number itself.
IEEE 754
The IEEE Standard for Floating-Point Arithmetic (IEEE 754) describes a specific standard for representing numbers using a floating-point system in binary. The standard divides a sequence of bits into three fields, each of a fixed bit length: a sign bit, an exponent and a mantissa. The field widths depend on the format:
- binary32 (single precision, 32 bits) - the exponent is 8 bits, the mantissa is 23 bits and the exponent bias is 127.
- binary64 (double precision, 64 bits) - the exponent is 11 bits, the mantissa is 52 bits and the exponent bias is 1023.
- binary128 (quadruple precision, 128 bits) - the exponent is 15 bits and the mantissa is 112 bits.
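The value of a floating-point number is given by the following expression, where $s$ is the sign bit, $m$ is the mantissa interpreted as a fraction, $e$ is the stored exponent and $\text{bias}$ is the exponent bias of the format:

$$(-1)^{s} \times (1 + m) \times 2^{e - \text{bias}}$$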
The sign bit determines whether the number is negative or non-negative. If the sign bit is $1$, then the number is negative and if it is $0$, then the number is non-negative. The rest is very similar to scientific notation.
The mantissa is treated as a number between $0$ and $1$ by assigning the weights $2^{-1}$, $2^{-2}$, etc. to each of its bits, starting from the most significant one.
An implicit leading $1$ is added to the mantissa when performing calculations, yielding an effective range between $1$ and $2$. This allows for one extra bit of precision at virtually no cost.
The exponent is largely treated as a regular integer.
On its own, this only allows for non-negative exponents, but we also want to represent negative exponents. Therefore, when performing calculations, a bias is subtracted from the stored exponent: half of the greatest integer that can be represented with the chosen bit width for the exponent, rounded down. Essentially, if the stored exponent is less than the bias, then we take $2$ to a negative power. If it is greater than the bias, then we take $2$ to a positive power. If it is equal to the bias, then we have $2^0 = 1$.
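The decoding steps described above can be sketched in Python for the binary32 format. This is a minimal illustration rather than a full decoder: the function name `decode_binary32` is made up for this example, and it assumes a normal number, so the reserved exponent patterns (all zeros and all ones) are rejected.

```python
import struct

def decode_binary32(x):
    """Split a float (rounded to binary32) into its fields and rebuild its value."""
    # Reinterpret the 4 bytes of the binary32 encoding as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias
    fraction = bits & 0x7FFFFF        # 23 bits
    bias = 127                        # half of 255, rounded down
    # Assumption of this sketch: only normal numbers are handled.
    if exponent in (0, 255):
        raise ValueError("reserved exponent pattern: not handled in this sketch")
    mantissa = fraction / 2**23       # the 23 bits read as a fraction in [0, 1)
    # Add the implicit leading 1 and subtract the bias from the exponent.
    value = (-1) ** sign * (1 + mantissa) * 2.0 ** (exponent - bias)
    return sign, exponent, mantissa, value

print(decode_binary32(-6.25))  # (1, 129, 0.5625, -6.25)
```

Here $-6.25 = -1.5625 \times 2^{2}$, so the stored exponent is $2 + 127 = 129$ and the mantissa bits encode $0.5625$.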
The main advantage of this format is that it can represent both very large and very small numbers. The absolute precision, however, drops as the number moves farther away from zero, because the gap between adjacent representable values grows with the magnitude.
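The loss of precision away from zero can be observed directly. The sketch below assumes Python 3.9 or later, where `math.ulp` returns the gap between a binary64 value and the next representable value of larger magnitude; the variable names are illustrative only.

```python
import math

# Gap between each value and the next representable binary64 value.
spacing = {x: math.ulp(x) for x in (1.0, 1024.0, 1e15)}
for x, gap in spacing.items():
    print(x, gap)

# Beyond 2**53 the gap exceeds 1, so adding 1 no longer changes the value.
big = 2.0 ** 53
print(big + 1 == big)  # True
```

Near $1$ the gap is $2^{-52}$, while near $10^{15}$ it has already grown to $0.125$.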
Info: Reserved Values
A few special values are reserved for different purposes:
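| Sign Bit | Exponent | Mantissa | Value |
| --- | --- | --- | --- |
| 0 | all ones | all zeros | $+\infty$ |
| 1 | all ones | all zeros | $-\infty$ |
| 0 or 1 | all ones | non-zero | NaN (Not a Number) |

The values $\pm\infty$ are used to mean a number whose magnitude is too large to be represented by the IEEE 754 floating-point format with the chosen bit width. The value NaN indicates the result of an invalid operation such as dividing zero by zero.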
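As a rough illustration of how the reserved values show up in practice, the following Python sketch produces an infinity through overflow and a NaN through an invalid operation, then prints the binary32 bit patterns of the two infinities. The variable names are chosen for this example only.

```python
import math
import struct

inf = float("inf")
nan = float("nan")

overflow = 1e308 * 10   # too large for binary64, so the result is +infinity
invalid = inf - inf     # invalid operation, so the result is NaN

print(overflow == inf)        # True
print(math.isnan(invalid))    # True
print(nan == nan)             # False: NaN compares unequal even to itself

# binary32 bit patterns: exponent all ones, mantissa all zeros.
pos_inf_bits = struct.pack(">f", inf).hex()
neg_inf_bits = struct.pack(">f", -inf).hex()
print(pos_inf_bits, neg_inf_bits)  # 7f800000 ff800000
```

Python itself raises `ZeroDivisionError` for `1.0 / 0.0` instead of returning an infinity, which is why the overflow is triggered by multiplication here.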