Floating-Point Numbers

Floating-point numbers are a system for representing real numbers in which the position of the radix point (the "decimal point" in base 10) is encoded in the number itself rather than fixed in advance.

IEEE 754

The IEEE Standard for Floating-Point Arithmetic (IEEE 754) describes a specific standard for representing numbers using a floating-point system in binary. The standard divides a sequence of bits into three fields, each of a fixed bit length: a sign bit, an exponent and a mantissa. The field widths and the exponent bias $K$ depend on the format:

binary32 (single precision, 32 bits) - the exponent is 8 bits, the mantissa is 23 bits and $K = 127$ .
binary64 (double precision, 64 bits) - the exponent is 11 bits, the mantissa is 52 bits and $K = 1023$ .
binary128 (quadruple precision, 128 bits) - the exponent is 15 bits, the mantissa is 112 bits and $K = 16383$ .

The value of a floating-point number is given by the following expression:

$$(-1)^{s} \times (1 + mantissa) \times 2^{exponent - K}$$

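The sign bit $s$ determines whether the number is negative or non-negative. If $s = 1$ , then the number is negative and if $s = 0$ , then the number is non-negative. The remaining two fields work like scientific notation in base $2$ : the mantissa supplies the significant digits and the exponent scales them by a power of two.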
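To make the expression concrete, here is a minimal Python sketch that splits a binary32 value into its three fields and evaluates the expression again. The function name `decode_binary32` is made up for this note, and the sketch assumes the input is a normal number (the exponent field is neither all zeros nor all ones).

```python
import struct

def decode_binary32(x: float) -> tuple:
    """Return (s, exponent, mantissa, value) for x encoded as binary32."""
    # Reinterpret the 4 bytes of the single-precision encoding as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                   # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits, kept as an integer
    K = 127                          # bias for binary32
    # Assumption: normal number, so the implicit leading 1 applies.
    value = (-1) ** s * (1 + mantissa / 2**23) * 2.0 ** (exponent - K)
    return s, exponent, mantissa, value

print(decode_binary32(-6.5))     # (1, 129, 5242880, -6.5)
print(decode_binary32(0.15625))  # (0, 124, 2097152, 0.15625)
```

For $-6.5$ the fields read as $s = 1$ , $exponent = 129$ and a mantissa of $0.625$ , so the expression gives $-1 \times 1.625 \times 2^{2} = -6.5$ .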

The mantissa is treated as a fraction between $0$ (inclusive) and $1$ (exclusive): its first bit is worth $\frac{1}{2}$ , the second $\frac{1}{4}$ , and so on.

A leading $1$ is added to the mantissa when performing calculations, yielding an effective range from $1$ (inclusive) to $2$ (exclusive). Because this leading $1$ is implied rather than stored, it provides one extra bit of precision at virtually no cost.
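The bit weights can be checked with exact fractions. The following is only an illustrative sketch: `mantissa_fraction` is a hypothetical helper that takes the mantissa bits as a string, most significant bit first.

```python
from fractions import Fraction

def mantissa_fraction(bits: str) -> Fraction:
    """Sum 1/2, 1/4, 1/8, ... for every mantissa bit that is set."""
    return sum(
        (Fraction(1, 2 ** (i + 1)) for i, b in enumerate(bits) if b == "1"),
        Fraction(0),
    )

frac = mantissa_fraction("101")  # 1/2 + 1/8
print(frac, 1 + frac)            # 5/8 13/8
```

Even with all 23 bits of a binary32 mantissa set, the fraction stays below $1$ , so the value with the leading $1$ added stays below $2$ .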

The exponent is largely treated as a regular integer.

On its own, an unsigned integer only allows for non-negative exponents, but we also want to represent negative ones. Therefore, when performing calculations, a bias $K$ is subtracted from the exponent. The bias is half of the greatest integer which can be represented with the chosen bit width for the exponent, rounded down. Essentially, if $exponent < K$ , then we take $2$ to a negative power. If $exponent > K$ , then we take $2$ to a positive power. If $exponent = K$ , then we have $2^{0}$ .
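The rule for the subtracted constant can be sketched in a few lines of Python. The names `bias` and `power_of_two` are invented for this example.

```python
def bias(exponent_bits: int) -> int:
    """Half of the greatest integer representable in the field, rounded down."""
    largest = 2 ** exponent_bits - 1
    return largest // 2

def power_of_two(exponent_field: int, exponent_bits: int) -> int:
    """Power to which 2 is raised for a given stored exponent field."""
    return exponent_field - bias(exponent_bits)

for width in (8, 11, 15):
    print(width, bias(width))  # 8 127, then 11 1023, then 15 16383

# Stored binary32 exponents below, at, and above the bias.
print(power_of_two(124, 8), power_of_two(127, 8), power_of_two(130, 8))  # -3 0 3
```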

The main advantage of this format is that it can represent both very large and very small numbers. The absolute precision, however, drops as the number moves farther away from zero: the gap between adjacent representable values grows with the magnitude of the number.
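The growing gaps can be observed directly. This sketch assumes Python 3.9 or later, where `math.ulp` returns the spacing between a float and the next larger representable one. Python floats are binary64 on common platforms.

```python
import math

# Spacing between neighbouring binary64 values at different magnitudes.
for x in (1.0, 1e8, 1e16):
    print(x, math.ulp(x))

gap_near_one = math.ulp(1.0)    # 2**-52
gap_near_1e16 = math.ulp(1e16)  # 2.0

# Near 1e16 the gap is 2, so adding 1 is lost to rounding.
print(1e16 + 1 == 1e16)         # True
```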

Info: Reserved Values

A few special values are reserved for different purposes:

Sign Bit	Exponent	Mantissa	Value
$0$	$0 \dots 0$	$0 \dots 0$	$+ 0$
$1$	$0 \dots 0$	$0 \dots 0$	$- 0$
$0$	$1 \dots 1$	$0 \dots 0$	$+ \infty$
$1$	$1 \dots 1$	$0 \dots 0$	$- \infty$
$0$ or $1$	$1 \dots 1$	$> 0$	$NaN$ (Not a Number)

The values $\pm \infty$ stand for results whose magnitude is too large to be represented by the IEEE 754 floating-point format with the chosen bit width. Dividing a non-zero number by zero also yields $\pm \infty$ . The value $NaN$ indicates the result of an invalid operation such as $0 / 0$ or $\infty - \infty$ .
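The reserved bit patterns can be inspected with a short Python sketch. `bits32` is a hypothetical helper that prints the binary32 encoding of a value. Note that Python itself raises `ZeroDivisionError` for float division by zero instead of returning $\pm \infty$ or $NaN$ , so the example produces these values through overflow and $\infty - \infty$ .

```python
import math
import struct

def bits32(x: float) -> str:
    """Binary32 encoding of x as a 32-character bit string (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{bits:032b}"

print(bits32(0.0))        # 00000000000000000000000000000000
print(bits32(-0.0))       # 10000000000000000000000000000000
print(bits32(math.inf))   # 01111111100000000000000000000000
print(bits32(-math.inf))  # 11111111100000000000000000000000

overflow = 1e308 * 10          # too large for binary64, becomes inf
invalid = math.inf - math.inf  # invalid operation, becomes nan
print(overflow, invalid, invalid == invalid)  # inf nan False
```

The last comparison shows a practical consequence of $NaN$ : it compares unequal to everything, including itself, so `math.isnan` is the reliable way to test for it.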
