IEEE 754 Floating Point Arithmetic: Algorithms and Examples

IEEE 754 is a standard for floating-point arithmetic used in computers. It defines how to represent floating-point numbers and perform operations like multiplication, addition, and division with them. The standard ensures consistency across different computing systems. This tutorial covers algorithms and examples for each of the operations and describes decimal to floating-point conversion and vice versa.

IEEE 754 Standard Floating Point Numbers

This tutorial attempts to provide a brief overview of the IEEE Floating-point Numbers format with the help of simple examples, without going too much into mathematical detail and notations. By the end of this tutorial, you should be able to understand what floating-point numbers are and their basic arithmetic operations such as addition, multiplication & division.

An IEEE 754 standard floating-point binary word consists of a sign bit, an exponent, and a mantissa, as shown in the figure below. An IEEE 754 single-precision floating-point number consists of 32 bits, of which:

1 bit = sign bit (s)
8 bits = biased exponent (e)
23 bits = mantissa (m)

floating point fig1

The decimal value of a normalized floating-point number in IEEE 754 standard is represented as:

floating point fig2

which can also be expressed as:

floating point fig3

Note: The leading “1” is hidden in the representation of an IEEE 754 floating-point word. Because a normalized mantissa always starts with 1, storing it would waste a bit, so it is left out. It is understood that we need to prepend the “1” to the mantissa of a floating-point word for conversions and calculations.

For example, in Figure 1, the mantissa represented is 0101_0000_0000_0000_0000_000. In actual fact it is (1.mantissa) = 1.0101_0000_0000_0000_0000_000.

To make equation 1 clearer, let’s consider the example in Figure 1. Let’s try and represent the floating-point binary word in the form of equation and convert it to an equivalent decimal value.

Floating point binary word X1 =

floating point fig4

Sign bit (S1) = 0
Biased Exponent (E1) = 1000_0001 (2) = 129(10)
Mantissa (M1) = 0101_0000_0000_0000_0000_000

This can be expressed as:

floating point fig5
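
Written out with the fields above, and assuming the single-precision bias of 127 together with the hidden leading 1, the value of X1 works out as:

$X1 = (-1)^{S1} \times (1.M1)_{(2)} \times 2^{(E1 - bias)} = (-1)^{0} \times 1.0101_{(2)} \times 2^{(129 - 127)} = 1.3125 \times 2^{2} = 5.25_{(10)}$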

IEEE 754 Standard Floating Point Conversions

Let’s look into an example for decimal to IEEE 754 floating-point number and IEEE 754 floating-point number to decimal conversion. This will make the concept and notations of floating-point numbers much clearer.

Decimal to IEEE 754 Standard Floating Point

Let’s take a decimal number, say 286.75, and represent it in IEEE floating-point format (Single precision, 32 bit). We need to find the Sign, exponent, and mantissa bits.

Represent the Decimal number 286.75 (10) into Binary format:

$286.75_{(10)} = 100011110.11_{(2)}$
The binary number is not normalized. Normalize it by shifting the binary point until exactly one “1” remains to its left (i.e., 1.m form):

$100011110.11_{(2)} = 1.0001111011_{(2)} \times 2^{8}$

We had to shift the binary point left 8 times to normalize it, so the exponent value (8) must be added to the bias. We now have the value of the mantissa: 0001_1110_1100_0000_0000_000.

Note: In floating-point numbers, the mantissa is treated as a fractional fixed-point binary number. Normalization is the process in which the mantissa bits are shifted either right or left (incrementing or decrementing the exponent accordingly) such that the most significant bit is “1”.
The bias is $Bias = 2^{(e-1)} - 1$, where e is the number of exponent bits. In our case e = 8 (IEEE 754 single precision).

$Bias = 2^{(8-1)} - 1 = 127$ (This is the bias value for the single-precision IEEE floating-point format).
The biased exponent E is represented as:

$E = exponent\ value\ obtained\ after\ normalization\ in\ step\ 2 + bias$

$E = 8 + 127 = 135_{(10)}$ . Convert this to binary and we have our exponent value:

$E = 10000111_{(2)}$
We now have our floating-point number equivalent to 286.75: sign bit 0, biased exponent 1000_0111 and mantissa 0001_1110_1100_0000_0000_000.

With this example of decimal to floating-point conversion, it should be clear what the mantissa, the exponent and the bias are.
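
The conversion steps can be sketched in Python. This is a minimal illustration rather than a complete encoder: the function name `encode_single` is hypothetical, and it assumes a nonzero, finite input in the normalized range whose mantissa fits in 24 bits, so no rounding is implemented.

```python
import struct

def encode_single(value: float) -> str:
    """Return 'sign exponent mantissa' bit fields for a single-precision word.

    Sketch only: assumes a nonzero, finite, normalized value that is exactly
    representable in single precision (extra mantissa bits are truncated).
    """
    sign = 0 if value > 0 else 1
    magnitude = abs(value)
    exponent = 0
    # Normalize to 1.m form, counting the binary point shifts
    while magnitude >= 2.0:
        magnitude /= 2.0
        exponent += 1
    while magnitude < 1.0:
        magnitude *= 2.0
        exponent -= 1
    bias = 2 ** (8 - 1) - 1                    # 127 for 8 exponent bits
    biased_exponent = exponent + bias
    # Drop the hidden 1 and keep 23 fraction bits
    mantissa = int((magnitude - 1.0) * 2 ** 23)
    return f"{sign} {biased_exponent:08b} {mantissa:023b}"

print(encode_single(286.75))  # 0 10000111 00011110110000000000000

# Cross-check against the machine encoding produced by the struct module
machine_word = struct.unpack(">I", struct.pack(">f", 286.75))[0]
print(f"{machine_word:032b}")
```

The second print shows the same 32 bits without separators, which is a quick way to confirm the hand conversion.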

IEEE 754 Standard Floating Point to Decimal Conversion

Let’s reverse the above process and convert back the floating-point word obtained above to decimal. We have already done this in section 1 but for a different value.

floating point fig8
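
The reverse conversion can be sketched the same way. The helper name `decode_single` is an assumption for illustration, and the sketch handles normalized numbers only (no zero, infinity, NaN or subnormal inputs).

```python
def decode_single(word: int) -> float:
    """Decode a 32-bit IEEE 754 single-precision word (normalized numbers only)."""
    sign = (word >> 31) & 0x1
    biased_exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    # Restore the hidden leading 1: significand = 1.fraction
    significand = 1 + fraction / 2 ** 23
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 127)

# Word obtained for 286.75: sign | exponent | mantissa
word = 0b0_10000111_00011110110000000000000
print(decode_single(word))  # 286.75
```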

IEEE 754 Standard Floating Point Arithmetic

Let us look at the multiplication, addition, subtraction and division algorithms performed on IEEE 754 floating-point numbers. Let us consider the IEEE 754 floating-point format numbers X1 & X2 for our calculations.

floating point fig9

IEEE 754 Standard Floating Point Multiplication Algorithm

A brief overview of the floating-point multiplication algorithm is explained below:

Consider two floating-point numbers $X1$ and $X2$. The result is $X3 = X1 \times X2 = (-1)^{S1} (M1 \times 2^{E1}) \times (-1)^{S2} (M2 \times 2^{E2})$

Where:

$S1, S2$ => Sign bits of numbers $X1$ & $X2$.
$E1, E2$ => Exponent bits of numbers $X1$ & $X2$.
$M1, M2$ => Mantissa bits of numbers $X1$ & $X2$.

Steps:

Check whether one or both operands are 0 or infinity (i.e., the exponent bits are all “0” or all “1”). If so, set the result to 0 or infinity accordingly.
The sign bit of the multiplicand ( $S1$ ) is XOR’d with the sign bit of the multiplier ( $S2$ ). The result is put into the resultant sign bit. $S3 = S1 \ xor \ S2$
The mantissas of the multiplicand ( $M1$ ) and the multiplier ( $M2$ ) are multiplied and the result is placed in the mantissa field of the result (truncate/round the result to 24 bits). $M3 = M1 \times M2$
The exponents of the multiplicand ( $E1$ ) and the multiplier ( $E2$ ) are added and the bias value is subtracted from the sum. The result is put in the exponent field of the result. $E3 = E1 + E2 - bias$
Normalize the product, either by shifting right and incrementing the exponent or by shifting left and decrementing the exponent.
Check for underflow/overflow. On overflow set the output to infinity; on underflow set it to zero.
If $E1 + E2 - bias$ is greater than or equal to $E_{max}$, set the product to infinity.
If $E1 + E2 - bias$ is less than or equal to $E_{min}$, set the product to zero.
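
The steps above can be sketched in Python on raw 32-bit words. The names `fp_mul`, `to_word` and `to_float` are hypothetical helpers. The sketch truncates instead of rounding, treats every all-zero exponent as zero (no subnormals), and, like the steps above, does not produce NaN, so infinity times zero simply returns infinity here.

```python
import struct

BIAS = 127

def to_word(x: float) -> int:
    """Raw 32-bit single-precision encoding of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def to_float(word: int) -> float:
    """Value of a raw 32-bit single-precision word."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

def fp_mul(w1: int, w2: int) -> int:
    s1, e1, m1 = w1 >> 31, (w1 >> 23) & 0xFF, w1 & 0x7FFFFF
    s2, e2, m2 = w2 >> 31, (w2 >> 23) & 0xFF, w2 & 0x7FFFFF
    s3 = s1 ^ s2                                  # sign: XOR of the sign bits
    # Special operands: exponent all ones -> infinity, all zeros -> zero
    if e1 == 0xFF or e2 == 0xFF:
        return (s3 << 31) | (0xFF << 23)
    if e1 == 0 or e2 == 0:
        return s3 << 31
    # Multiply the 24-bit mantissas (hidden 1 restored): 48-bit product
    product = (m1 | 0x800000) * (m2 | 0x800000)
    e3 = e1 + e2 - BIAS                           # add exponents, remove one bias
    # Normalize: a product >= 2.0 has bit 47 set and needs a 1-bit shift
    if product >> 47:
        product >>= 1
        e3 += 1
    m3 = (product >> 23) & 0x7FFFFF               # truncate, drop the hidden 1
    # Overflow / underflow checks
    if e3 >= 0xFF:
        return (s3 << 31) | (0xFF << 23)
    if e3 <= 0:
        return s3 << 31
    return (s3 << 31) | (e3 << 23) | m3

print(to_float(fp_mul(to_word(125.125), to_word(12.0625))))  # 1509.3203125
```

Working on integer fields rather than Python floats keeps each step visible: the sign, exponent and mantissa are computed separately and only packed together at the end.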

Floating Point Multiplication Example

Floating Point Multiplication is simpler when compared to floating point addition. Let’s try to understand the Multiplication algorithm with the help of an example. Let’s consider two decimal numbers:

$X1 = 125.125_{(10)}$ $X2 = 12.0625_{(10)}$ $X3= X1 * X2 = 1509.3203125$

Equivalent floating-point binary words are:

$X1 =$

floating point fig10

Steps:

Find the sign bit by XOR-ing the sign bits of $X1$ and $X2$, i.e. Sign bit => (0 xor 0) => 0
Multiply the mantissa values including the “hidden one”. The product of the two 24-bit mantissas ( $M1$ and $M2$ ) is 48 bits wide (2 bits are to the left of the binary point).

If the most significant bit of the 48-bit product $M3$ is “1”, shift the binary point one place to the left and add “1” to the exponent; otherwise add nothing. This normalizes the mantissa. Truncate the result to 24 bits.
Find the exponent of the result:

$= E1 + E2 - bias + (normalization\ adjustment\ from\ step\ 2)$

$= (10000101)_2 + (10000010)_2 - bias + 1$

$= 133 + 130 - 127 + 1 = 137$. That is, the normalization adjustment from step 2 is added to the biased exponent $E1 + E2 - bias = 136$:

i.e., $136 + 1 = 137$ => exponent value.

Note: The normalization of the product is simple because $M1$ and $M2$ each lie in the range 1 to 1.9999999, so the product lies in the range 1 to 3.9999999. Therefore, at most a 1-bit shift is required, with a corresponding adjustment of the exponent. So we have found the mantissa, sign, and exponent bits.
We have our final result, i.e., sign bit 0, biased exponent 1000_1001 and mantissa 0111_1001_0101_0100_1000_000.

If we convert this to decimal we get:

$X3 = 1509.3203125$
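
The intermediate values of this example can be checked with a few lines of Python. The two 24-bit mantissas below are derived from 125.125 and 12.0625 with the hidden 1 restored. The variable names are illustrative only.

```python
# 24-bit mantissas with the hidden 1 restored
m1 = 0b1111_1010_0100_0000_0000_0000  # 1.111101001 -> 125.125 = 1.955078125 * 2**6
m2 = 0b1100_0001_0000_0000_0000_0000  # 1.1000001   -> 12.0625 = 1.5078125 * 2**3

product = m1 * m2                 # up to 48 bits; binary point between bits 46 and 45
msb = product >> 47               # 1 when the product is >= 2.0
exponent = 133 + 130 - 127 + msb  # biased exponent after normalization

# Skip the hidden 1 and keep the 23 bits that follow it (truncation)
mantissa = (product >> (23 + msb)) & 0x7FFFFF

print(f"{product:048b}")
print(msb, exponent)              # 1 137
print(f"{mantissa:023b}")         # 01111001010101001000000
```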

IEEE 754 Standard Floating Point Addition Algorithm

Floating-point addition is more complex than multiplication. A brief overview of the floating-point addition algorithm is explained below:

$X3 = X1 + X2$

$X3 = (M1 \times 2^{E1}) \pm (M2 \times 2^{E2})$

Steps:

$X1$ and $X2$ can only be added if the exponents are the same, i.e. $E1 = E2$.
We assume that $X1$ has the larger absolute value of the two numbers. If it does not, swap the values so that $Abs(X1) > Abs(X2)$.
The initial value of the result exponent is the larger of the two exponents. Since the exponent of $X1$ is the larger one, the initial exponent result is $E3 = E1$.
Calculate the exponent difference, i.e. $Exp\_diff = (E1 - E2)$.
Shift the binary point of the mantissa ( $M2$ ) left by the exponent difference (equivalently, shift $M2$ right by $Exp\_diff$ bits). Now the exponents of both $X1$ and $X2$ are the same.
Compute the sum/difference of the mantissas depending on the sign bits $S1$ and $S2$.
- If signs of $X1$ and $X2$ are equal ( $S1 == S2$ ) then add the mantissas
- If signs of $X1$ and $X2$ are not equal ( $S1 != S2$ ) then subtract the mantissas
Normalize the resultant mantissa ( $M3$ ) if needed (1.m3 format); the initial exponent result $E3 = E1$ must be adjusted according to the normalization of the mantissa.
If any of the operands is infinity or if ( $E3 > E_{max}$ ), overflow has occurred and the output should be set to infinity. If ( $E3 < E_{min}$ ) then it’s an underflow and the output should be set to zero.
NaNs are not supported.
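
A minimal Python sketch of these steps follows. The function names are hypothetical, the bits shifted out during alignment are truncated rather than rounded, operands with an all-zero exponent are treated as zero, and NaNs are not handled, matching the restrictions listed above.

```python
import struct

def to_word(x: float) -> int:
    """Raw 32-bit single-precision encoding of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def to_float(word: int) -> float:
    """Value of a raw 32-bit single-precision word."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

def fp_add(w1: int, w2: int) -> int:
    # Make X1 the operand with the larger magnitude (swap if needed)
    if (w1 & 0x7FFFFFFF) < (w2 & 0x7FFFFFFF):
        w1, w2 = w2, w1
    s1, e1, m1 = w1 >> 31, (w1 >> 23) & 0xFF, w1 & 0x7FFFFF
    s2, e2, m2 = w2 >> 31, (w2 >> 23) & 0xFF, w2 & 0x7FFFFF
    if e1 == 0xFF:                       # infinity operand -> infinity
        return (s1 << 31) | (0xFF << 23)
    if e2 == 0:                          # X2 is zero -> result is X1
        return w1
    m1 |= 0x800000                       # restore the hidden 1
    m2 |= 0x800000
    e3 = e1                              # initial exponent of the result
    m2 >>= (e1 - e2)                     # align M2 to the exponent of X1
    # Equal signs: add the mantissas; different signs: subtract them
    m3 = m1 + m2 if s1 == s2 else m1 - m2
    if m3 == 0:
        return 0
    # Normalize back to 1.m form, adjusting the exponent
    while m3 >= (1 << 24):
        m3 >>= 1
        e3 += 1
    while m3 < (1 << 23):
        m3 <<= 1
        e3 -= 1
    if e3 >= 0xFF:                       # overflow -> infinity
        return (s1 << 31) | (0xFF << 23)
    if e3 <= 0:                          # underflow -> zero
        return s1 << 31
    return (s1 << 31) | (e3 << 23) | (m3 & 0x7FFFFF)

print(to_float(fp_add(to_word(9.75), to_word(0.5625))))   # 10.3125
print(to_float(fp_add(to_word(9.75), to_word(-0.5625))))  # 9.1875
```

Subtraction reuses the same function: the sign of the second operand is flipped before the call, and the sign comparison inside selects the mantissa subtraction.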

Floating Point Addition Example

$A = 9.75$ $B = 0.5625$

Equivalent floating-point binary words are:

floating point fig13

Steps:

$Abs(A) > Abs(B)$ ? Yes.
Result of Initial exponent $E3 = E1 = 10000010 = 130_{(10)}$
$E1 - E2 = (10000010 - 01111110) => (130 - 126) = 4$
Shift the mantissa $M2$ right by $(E1 - E2) = 4$ bits so that the exponents are the same for both numbers.
Sign bits of both are equal? Yes. Add the mantissas.
Normalization needed? No, (if Normalization was required for $M3$ then the initial exponent result $E3 = E1$ should be adjusted accordingly)
Result

$X3$ in decimal = 10.3125.
If we had to perform subtraction, we would just change the sign bit of $X2$ to “1”. We would then subtract the mantissas, since the sign bits are not equal.

NOTE: For floating point Subtraction, invert the sign bit of the number to be subtracted and apply it to floating-point Adder.

IEEE 754 Standard Floating Point Division Algorithm

Division of IEEE 754 Floating-point numbers ( $X1$ & $X2$ ) is done by dividing the mantissas and subtracting the exponents.

$X3 = (X1 / X2) = (-1)^{S1} (M1 \times 2^{E1}) / (-1)^{S2} (M2 \times 2^{E2}) = (-1)^{S3} (M1 / M2) 2^{(E1-E2)}$

Steps:

If the divisor $X2$ is zero, set the result to “infinity”; if both $X1$ and $X2$ are zero, set it to “NaN”.
Sign bit $S3 = (S1 \ xor \ S2)$ .
Find the mantissa by dividing $M1/M2$
Exponent $E3 = (E1 - E2) + bias$
Normalize if required, i.e. by left shifting the mantissa and decrementing the resultant exponent.
Check for overflow/underflow
- If $E3 > E_{max}$ return overflow i.e. “infinity”
- If $E3 < E_{min}$ return underflow i.e. zero
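
The division steps can be sketched in Python as well. As before, `fp_div` and the conversion helpers are hypothetical names. The quotient is truncated rather than rounded, and operands are assumed to be finite and either zero or normalized. The NaN bit pattern `0x7FC00000` is one common quiet-NaN encoding, chosen here for illustration.

```python
import struct

BIAS = 127

def to_word(x: float) -> int:
    """Raw 32-bit single-precision encoding of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def to_float(word: int) -> float:
    """Value of a raw 32-bit single-precision word."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

def fp_div(w1: int, w2: int) -> int:
    s1, e1, m1 = w1 >> 31, (w1 >> 23) & 0xFF, w1 & 0x7FFFFF
    s2, e2, m2 = w2 >> 31, (w2 >> 23) & 0xFF, w2 & 0x7FFFFF
    s3 = s1 ^ s2                               # sign: XOR of the sign bits
    if e2 == 0:                                # divisor is zero
        if e1 == 0:
            return 0x7FC00000                  # 0 / 0 -> NaN
        return (s3 << 31) | (0xFF << 23)       # x / 0 -> infinity
    if e1 == 0:                                # 0 / x -> zero
        return s3 << 31
    m1 |= 0x800000                             # restore the hidden 1
    m2 |= 0x800000
    e3 = e1 - e2 + BIAS                        # subtract exponents, re-add bias
    # Integer quotient with 23 fraction bits; M1/M2 lies between 0.5 and 2
    quotient = (m1 << 23) // m2
    if quotient < (1 << 23):                   # M1 < M2: normalize by one bit
        quotient = (m1 << 24) // m2            # recompute to keep the extra bit
        e3 -= 1
    if e3 >= 0xFF:                             # overflow -> infinity
        return (s3 << 31) | (0xFF << 23)
    if e3 <= 0:                                # underflow -> zero
        return s3 << 31
    return (s3 << 31) | (e3 << 23) | (quotient & 0x7FFFFF)

print(to_float(fp_div(to_word(127.03125), to_word(16.9375))))  # 7.5
```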

Floating Point Division Example

$X1 = 127.03125$ $X2 = 16.9375$

Equivalent floating-point binary words are:

floating point fig18

Steps:

$S3 = S1 \ xor \ S2 = 0$
$E3 = (E1 - E2) + bias = (10000101) - (10000011) + (01111111) = 133 - 131 + 127 => 129 => (10000001)$
Divide the mantissas $M1/M2$
Result

$X3$ in decimal = 7.5

floating point fig21

floating point fig22

Conclusion

The IEEE 754 standard for floating-point arithmetic provides a reliable framework for representing and manipulating real numbers in computer systems. The algorithms for multiplication, addition, and division are designed to maintain precision and handle a wide range of values. By adhering to the IEEE 754 standard, we can achieve accurate and efficient numerical calculations in a variety of scientific, engineering and financial applications. These concepts help developers and engineers working with numerical computations to optimize performance and prevent common pitfalls such as rounding errors and loss of significance.
