Lesson 7 | Floating-point numbers |

Objective | Explain what a floating-point number is. |

In modern computing, the representation of real numbers, especially floating-point numbers, demands both precision and efficiency. To address this, the Institute of Electrical and Electronics Engineers (IEEE) introduced the IEEE Standard 754 for Floating-Point Arithmetic. This standard has been widely adopted by the computing industry and serves as the benchmark for floating-point computation in computer hardware, languages, and operating systems.

**Overview of IEEE Standard 754:**

The IEEE Standard 754 provides a comprehensive methodology for representing and computing floating-point numbers. It defines:

- Formats for representing floating-point numbers.
- Rounding rules and operations.
- Exception handling (e.g., handling of overflow, underflow, and NaN (Not a Number) situations).

**Representation of Numbers:**

The standard primarily defines two basic formats:

- Single Precision (32 bits): 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fraction.
- Double Precision (64 bits): 1 bit for the sign, 11 bits for the exponent, and 52 bits for the fraction.

There are also extended formats, but single and double precision are the most commonly used.

**Application to Given Numbers:**
- 1/3: A rational number, but one that cannot be exactly represented in binary. IEEE 754 provides an approximation.
- PI: An irrational number, so it also cannot be exactly represented. In practice, a truncated or rounded version of its binary form is used.
- -1.23 x 10^35: This number is represented using the sign bit (set to indicate a negative value), an exponent adjusted by a bias value, and a fraction derived from the number's mantissa.
- -2.6 x 10^-28: Similarly, this number uses the sign bit for its negative value, an appropriate (negative) exponent, and a fraction.
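To make the first two cases concrete, here is a short Python sketch (any language with IEEE 754 doubles behaves the same way): printing more digits than a double actually carries exposes where the stored approximation of 1/3 diverges from the true value, and `Fraction` recovers the exact binary fraction the computer stores instead.

```python
import math
from fractions import Fraction

# Printing more digits than a double actually carries exposes the
# point where the stored approximation diverges from the true value.
print(f"{1/3:.20f}")      # 0.33333333333333331483 (an approximation, not 1/3)
print(f"{math.pi:.20f}")  # pi is likewise rounded to the nearest double

# Fraction() recovers the exact value the computer stored:
# a 53-bit integer over a power of two, not 1/3 itself.
print(Fraction(1/3))      # 6004799503160661/18014398509481984
```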

To represent real numbers such as 1/3, PI, -1.23 x 10^35, and -2.6 x 10^-28, most computers use IEEE Standard 754 *floating-point numbers*. In this representation, a real number is expressed as the product of a binary number greater than or equal to 1 and less than 2 (called the mantissa) and 2 raised to an integer exponent.
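Python can display this mantissa-times-power-of-two form directly via `float.hex` (a sketch; C's `printf("%a", x)` prints the same form). The mantissa appears in hexadecimal, always in the range [1, 2), followed by the power-of-two exponent:

```python
# float.hex shows a double as (hex mantissa in [1, 2)) * 2**exponent.
# For 1/3 the mantissa is 1.0101...01 in binary (0x1.555... in hex)
# and the exponent is -2, matching the form described above.
print((1/3).hex())   # 0x1.5555555555555p-2
print((2.0).hex())   # 0x1.0000000000000p+1  (mantissa 1.0, exponent 1)
print((0.75).hex())  # 0x1.8000000000000p-1  (mantissa 1.5, exponent -1)
```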

In practice, it is very unlikely that you will ever need to look at the binary form for the floating-point representation of a real number, so we will just take a quick look at one example to give you the general idea. Single precision floating-point representation uses 32 bits.

1 bit is used for the sign bit, 8 bits are used for the exponent, and 23 bits are used for the mantissa. Here's the 32-bit floating-point representation of the real number 1/3.

Floating-point number | Used to represent a real number on a computer. |


Sign bit | Exponent | Mantissa |
`0` | `01111101` | `01010101010101010101010` |

The sign bit is 0, indicating that this is a positive number.
The exponent field is the binary representation of the decimal number 125. To obtain the actual exponent, we subtract the bias of 127, giving -2. This exponent bias allows the exponent to range from -127 to 128 (the two extreme stored values are reserved for special cases such as zero, infinity, and NaN). Finally, the mantissa represents the binary number

1.01010101010101010101010

Note that the leading 1 of the mantissa is implied, to provide an additional bit of precision.
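The three fields can be pulled out of the 32-bit pattern directly. Here is a Python sketch using the standard `struct` module to pack 1/3 as a single-precision float and slice out the sign, exponent, and fraction bits:

```python
import struct

# Pack 1/3 into the 4 bytes of an IEEE 754 single-precision float,
# then reinterpret those bytes as a 32-bit unsigned integer.
bits = int.from_bytes(struct.pack(">f", 1 / 3), "big")

sign = bits >> 31                # top bit
exponent = (bits >> 23) & 0xFF   # next 8 bits (biased by 127)
fraction = bits & 0x7FFFFF       # low 23 bits (mantissa without the implied 1)

print(sign)                # 0 (positive)
print(exponent)            # 125, i.e. actual exponent 125 - 127 = -2
print(f"{fraction:023b}")  # 01010101010101010101011
```

Note that the low-order fraction bit printed here is 1: single precision rounds the infinite repeating pattern to the nearest representable value, so the last stored bit differs from the simple truncation shown in the table above.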

The decimal value of the mantissa is:

2^0 + 2^{-2} + 2^{-4} + 2^{-6} + ... + 2^{-22} = 1 + 1/4 + 1/16 + 1/64 + ... + 1/4194304

This is approximately 1.3333333 and thus, 1/3 is represented, using 32-bit floating-point representation, as approximately 1.3333333 * 2^{-2} or 1.3333333 * 1/4.
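The sum above can be checked numerically with a few lines of Python:

```python
# Sum the powers of two corresponding to the mantissa bits of 1/3:
# 2^0 + 2^-2 + 2^-4 + ... + 2^-22 (the implied leading 1 plus the
# eleven explicit 1-bits of the truncated 23-bit fraction).
mantissa = sum(2.0 ** -k for k in range(0, 23, 2))
print(mantissa)            # about 1.3333333 (slightly under 4/3)

# Multiplying by 2^-2 recovers the approximation of 1/3.
print(mantissa * 2 ** -2)  # about 0.3333333
```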

What's most important to remember about floating-point representation is that it allows you to represent a tremendous range of real numbers, but with limited precision. Single precision (32-bit) floating-point numbers are accurate to about 7 decimal digits, and double precision (64-bit) floating-point numbers are accurate to about 15 to 16 decimal digits.
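The roughly-7-digit limit of single precision can be seen by round-tripping a value through a 32-bit encoding; a Python sketch using `struct`:

```python
import struct

# Round-trip 1/3 through a 4-byte single-precision encoding.
single = struct.unpack(">f", struct.pack(">f", 1 / 3))[0]

print(single)             # agrees with 1/3 to about 7 decimal digits
print(1 / 3)              # the double keeps about 16 digits
print(abs(single - 1/3))  # error near 1e-8, i.e. past the 7th digit
```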

We have covered how a computer stores numbers. Next we will consider how text is stored.