IEEE Floating Point Standard 

The IEEE floating point standard (IEEE 754) is a floating point arithmetic system adopted by the Institute of Electrical and Electronics Engineers in 1985.
 
Requirements for machines adopting the IEEE floating point standard:
  1. Arithmetic should be correctly rounded.
  2. Floating point numbers should be consistently represented across machines.
  3. Exception handling should be sensible and consistent.

Floating point number representation

Single precision numbers in a 32-bit machine
The bit pattern $b_1 b_2 b_3 \ldots b_9 b_{10} b_{11} \ldots b_{32}$ of a word in a 32-bit machine represents the real number
$(-1)^s \times 2^{e-127} \times (1.f)_2$
where $s = b_1$, $e = (b_2 \ldots b_9)_2$, and $f = b_{10} b_{11} \ldots b_{32}$.

| sign bit | biased exponent | fraction from normalized mantissa |
|---|---|---|
| 1 bit | 8 bits | 23 bits |
| s | e | f |
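Note that only the fraction of the normalized mantissa is stored. The leading 1 is a hidden bit, so the mantissa is actually represented by 24 binary digits. The formula applies to normalized numbers, that is, to biased exponents 1 ≤ e ≤ 254; the patterns e = 0 and e = 255 are reserved for zero, subnormal numbers, infinity, and NaN.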
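The formula can be checked mechanically. The following is a minimal Python sketch, not part of the standard: the helper names `decode_single` and `single_bits` are made up for illustration, and the sketch assumes the platform's `struct` module packs `f` as an IEEE single precision word, which is the case on common hardware.

```python
import struct

def decode_single(bits: str) -> float:
    """Evaluate (-1)^s * 2^(e-127) * (1.f)_2 for a 32-character bit string.

    Only normalized patterns (1 <= e <= 254) are handled in this sketch.
    """
    assert len(bits) == 32 and set(bits) <= {"0", "1"}
    s = int(bits[0], 2)        # b1, the sign bit
    e = int(bits[1:9], 2)      # b2...b9, the biased exponent
    f = int(bits[9:], 2)       # b10...b32, the stored fraction
    assert 1 <= e <= 254, "zero, subnormal, infinity and NaN are not covered"
    mantissa = 1 + f / 2**23   # restore the hidden bit: (1.f)_2
    return (-1) ** s * 2.0 ** (e - 127) * mantissa

def single_bits(x: float) -> str:
    """Bit pattern stored for x in single precision, most significant bit first."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{word:032b}"

pattern = single_bits(-6.5)
print(pattern[0], pattern[1:9], pattern[9:])
print(decode_single(pattern))
```

Here -6.5 = -1.625 x 2^2, so the stored exponent is 2 + 127 = 129 and the fraction holds the bits of 0.625 after the hidden leading 1.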

Double precision numbers in a 32-bit machine
The bit pattern $b_1 b_2 b_3 \ldots b_{12} b_{13} b_{14} \ldots b_{64}$ of two words in a 32-bit machine represents the real number
$(-1)^s \times 2^{e-1023} \times (1.f)_2$
where $s = b_1$, $e = (b_2 \ldots b_{12})_2$, and $f = b_{13} b_{14} \ldots b_{64}$.

| sign bit | biased exponent | fraction from normalized mantissa |
|---|---|---|
| 1 bit | 11 bits | 52 bits |
| s | e | f |
Note that only the fraction of the normalized mantissa is stored. The leading 1 is a hidden bit, so the mantissa is actually represented by 53 binary digits.
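To make the hidden bit concrete, the sketch below splits a double into its three fields and rebuilds the value from a 53-bit integer mantissa. It is an illustration only: `double_fields` is a hypothetical helper name, and the code assumes Python floats are IEEE doubles, as they are on common platforms.

```python
import struct

def double_fields(x: float):
    """Return (s, e, f) for the 64-bit pattern of x: 1, 11 and 52 bits wide."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    s = word >> 63                 # b1
    e = (word >> 52) & 0x7FF       # b2...b12, the biased exponent
    f = word & ((1 << 52) - 1)     # b13...b64, the stored fraction
    return s, e, f

s, e, f = double_fields(0.1)
significand = (1 << 52) | f        # hidden bit restored: a 53-bit integer
# (1.f)_2 equals significand / 2^52, so the value is significand * 2^(e-1023-52).
value = (-1) ** s * significand * 2.0 ** (e - 1023 - 52)
print(s, e, significand.bit_length(), value == 0.1)
```

The integer `significand` has 53 bits even though only 52 are stored, which is the point of the note above.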

Decimal values of some floating point numbers on a 32-bit machine:

| | Single Precision | Double Precision |
|---|---|---|
| Machine epsilon | $2^{-23}$ or $1.192 \times 10^{-7}$ | $2^{-52}$ or $2.220 \times 10^{-16}$ |
| Smallest positive (normalized) | $2^{-126}$ or $1.175 \times 10^{-38}$ | $2^{-1022}$ or $2.225 \times 10^{-308}$ |
| Largest positive | $(2 - 2^{-23}) \times 2^{127}$ or $3.403 \times 10^{38}$ | $(2 - 2^{-52}) \times 2^{1023}$ or $1.798 \times 10^{308}$ |
| Smallest subnormal | $2^{-149}$ or $1.401 \times 10^{-45}$ | $2^{-1074}$ or $4.941 \times 10^{-324}$ |
| Decimal precision | 6 significant digits | 15 significant digits |
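The entries above can be recomputed directly from the exponent ranges. The sketch below is one way to do so in Python; `pow2` and `to_single` are illustrative helper names, and the code assumes Python floats are IEEE doubles and that `struct` stores `f` as an IEEE single.

```python
import math
import struct
import sys

def pow2(k: int) -> float:
    """Exact power of two 2^k as a Python float."""
    return math.ldexp(1.0, k)

def to_single(x: float) -> float:
    """Store x in a 32-bit word and read it back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

rows = {
    "Machine epsilon": (pow2(-23), pow2(-52)),
    "Smallest positive normalized": (pow2(-126), pow2(-1022)),
    "Largest positive": ((2 - pow2(-23)) * pow2(127), (2 - pow2(-52)) * pow2(1023)),
    "Smallest subnormal": (pow2(-149), pow2(-1074)),
}

for name, (single, double) in rows.items():
    # Each single precision entry must survive storage in 32 bits unchanged.
    assert to_single(single) == single
    print(f"{name:30s} {single:.3e}   {double:.3e}")

# Cross-check the double precision column against the interpreter's own report.
print(sys.float_info.epsilon == rows["Machine epsilon"][1])
print(sys.float_info.min == rows["Smallest positive normalized"][1])
print(sys.float_info.max == rows["Largest positive"][1])
```

`math.ldexp` is used instead of the `**` operator so that the tiny subnormal values are formed without relying on how the power function reports underflow.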

Rounding in IEEE standard

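Round to nearest is the default mode and the most common choice. Given a real number x, its correctly rounded value is the floating point number fl(x) that is closest to x. When x lies exactly halfway between two floating point numbers, the one whose last fraction bit is 0 is chosen (round half to even).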
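A small experiment illustrates the definition of fl(x) in single precision. This is a sketch under stated assumptions: `fl` and `neighbours` are hypothetical helper names, the conversion to 32 bits is assumed to use round to nearest (the usual hardware default), and the input x is itself the double nearest to 0.1 rather than the real number 0.1. Exact rational arithmetic from the standard `fractions` module is used to compare distances without further rounding.

```python
import struct
from fractions import Fraction

def fl(x: float) -> float:
    """Single precision number obtained by storing x in a 32-bit word."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def neighbours(y: float):
    """Single precision numbers just below and just above a positive single y."""
    (word,) = struct.unpack(">I", struct.pack(">f", y))
    below = struct.unpack(">f", struct.pack(">I", word - 1))[0]
    above = struct.unpack(">f", struct.pack(">I", word + 1))[0]
    return below, above

x = 0.1
y = fl(x)
lo, hi = neighbours(y)

# fl(x) should be at least as close to x as either adjacent single.
err = abs(Fraction(x) - Fraction(y))
print(err <= abs(Fraction(x) - Fraction(lo)))
print(err <= abs(Fraction(x) - Fraction(hi)))

# For normalized numbers the relative error is at most half of machine epsilon.
print(err / Fraction(x) <= Fraction(1, 2**24))
```

All three comparisons are expected to print True when the conversion rounds to nearest.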

Special values in IEEE floating point standard

Single Precision representation

| value | sign bit (1 bit) | biased exponent (8 bits) | fraction from normalized mantissa (23 bits) |
|---|---|---|---|
| 7/4 | 0 | 01111111 | 11000000000000000000000 |
| -34.432175 | 1 | 10000100 | 00010011011101010001100 |
| -959818 | 1 | 10010010 | 11010100101010010100000 |
| +0 | 0 | 00000000 | 00000000000000000000000 |
| -0 | 1 | 00000000 | 00000000000000000000000 |
| macheps | 0 | 01101000 | 00000000000000000000000 |
| "smallest" | 0 | 00000001 | 00000000000000000000000 |
| "largest" | 0 | 11111110 | 11111111111111111111111 |
| infinity | 0 | 11111111 | 00000000000000000000000 |
| NaN | 0 or 1 | 11111111 | any pattern that is not all 0s |
| $2^{-128}$** | 0 | 00000000 | 01000000000000000000000 |
**This is a subnormal number. It is machine representable, but it carries fewer significant bits than a normalized value and is therefore less accurate in computation.
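The bit patterns of such values can be reproduced on any machine that stores single precision numbers in the IEEE format. The following Python sketch prints the three fields for several of them; `single_fields` is an illustrative helper name, and the code assumes `struct` packs `f` as an IEEE single. The fraction bits of a NaN are not fixed by this sketch, so only the property "exponent all 1s, fraction not all 0s" is checked.

```python
import math
import struct

def single_fields(x: float):
    """Return (sign, exponent, fraction) bit strings of x stored in single precision."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return bits[0], bits[1:9], bits[9:]

samples = {
    "7/4": 7 / 4,
    "-34.432175": -34.432175,
    "-959818": -959818.0,
    "+0": 0.0,
    "-0": -0.0,
    "macheps": 2.0 ** -23,
    "smallest": 2.0 ** -126,
    "largest": (2 - 2.0 ** -23) * 2.0 ** 127,
    "infinity": math.inf,
    "2^-128": 2.0 ** -128,   # subnormal in single precision
}

for name, value in samples.items():
    print(f"{name:12s}", *single_fields(value))

# A NaN has an all-ones exponent and a fraction that is not all zeros.
_, nan_exponent, nan_fraction = single_fields(math.nan)
print("NaN", nan_exponent == "1" * 8 and "1" in nan_fraction)
```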