IEEE Float

IEEE Floating Point Standard

The IEEE floating point standard (IEEE 754) is a floating point arithmetic system adopted by the Institute of Electrical and Electronics Engineers in 1985.

Requirements for machines adopting the IEEE floating point standard

Arithmetic should be correctly rounded

Floating point numbers should be consistently represented across machines

Exception handling should be sensible and consistent

Floating point number representation

Single precision numbers in a 32-bit machine

The bit pattern b₁b₂b₃...b₉b₁₀b₁₁...b₃₂ of a word in a 32-bit machine represents the real number

(-1)^s x 2^(e-127) x (1.f)₂

where s = b₁, e = (b₂...b₉)₂, and f = b₁₀b₁₁...b₃₂.


sign bit	biased exponent	fraction from normalized mantissa
1 bit	8 bits	23 bits
s	e	f

Note that only the fraction from the normalized mantissa is stored. The leading 1 is a hidden bit, so the mantissa is actually represented by 24 binary digits.
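The field layout above can be checked with a short Python sketch. It assumes only the standard `struct` module; the helper names `decode_single` and `value_of_normalized` are illustrative and not part of the standard.

```python
import struct

def decode_single(x):
    """Split x, rounded to single precision, into the fields (s, e, f)."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31             # b1: sign bit
    e = (bits >> 23) & 0xFF    # b2..b9: biased exponent
    f = bits & 0x7FFFFF        # b10..b32: stored fraction (hidden bit not included)
    return s, e, f

def value_of_normalized(s, e, f):
    """Evaluate (-1)^s x 2^(e-127) x (1.f)_2 for a normalized number (0 < e < 255)."""
    return (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)

s, e, f = decode_single(1.75)
print(s, format(e, "08b"), format(f, "023b"))   # sign, exponent bits, fraction bits
print(value_of_normalized(s, e, f))             # rebuilds 1.75 from the three fields
```

For 1.75 the biased exponent is 127 and the fraction begins with the bits 11, since 1.75 = (1.11)₂ x 2^0.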

Double precision numbers in a 32-bit machine

The bit pattern b₁b₂b₃...b₁₂b₁₃b₁₄...b₆₄ of two words in a 32-bit machine represents the real number

(-1)^s x 2^(e-1023) x (1.f)₂

where s = b₁, e = (b₂...b₁₂)₂, and f = b₁₃b₁₄...b₆₄.

sign bit	biased exponent	fraction from normalized mantissa
1 bit	11 bits	52 bits
s	e	f

Note that only the fraction from the normalized mantissa is stored. The leading 1 is a hidden bit, so the mantissa is actually represented by 53 binary digits.
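As a rough illustration of the 64-bit layout and of the 53-digit mantissa, the sketch below decodes a Python float. It assumes Python floats are IEEE doubles, which holds on common platforms; `decode_double` is a hypothetical helper name.

```python
import struct

def decode_double(x):
    """Split a double into the fields (s, e, f) of the 64-bit pattern."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                # b1: sign bit
    e = (bits >> 52) & 0x7FF      # b2..b12: biased exponent
    f = bits & ((1 << 52) - 1)    # b13..b64: stored fraction
    return s, e, f

x = -34.432175
s, e, f = decode_double(x)
# Rebuild the number from (-1)^s x 2^(e-1023) x (1.f)_2; every step here is exact.
value = (-1) ** s * 2.0 ** (e - 1023) * (1 + f / 2 ** 52)
print(s, e - 1023, value == x)

# With the hidden bit there are 53 binary digits, so 2^53 + 1 needs one digit too many
# and is stored as 2^53.
print(float(2 ** 53 + 1) == float(2 ** 53))
```

Both comparisons print True: the three fields carry the whole number, and 2^53 + 1 is the first positive integer that a double cannot hold.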

Decimal values of some normalized floating point numbers on a 32-bit machine:


	Single Precision	Double Precision
Machine epsilon	2^-23 or 1.192 x 10^-7	2^-52 or 2.220 x 10^-16
Smallest positive	2^-126 or 1.175 x 10^-38	2^-1022 or 2.225 x 10^-308
Largest positive	(2 - 2^-23) x 2^127 or 3.403 x 10^38	(2 - 2^-52) x 2^1023 or 1.798 x 10^308
Smallest subnormal	2^-149 or 1.401 x 10^-45	2^-1074 or 4.941 x 10^-324
Decimal Precision	6 significant digits	15 significant digits
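The normalized entries in the table follow directly from the field widths. The sketch below recomputes them, assuming Python floats are IEEE doubles; `to_single` is an illustrative helper that rounds a value through 32 bits.

```python
import struct
import sys

def to_single(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Single precision: 23 fraction bits, normalized exponents from -126 to 127.
single = {
    "machine epsilon": 2.0 ** -23,
    "smallest positive": 2.0 ** -126,
    "largest positive": (2 - 2.0 ** -23) * 2.0 ** 127,
}
for name, value in single.items():
    # A value that is exactly representable survives the 32-bit round trip unchanged.
    print(f"{name:18s} {value:.3e}  exact in single: {to_single(value) == value}")

# Double precision: 52 fraction bits, normalized exponents from -1022 to 1023.
print(sys.float_info.epsilon == 2.0 ** -52)
print(sys.float_info.min == 2.0 ** -1022)
print(sys.float_info.max == (2 - 2.0 ** -52) * 2.0 ** 1023)
```

The double precision checks compare the formulas against the constants that Python reports in `sys.float_info`.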

Rounding in IEEE standard

Round-to-nearest mode is the most common choice. Given a real number x, its correctly rounded value is the floating point number fl(x) that is closest to x.
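A small Python sketch can make fl(x) concrete. It assumes Python floats are IEEE doubles evaluated in the default rounding mode, and it uses exact rational arithmetic from `fractions` to measure the rounding error. It also shows what happens when x lies exactly halfway between two floating point numbers: the default IEEE mode then picks the neighbor whose last fraction bit is 0.

```python
from fractions import Fraction

# Exact rational value of the double that the literal 0.1 rounds to.
x = Fraction(1, 10)
fl_x = Fraction(0.1)

# 0.1 lies in [2^-4, 2^-3), where neighboring doubles are 2^-56 apart.
spacing = Fraction(1, 2 ** 56)
error = abs(fl_x - x)
print(error > 0)              # 0.1 is not exactly representable in binary
print(error <= spacing / 2)   # nearest rounding: error is at most half the spacing

# Ties: 1 + eps/2 is exactly halfway between the doubles 1 and 1 + eps.
eps = 2.0 ** -52
print(1.0 + eps / 2 == 1.0)                  # resolved toward 1 (last fraction bit 0)
print(1.0 + 3 * eps / 2 == 1.0 + 2 * eps)    # resolved toward 1 + 2*eps (last fraction bit 0)
```

All four lines print True under the stated assumptions.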

Special values in IEEE floating point standard

Single Precision representation

	sign bit	biased exponent	fraction from normalized mantissa
	1 bit	8 bits	23 bits
7/4	0	01111111	11000000000000000000000
-34.432175	1	10000100	00010011011101010001100
-959818	1	10010010	11010100101010010100000
+0	0	00000000	00000000000000000000000
-0	1	00000000	00000000000000000000000
macheps	0	01101000	00000000000000000000000
"smallest"	0	00000001	00000000000000000000000
"largest"	0	11111110	11111111111111111111111
infinity	0	11111111	00000000000000000000000
NaN	0	11111111	Not all 0s
2^-128**	0	00000000	01000000000000000000000

**This is a subnormal number. It is machine representable, but it carries fewer significant bits than a normalized value and is therefore less accurate in computation.
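The special patterns can be built field by field and inspected, as in the sketch below. It assumes the standard IEEE 754 encoding: an all-ones exponent with an all-zero fraction is infinity, an all-ones exponent with a nonzero fraction is NaN, and an all-zero exponent with a nonzero fraction is a subnormal number with value (-1)^s x 2^-126 x (0.f)₂. The helper `single_from_fields` is a hypothetical name.

```python
import math
import struct

def single_from_fields(s, e, f):
    """Assemble sign (1 bit), biased exponent (8 bits), and fraction (23 bits) into a float."""
    bits = (s << 31) | (e << 23) | f
    return struct.unpack(">f", struct.pack(">I", bits))[0]

plus_zero = single_from_fields(0, 0, 0)
minus_zero = single_from_fields(1, 0, 0)
infinity = single_from_fields(0, 0xFF, 0)
nan = single_from_fields(0, 0xFF, 1 << 22)    # any nonzero fraction gives a NaN
subnormal = single_from_fields(0, 0, 1 << 21) # fraction 0100...0, no hidden bit

print(plus_zero == minus_zero, math.copysign(1.0, minus_zero))  # equal, yet the sign is kept
print(math.isinf(infinity), math.isnan(nan), nan != nan)        # NaN is unequal even to itself
print(subnormal == 2.0 ** -128)                                 # 2^-126 x (0.01)_2
```

The subnormal 2^-128 has only 22 significant bits left in its fraction field, which is why it is less accurate than a normalized value.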
