Double Format - Oracle® Solaris Studio 12.4: Numerical Computation Guide

2.2.3 Double Format

The IEEE double format consists of three fields: a 52-bit fraction, f; an 11-bit biased exponent, e; and a 1-bit sign, s. These fields are stored contiguously in two successively addressed 32-bit words, as shown in the following figure.

In the SPARC architecture, the higher address 32-bit word contains the least significant 32 bits of the fraction, while in the x86 architecture the lower address 32‐bit word contains the least significant 32 bits of the fraction.

If f[31:0] denotes the least significant 32 bits of the fraction, then bit 0 is the least significant bit of the entire fraction and bit 31 is the most significant of the 32 least significant fraction bits.

In the other 32-bit word, bits 0:19 contain the 20 most significant bits of the fraction, f[51:32], with bit 0 being the least significant of these 20 most significant fraction bits, and bit 19 being the most significant bit of the entire fraction; bits 20:30 contain the 11-bit biased exponent, e, with bit 20 being the least significant bit of the biased exponent and bit 30 being the most significant; and the highest-order bit 31 contains the sign bit, s.
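The word ordering described above can be observed with a short Python sketch. It assumes that big-endian packing stands in for the SPARC layout and little-endian packing for the x86 layout; the helper name `words_in_memory_order` is illustrative and not part of any library.

```python
import struct

def words_in_memory_order(x, byteorder):
    """Return the two 32-bit words of the double x, lowest address first.

    byteorder is '>' (big-endian, assumed here to model SPARC) or
    '<' (little-endian, assumed here to model x86).
    """
    raw = struct.pack(byteorder + "d", x)          # 8 bytes in memory order
    return struct.unpack(byteorder + "II", raw)    # two 32-bit words

# For 1.0 the word holding s, e, and f[51:32] is 0x3ff00000 and the
# word holding f[31:0] is 0x00000000.
print([hex(w) for w in words_in_memory_order(1.0, ">")])
print([hex(w) for w in words_in_memory_order(1.0, "<")])
```

With big-endian packing the word that holds f[31:0] comes second (higher address); with little-endian packing it comes first (lower address).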

The following figure numbers the bits as though the two contiguous 32-bit words were one 64‐bit word in which bits 0:51 store the 52-bit fraction, f; bits 52:62 store the 11-bit biased exponent, e; and bit 63 stores the sign bit, s.
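As a minimal sketch of this 64-bit view, the three fields can be extracted with shifts and masks. The function name `split_fields` is hypothetical, and the code assumes that Python floats are IEEE doubles, which holds on mainstream platforms.

```python
import struct

def split_fields(x):
    """Return (s, e, f) for the double x, using the 64-bit numbering:
    bit 63 is s, bits 52:62 are e, bits 0:51 are f."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63                  # 1-bit sign
    e = (bits >> 52) & 0x7FF        # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)      # 52-bit fraction
    return s, e, f

print(split_fields(1.0))
print(split_fields(-2.0))
print(split_fields(1.5))
```

For example, 1.5 has the fraction's most significant bit set (bit 51) and a biased exponent of 1023.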

Figure 2-2 Double-Storage Format

[Figure: representation of bits in double-storage format, with s in bit 63, e in bits 62:52, and f in bits 51:0.]

The values of the bit patterns in these three fields determine the value represented by the overall bit pattern.

Table 2-4 shows how the values of the three constituent fields determine the value represented by the double-format bit pattern. In the table, u means that the value of the indicated field is irrelevant to the value of that particular bit pattern.

Table 2-4 Values Represented by Bit Patterns in IEEE Double Format

Double-Format Bit Pattern	Value
0 < `e` < 2047	(–1)^s × 2^(e–1023) × 1.`f` (normal numbers)
`e` = 0; `f` ≠ 0 (at least one bit in `f` is nonzero)	(–1)^s × 2^(–1022) × 0.`f` (subnormal numbers)
`e` = 0; `f` = 0 (all bits in `f` are zero)	(–1)^s × 0.0 (signed zero)
`s` = 0; `e` = 2047; `f` = 0 (all bits in `f` are zero)	+INF (positive infinity)
`s` = 1; `e` = 2047; `f` = 0 (all bits in `f` are zero)	–INF (negative infinity)
`s` = u; `e` = 2047; `f` ≠ 0 (at least one bit in `f` is nonzero)	NaN (Not-a-Number)
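The rules in the table can be written as a small decoder. This is a sketch only: the function name `decode` is illustrative, and it relies on `math.ldexp` to scale the integer significand, treating 1.f as (2^52 + f) × 2^(–52) and 0.f as f × 2^(–52).

```python
import math

def decode(s, e, f):
    """Return the value of the double with sign s, biased exponent e,
    and 52-bit fraction f, following the rows of the table above."""
    sign = -1.0 if s else 1.0
    if e == 2047:
        # Infinity when the fraction is zero, NaN otherwise.
        return sign * math.inf if f == 0 else math.nan
    if e == 0:
        # Signed zero or subnormal: (-1)^s * 2^(-1022) * 0.f
        return sign * math.ldexp(f, -1022 - 52)
    # Normal: (-1)^s * 2^(e-1023) * 1.f
    return sign * math.ldexp((1 << 52) | f, e - 1023 - 52)

print(decode(0, 1023, 0))
print(decode(1, 1024, 0))
print(decode(0, 0, 1))
```

Every intermediate value here is exactly representable as a double, so the scaling does not introduce rounding.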

When e < 2047, the value assigned to the double-format bit pattern is formed by placing the binary point immediately to the left of the fraction's most significant bit and placing an implicit bit immediately to the left of the binary point. The number thus formed is called the significand. The implicit bit is so named because it is not stored in the double-format bit pattern; its value is implied by the biased exponent field: 1 when e > 0 and 0 when e = 0.

For the double format, the difference between a normal number and a subnormal number is that the leading bit of the significand (the bit to the left of the binary point) is 1 for a normal number and 0 for a subnormal number. Double-format subnormal numbers were called double-format denormalized numbers in the original IEEE Standard 754-1985.

The 52-bit fraction combined with the implicit leading significand bit provides 53 bits of precision in double-format normal numbers.
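One way to see the 53-bit limit is to look at integers near 2^53. The sketch below assumes Python floats are IEEE doubles, which `sys.float_info.mant_dig` reports on such platforms.

```python
import sys

big = 2.0 ** 53

# 2^53 - 1 fits in 53 significant bits, so it is stored exactly.
exact_below = (big - 1.0) != big

# 2^53 + 1 would need 54 significant bits; the sum rounds back to 2^53.
absorbed_above = (big + 1.0) == big

print(sys.float_info.mant_dig, exact_below, absorbed_above)
```

Every integer with magnitude up to 2^53 is therefore exactly representable, while beyond that point only some integers are.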

Examples of important bit patterns in the double-storage format are shown in Table 2-5. The bit patterns in the second column appear as two 8-digit hexadecimal numbers. For the SPARC architecture, the left number is the value of the lower addressed 32-bit word and the right number is the value of the higher addressed 32-bit word. For the x86 architecture, the left number is the higher addressed word and the right number is the lower addressed word.

The maximum positive normal number is the largest finite number representable in the IEEE double format. The minimum positive subnormal number is the smallest positive number representable in the IEEE double format. The minimum positive normal number is often referred to as the underflow threshold. (The decimal values for the maximum and minimum normal and subnormal numbers are approximate; they are correct to the number of figures shown.)

Table 2-5 Bit Patterns in Double-Storage Format and Their IEEE Values

Common Name	Bit Pattern (Hex)	Decimal Value
+ 0	`00000000 00000000`	0.0
– 0	`80000000 00000000`	–0.0
1	`3ff00000 00000000`	1.0
2	`40000000 00000000`	2.0
max normal number	`7fefffff ffffffff`	1.7976931348623157e+308
min positive normal number	`00100000 00000000`	2.2250738585072014e–308
max subnormal number	`000fffff ffffffff`	2.2250738585072009e–308
min positive subnormal number	`00000000 00000001`	4.9406564584124654e–324
+∞	`7ff00000 00000000`	Infinity
–∞	`fff00000 00000000`	–Infinity
Not-a-Number	`7ff80000 00000000`	NaN
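The rows of this table can be checked with a short sketch that reads each hex pattern as a big-endian double, that is, with the left 8-digit group as the word holding the sign and exponent. The helper name `from_hex` is illustrative.

```python
import math
import struct

def from_hex(pattern):
    """Interpret 'hhhhhhhh llllllll' (sign/exponent word first) as a double."""
    raw = bytes.fromhex(pattern.replace(" ", ""))
    return struct.unpack(">d", raw)[0]

print(from_hex("3ff00000 00000000"))   # the pattern listed for 1
print(from_hex("7fefffff ffffffff"))   # max normal number
print(from_hex("00000000 00000001"))   # min positive subnormal number
print(math.isnan(from_hex("7ff80000 00000000")))
```

Python prints the shortest decimal string that round-trips, so the printed digits may be fewer than the 17 shown in the table while denoting the same double.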

A NaN (Not-a-Number) can be represented by any bit pattern that satisfies the definition of NaN in Table 2-4, that is, any pattern with e = 2047 and a nonzero fraction. The hex value of the NaN shown in Table 2-5 is just one of the many bit patterns that can be used to represent a NaN.
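As an illustration, the sketch below builds several distinct patterns with e = 2047 and a nonzero fraction and confirms that each is a NaN. The helper name `from_bits` and the chosen patterns are arbitrary examples, not values taken from this guide.

```python
import math
import struct

def from_bits(bits):
    """Reinterpret a 64-bit integer as a double."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

# Arbitrary example patterns: e = 2047 (all ones) and f != 0, either sign.
nan_patterns = [
    0x7FF8000000000000,   # the pattern shown in the table
    0x7FF0000000000001,   # only the lowest fraction bit set
    0x7FF8000000000123,   # extra low-order fraction bits set
    0xFFFFFFFFFFFFFFFF,   # sign bit set, all fraction bits set
]
all_nan = all(math.isnan(from_bits(p)) for p in nan_patterns)

# Two signs times (2^52 - 1) nonzero fractions.
nan_count = 2 * (2**52 - 1)
print(all_nan, nan_count)
```

The count follows directly from the definition: the sign bit is unconstrained and the fraction may take any of its 2^52 values except zero.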