REPRESENTATION OF FLOATING-POINT NUMBERS

Introduction

Floating-point numbers represent two components: one is an exponent and the other is fraction. For example, the number 200.07 can be represented as 0.20007*10³, where 0.2007 is the fraction and 3 is the exponent. In a binary form, they are represented similarly. There are two types of representation: short or single- precision floating-point number and long or double-precision floating-point number. short occupies 4 bytes or 32 bits while long occupies 8 bytes or 64 bits.

Program/Example

In C, short or single-precision floating point is represented by the data type float and appears as:

float f ;

A single-precision floating-point number is represented as follows:

Here the fractional part occupies 23 bits from 0 to 22. The exponent part occupies 8 bits from 23 to 30 (bias exponent, that is, exponent + 01111111). The sign bit occupies the 31st bit.

Suppose the decimal number is 100.25. It can be converted as follows:

Convert 100.25 into its equivalent binary representation: 1100100.01.
Then represent this number so that there is only 1 bit on the left side of the decimal point: 1.0010001*2⁶
In a binary representation, exponent 6 means the number 110. Now add the bias, 0111 1111, to get the exponent: 1000 0101

Since the number is positive, the sign bit is 0. The significant, or fractional, part is:

1001 0001 0000 0000 0000 000

Note that up until the fractional part, only those bits that are on the right side of the decimal point are present. The 0s are added to the right side to make the fractional part take up 23 bits.

Special rules are applied for some numbers:

The number 0 is stored as all 0s, but the sign bit is 1.
Positive infinity is represented as all 1s in the exponent and all 0s in the fractional part with the sign bit 0.
Negative infinity is represented as all 1s in the exponent and all 0s in fractional part with the sign bit 1.
A NAN (not a number) is an invalid floating number in which all the exponent bits are 1, and in the fractional part you may have 1s or 0s.

The range of the float data type is 10⁻³⁸ to 10³⁸ for positive values and −10³⁸ to −10⁻³⁸ for negative values.

The values are accurate to 6 or 7 significant digits depending on the actual implementation.

Conversion of a number in the floating-point form to a decimal number

Suppose the number has the following components:

Sign bit: 1
Exponent: 1000 0011
Significant or fractional part: 1001 0010 0000 0000 0000 000

Since the exponent is bias, find out the unbiased exponent.
100 = 1000 0011 – 0111 1111 (number 4)

Represent the number as 1.1001001*2⁴

Represent the number without the exponent as 11001.001

Convert the binary number to decimal: −25.125

For double precision, you can declare the variable as double d; it is represented as

Here the fractional part occupies 52 bits from 0 to 51. The exponent part occupies 11 bits from 52 to 62 (the bias exponent is the exponent plus 011 1111 1111). The sign bit occupies bit 63. The range of double representation is +10⁻³⁰⁸ to +10³⁰⁸ and −10³⁰⁸ to −10⁻³⁰⁸. The precision is to 10 or more digits.

Formats for representing floating points

Following are the valid representions of floating points:

0.23456
2.3456E-1
.23456
.23456e-2
2.3456E-4
-.232456E-4
2345.6
23.456E2
-23456
23456e3

Following are the invalid formats:

e1
2.5e-.5
25.2-e5
2.5.3

You can determine whether a format is valid or invalid based on the following rules:

The value can include a sign, it must include a numerical part, and it may or may not have exponent part.
The numerical part can be of following form:

d.d, d., .d, d, where d is a set of digits.
If the exponent part is present, it should be represented by ‘e’ or ‘E’, which is followed by a positive or negative integer. It should not have a decimal point and there should be at least 1 digit after ‘E’.
All floating numbers have decimal points or ‘e’ (or both).
When ‘e’ or ‘E’ is used, it is called scientific notation.
When you write a constant, such as 50, it is interpreted as an integer. To interpret it as floating point you have to write it as 50.0 or 50, or 50e0.

You can use the format %f for printing floating numbers. For example, printf("%f\n", f);

%f prints output with 6 decimal places. If you want to print output with 8 columns and 3 decimal places, you can use the format %8.3f. For printing double you can use %lf.

Floating-point computation may give incorrect results in the following situations:

If the calculated value has a precision that exceeds the precision limit of the type;
If the calculated value exceeds the range allowable for the type;
If the two calculated values involve approximation then their operation may involve approximation.

Points to Remember

C provides two main floating-point representations: float (single precision) and double (double precision).
A floating-point number has a fractional part and a biased exponent.
Float occupies 4 bytes and double occupies 8 bytes

judhyn's blog

blogger Indonesia