Introduction
Floating-point numbers represent two components: one is an exponent and the other is fraction. For example, the number 200.07 can be represented as 0.20007*103, where 0.2007 is the fraction and 3 is the exponent. In a binary form, they are represented similarly. There are two types of representation: short or single- precision floating-point number and long or double-precision floating-point number. short occupies 4 bytes or 32 bits while long occupies 8 bytes or 64 bits.
Program/Example
In C, short or single-precision floating point is represented by the data type float and appears as:
float f ;
A single-precision floating-point number is represented as follows:
Here the fractional part occupies 23 bits from 0 to 22. The exponent part occupies 8 bits from 23 to 30 (bias exponent, that is, exponent + 01111111). The sign bit occupies the 31st bit.
Suppose the decimal number is 100.25. It can be converted as follows:
-
Convert 100.25 into its equivalent binary representation: 1100100.01.
-
Then represent this number so that there is only 1 bit on the left side of the decimal point: 1.0010001*26
-
In a binary representation, exponent 6 means the number 110. Now add the bias, 0111 1111, to get the exponent: 1000 0101
Since the number is positive, the sign bit is 0. The significant, or fractional, part is:
1001 0001 0000 0000 0000 000
Note that up until the fractional part, only those bits that are on the right side of the decimal point are present. The 0s are added to the right side to make the fractional part take up 23 bits.
Special rules are applied for some numbers:
-
The number 0 is stored as all 0s, but the sign bit is 1.
-
Positive infinity is represented as all 1s in the exponent and all 0s in the fractional part with the sign bit 0.
-
Negative infinity is represented as all 1s in the exponent and all 0s in fractional part with the sign bit 1.
-
A NAN (not a number) is an invalid floating number in which all the exponent bits are 1, and in the fractional part you may have 1s or 0s.
The range of the float data type is 10−38 to 1038 for positive values and −1038 to −10−38 for negative values.
The values are accurate to 6 or 7 significant digits depending on the actual implementation.
Conversion of a number in the floating-point form to a decimal number
Suppose the number has the following components:
-
Sign bit: 1
-
Exponent: 1000 0011
-
Significant or fractional part: 1001 0010 0000 0000 0000 000
Since the exponent is bias, find out the unbiased exponent.
-
100 = 1000 0011 – 0111 1111 (number 4)
Represent the number as 1.1001001*24
Represent the number without the exponent as 11001.001
Convert the binary number to decimal: −25.125
For double precision, you can declare the variable as double d; it is represented as
Here the fractional part occupies 52 bits from 0 to 51. The exponent part occupies 11 bits from 52 to 62 (the bias exponent is the exponent plus 011 1111 1111). The sign bit occupies bit 63. The range of double representation is +10−308 to +10308 and −10308 to −10−308. The precision is to 10 or more digits.
Formats for representing floating points
Following are the valid representions of floating points:
2.3456E-1
.23456
.23456e-2
2.3456E-4
-.232456E-4
2345.6
23.456E2
-23456
23456e3
Following are the invalid formats:
e12.5e-.5
25.2-e5
2.5.3
You can determine whether a format is valid or invalid based on the following rules:
-
The value can include a sign, it must include a numerical part, and it may or may not have exponent part.
-
The numerical part can be of following form:
d.d, d., .d, d, where d is a set of digits.
-
If the exponent part is present, it should be represented by ‘e’ or ‘E’, which is followed by a positive or negative integer. It should not have a decimal point and there should be at least 1 digit after ‘E’.
-
All floating numbers have decimal points or ‘e’ (or both).
-
When ‘e’ or ‘E’ is used, it is called scientific notation.
-
When you write a constant, such as 50, it is interpreted as an integer. To interpret it as floating point you have to write it as 50.0 or 50, or 50e0.
You can use the format %f for printing floating numbers. For example, printf("%f\n", f);
%f prints output with 6 decimal places. If you want to print output with 8 columns and 3 decimal places, you can use the format %8.3f. For printing double you can use %lf.
Floating-point computation may give incorrect results in the following situations:
Points to Remember
-
C provides two main floating-point representations: float (single precision) and double (double precision).
-
A floating-point number has a fractional part and a biased exponent.
-
Float occupies 4 bytes and double occupies 8 bytes
No comments:
Post a Comment