Pretty sure, all of us have used the data type 'float' numerous times since the day we started programming. However, there are many times when we mishandle floats and doubles, get unexpected results while using float variables, or find ourselves perplexed while working with floating points. This is because of the very nature of floating point in binary.

Binary Representation of Float

In our day to day mathematics, we work on the decimal representation of numbers, which is to the base ten. However, computer arithmetic works on the binary representation of numbers, which is to the base two. One reason for choosing binary is that binary arithmetic is faster in hardware. However, it also comes with a downside: floating point representation and arithmetic in binary are inexact, and hence vulnerable to some unexpected results and behaviours.

Let's understand how we represent a floating point number in binary (base 2) given its decimal (base 10) form. An example:

Code:
inDecimal = 3.625

In binary as well, the representation includes a 'dot' separating the whole number part, i.e. '3', and the fractional part, i.e. '.625'. Conversion to binary is done separately for both parts. For the whole number in our example, it needs no explanation as to how it is represented in binary:

Code:
( 3 ) base 10 = ( 011 ) base 2

However, to determine the binary representation of the fractional part, the steps to follow are:

STEP 1: Take the fractional part and multiply it by 2.

Code:
fractional = .625
.625 * 2 = 1.250
result = 1.25

STEP 2: Keep the whole part of the 'result' of the above step. It will be either '0' or '1'. This is the first bit of the binary representation of the input number's fractional part.

Code:
.1......................

STEP 3: Next, picking just the fractional part of the 'result' of step 1, multiply it by 2.

Code:
fractional = .25
.25 * 2 = 0.50
result = 0.5

STEP 4: Again, keep the whole part of the 'result' as the next bit.

Code:
.10...................

STEP 5: Next, picking just the fractional part of the 'result' of step 3, multiply it by 2.

Code:
fractional = .5
.5 * 2 = 1.0
result = 1.0

STEP 6: Again, keep the whole part of the 'result' as the next bit.

Code:
.101

STEP 7: Since the fractional part is now zero, we are done with the binary conversion, and hence the fractional part in binary comes out to be:

Code:
.101

Therefore, combining the whole part and the fractional part:

Code:
( In decimal ) 3.625 = ( In binary ) 0011.101

Here in our example, we got lucky, as the binary conversion of the fractional part was finite. However, there are cases where it goes on infinitely. Try converting the decimal real number 0.3 into its binary form. Hence, the number of bits in the fractional part depends solely on the value being converted. This is one of the reasons which makes binary floating point arithmetic inexact. We shall be discussing more about it in further sections.
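The multiply-by-2 procedure above is easy to automate. Here is a minimal C sketch of those same steps; the helper name frac_to_binary and the limit of 16 bits are my own choices for illustration. It stops either when the fraction becomes zero or when the bit limit is hit, which is exactly what happens for 0.3.

Code:
#include <stdio.h>

/* Print the binary expansion of the fractional part 'frac'
 * (0 <= frac < 1) using the repeated multiply-by-2 method,
 * stopping after 'maxbits' bits if it does not terminate. */
static void frac_to_binary(double frac, int maxbits)
{
    int i;

    printf(".");
    for (i = 0; i < maxbits && frac > 0.0; i++) {
        frac *= 2.0;            /* STEP 1/3/5: multiply by 2          */
        if (frac >= 1.0) {      /* STEP 2/4/6: whole part is next bit */
            printf("1");
            frac -= 1.0;
        } else {
            printf("0");
        }
    }
    printf("\n");
}

int main()
{
    frac_to_binary(0.625, 16);  /* terminates: .101                 */
    frac_to_binary(0.3, 16);    /* never terminates: .0100110011... */
    return 0;
}

For 0.625 it prints .101 and stops; for 0.3 it keeps producing the repeating pattern '0011' until the bit limit cuts it off.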
Floating Point Representation in Memory

While storing floating point numbers in memory, things are different. Earlier, every architecture used to have its own way of handling floating points, making programming architecture specific wherever floating points were involved. However, with IEEE Std 754-1985, it became standard to represent a floating point number as:

Code:
(-1)^S * M * (2 ^ E)

Where,
^ is the operator for 'raised to the power of'
* is the operator for multiplication
M is the mantissa or significand, of the form "i.f", where 'i' is the integer part and 'f' is the fractional part. Though some authors avoid the term mantissa, I am okay with it.
E is the exponent
S is the sign bit

IEEE defines a 32-bit floating point (single precision) and a 64-bit floating point (double precision). Keeping the IEEE standard in mind, a 32-bit floating point (single precision) is stored in memory as follows:

1. Convert the number into its binary representation.
2. Represent it through the IEEE standard form:

Code:
(-1)^Sign * Mantissa * (2 ^ Exponent)

3. Add 127 to the exponent so that it can be stored as an unsigned value (otherwise it would be signed). This is called biasing by 127.

The sign bit is given 1 bit of memory, the mantissa 23 bits, and the biased exponent 8 bits. In memory, the fields are laid out in that order: sign, exponent, mantissa (the exact layout may vary depending upon the underlying hardware). For a 64-bit floating point (double precision), it is a 1-bit sign, an 11-bit exponent and a 52-bit mantissa.

It will be clearer with the help of an example.

In decimal, the number is 3.625. The binary representation is 11.101. The IEEE standard identifies it as:

Code:
11.101 = 1.1101 * (2^1) = (-1)^0 * 1.1101 * (2^1)

Therefore the exponent is 1, and biasing by 127:

Code:
Exponent = 1 + 127 = 128

Note that the leading '1' before the binary point is implicit and is not stored, so only the fractional bits '1101' go into the mantissa field. Hence:

Code:
Sign = 0
Mantissa = 11010000000000000000000
Exponent = 10000000

Therefore, the 32-bit number in memory would be:

Code:
0 10000000 11010000000000000000000

Float Arithmetic - Unexpected Results

As already mentioned, the number of bits needed by a floating point value can vary greatly depending upon the value. Hence, at times, to fit into a 32-bit float or a 64-bit double, the value has to be approximated. This makes floating point values inexact and gives rise to some unexpected arithmetic behaviours. Since we now understand how the underlying memory stores the value, it would be useful to understand these unexpected results, and to anticipate and handle them wisely while programming.

Comparisons

Have a look at the following code:

Code:
#include <stdio.h>

int main()
{
    float fval = 0.3;

    if (fval == 0.3) {
        printf("fval is 0.3\n");
    } else {
        printf("fval is NOT 0.3\n");
    }
    return 0;
}

What do you think the output would be? It looks pretty obvious, right? Let's compile and run:

Code:
rupali@home-OptiPlex-745:~/programs/g4e$ gcc comparision.c -Wall -o comparision
rupali@home-OptiPlex-745:~/programs/g4e$ ./comparision
fval is NOT 0.3

Surprising, isn't it? Check the binary representation of the value 0.3: it never terminates, so it has to be approximated. The constant '0.3' is a double, while 'fval' holds the float approximation; when 'fval' is promoted to double for the comparison, the two approximations do not match. To get the expected result, type cast the constant value to float (or put it in a float variable) for the comparison:

Code:
#include <stdio.h>

int main()
{
    float fval = 0.3;

    if (fval == (float)0.3) {
        printf("fval is 0.3\n");
    } else {
        printf("fval is NOT 0.3\n");
    }
    return 0;
}

Now, checking the output:

Code:
rupali@home-OptiPlex-745:~/programs/g4e$ gcc comparision.c -Wall -o comparision
rupali@home-OptiPlex-745:~/programs/g4e$ ./comparision
fval is 0.3
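The cast works here because both sides then carry the same float approximation, but it only helps when you control both operands. The more general practice is to compare floating point values within a small tolerance. Below is a minimal sketch of that idea; the EPSILON value of 1e-6 is an arbitrary choice for illustration, and a real program should pick a tolerance suited to the magnitudes involved.

Code:
#include <stdio.h>
#include <math.h>

/* An arbitrary tolerance, chosen only for illustration. */
#define EPSILON 1e-6f

int main()
{
    float fval = 0.3;

    /* Treat the values as equal if they differ by less than
     * the tolerance, instead of comparing them bit for bit. */
    if (fabsf(fval - 0.3f) < EPSILON) {
        printf("fval is (approximately) 0.3\n");
    } else {
        printf("fval is NOT 0.3\n");
    }
    return 0;
}

On some systems you may need to link with -lm for the math library.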
Assignment

Now, assigning some value to a float variable, here is the code:

Code:
#include <stdio.h>

int main()
{
    float fval = 0.0;
    int ival1 = 12;
    int ival2 = 24;

    fval = ival1 / ival2;
    printf("fval is %f\n", fval);
    return 0;
}

Mathematically, it should give 0.5, as the lvalue is capable of storing float values. Let us compile and check the result:

Code:
rupali@home-OptiPlex-745:~/programs/g4e$ gcc assign.c -o assign -Wall
rupali@home-OptiPlex-745:~/programs/g4e$ ./assign
fval is 0.000000

It computes to 0.0, but why? Well, look at the arithmetic that happened between the two ints:

Code:
ival1 / ival2

Since both operands are integers, the division is performed in integer arithmetic and yields an integer result. The integer result of 12 / 24 is zero, which is then stored in the float variable, making it 0.0. The correct way to get a floating point result is to make one of the operands a float through typecasting:

Code:
#include <stdio.h>

int main()
{
    float fval = 0.0;
    int ival1 = 12;
    int ival2 = 24;

    fval = (float)ival1 / ival2;
    printf("fval is %f\n", fval);
    return 0;
}

Now, confirming our solution and understanding, let's check the output:

Code:
rupali@home-OptiPlex-745:~/programs/g4e$ gcc assign.c -o assign -Wall
rupali@home-OptiPlex-745:~/programs/g4e$ ./assign
fval is 0.500000

Interesting Use

Check out an interesting trick to control the number of digits of precision when printing a floating point number in decimal:

Code:
#include <stdio.h>

int main()
{
    int np = 5;
    float fval = 3.27123457823;

    printf("%.*f\n", np, fval);
    return 0;
}

The output is:

Code:
3.27123

Note that the floating point value is printed with precision up to the value of the 'np' variable, i.e. with 'np' digits after the decimal point.
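The same '%.*f' trick also gives a nice way to see the approximation we discussed earlier. The following is a small sketch (the loop bounds are my own choice for illustration): printing 0.3 stored in a float with increasing precision eventually reveals the stored value, which on a typical IEEE 754 machine is about 0.30000001.

Code:
#include <stdio.h>

int main()
{
    float fval = 0.3;
    int np;

    /* Print the same float with more and more digits; beyond
     * about 7 significant digits the approximation shows up. */
    for (np = 1; np <= 10; np++)
        printf("%2d digits: %.*f\n", np, np, fval);

    return 0;
}

With 8 or more digits the output is no longer 0.30000000 but 0.30000001, which is precisely why the earlier comparison against the double constant 0.3 failed.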