# Understanding float datatype in C

Almost everyone who programs has used the `float` datatype many times. Even so, we often mishandle floats and doubles, get unexpected results from float variables, or find floating-point behaviour perplexing. This is because of the very nature of floating-point numbers in binary.
## Binary Representation of Float

In day-to-day mathematics we work with the decimal representation of numbers, which is base ten. Computer arithmetic, however, works on the binary representation of numbers, which is base two. One reason for choosing binary could be that binary arithmetic is faster. The downside is that many decimal fractions have no exact finite representation in binary, so floating-point representation and arithmetic in binary are inexact and hence vulnerable to unexpected results and behaviour.

Let's see how a floating-point number in decimal (base 10) is represented in binary (base 2). An example:

Code:
`inDecimal = 3.625`

The whole part is converted first:

Code:
`( 3 ) base 10 = ( 011 ) base 2`

The fractional part is then converted step by step:

- STEP 1: Take the fractional part and multiply it by 2.
  Code:
  `fractional = .625`
  `.625 * 2 = 1.250`
  `result = 1.25`
- STEP 2: Keep the whole part of the 'result' of the step above. It is either '0' or '1'. This is the first bit of the binary representation of the input number's fractional part.
  Code:`.1......................`
- STEP 3: Take just the fractional part of the 'result' of step 1 and multiply it by 2.
  Code:
  `fractional = .25`
  `.25 * 2 = 0.50`
  `result = 0.5`
- STEP 4: Again, keep the whole part of the 'result' as the next bit.
  Code:`.10...................`
- STEP 5: Take just the fractional part of the 'result' of step 3 and multiply it by 2.
  Code:
  `fractional = .5`
  `.5 * 2 = 1.0`
  `result = 1.0`
- STEP 6: Again, keep the whole part of the 'result' as the next bit.
  Code:`.101`
- STEP 7: Since the fractional part is now zero, the binary conversion is complete, and the fractional part in binary comes out to be
  Code:`.101`
Therefore, combining the whole part and the fractional part:

Code:
`(In decimal) 3.625`

Putting the whole part (11) in front of the fractional part (.101) gives 11.101 in binary. Hence, the number of bits in the fractional part depends solely on its value. This is one of the reasons binary floating-point arithmetic is inexact: a value whose fraction needs more bits than the format provides has to be approximated. We shall discuss this further in the following sections.

## Floating Point Representation in Memory

Storing floating-point numbers in memory is a different matter. Earlier, every architecture had its own way of handling floating points, which made programs that used them architecture specific. Since IEEE Std 754-1985, however, the standard way to represent a floating-point number is:

Code:
`(-1)^S * M * (2 ^ E)`

- `^` is the operator for raising to a power
- `*` is the operator for multiplication
- M is the mantissa or significand, of the form "i.f", where 'i' is the integer part and 'f' is the fractional part. Some authors avoid the term mantissa, but I am okay with it.
- E is the exponent
- S is the sign bit of the mantissa

IEEE defines a 32-bit floating point (single precision) and a 64-bit floating point (double precision). Following the IEEE standard, a 32-bit (single precision) floating-point number is stored in memory as follows:

- Convert the number into its binary representation.
- Represent it in the IEEE standard form.
  Code:`(-1)^Sign * Mantissa * (2 ^ Exponent)`
- Add 127 to the Exponent so that the stored value is unsigned; the exponent itself is signed. This is called biasing by 127.
- The sign bit is given 1 bit of memory, the Mantissa 23 bits, and the biased Exponent 8 bits.
The memory looks something like the following figure (the layout may vary with the underlying hardware):
http://imgs.g4estatic.com/float-c/float.png
An example makes this clearer. In decimal, the number is 3.625. Its binary representation is 11.101. The IEEE standard form is:

11.101 = 1.1101 * (2^1) = (-1)^0 * 1.1101 * (2^1)

Therefore the Exponent is 1, and biasing by 127 gives:

Code:
`Exponent = 1 + 127 = 128`

The leading 1 of the normalized Mantissa is implicit and is not stored, so only the fractional bits 1101 go into the 23-bit field. Therefore, the 32-bit number in memory would be:

Code:
`0 10000000 11010000000000000000000`

## Float Arithmetic - Unexpected Results

As already mentioned, the number of bits a value needs can vary greatly depending on the value. Hence, at times, fitting it into a 32-bit float or a 64-bit double requires approximation. This makes floating-point values inexact and leads to certain unexpected arithmetic behaviour. Now that we understand how the value is laid out in memory, it is useful to understand these unexpected results so that we can anticipate and handle them wisely while programming.

### Comparisons

Have a look at the following code.

Code:
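`#include <stdio.h>`

Code:
`rupali@home-OptiPlex-745:~/programs/g4e$ gcc comparision.c -Wall -o comparision`

An unsuffixed floating constant in C has type double, so a float variable compared against it is promoted to double, and the two approximations of the same decimal value can differ. To get the expected result, typecast the constant value to float, or put it in a float variable, before the comparison.

Code: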
`#include <stdio.h>`

Code:
`rupali@home-OptiPlex-745:~/programs/g4e$ gcc comparision.c -Wall -o comparision`

### Assignment

Here is some code that assigns a value to a float variable.

Code:
`#include <stdio.h>`

Code:
`rupali@home-OptiPlex-745:~/programs/g4e$ gcc assign.c -o assign -Wall`

When the following arithmetic is performed between two ints,

Code:
`ival1 / ival2`

the division is carried out in integer arithmetic, so the fractional part is discarded before the result is assigned to the float variable. The correct way to get a floating-point result is to make at least one of the operands a float through typecasting.

Code:
`#include <stdio.h>`

Code:
`rupali@home-OptiPlex-745:~/programs/g4e$ gcc assign.c -o assign -Wall`

## Interesting Use

Here is an interesting trick to control the number of digits of precision when printing a floating-point value in decimal.

Code:
`#include <stdio.h>`

Code:
`3.27123`
