You have surely used the datatype ‘float’ countless times since the day you started programming. Yet there are many times when we mishandle floats and doubles, get unexpected results from float variables, or find ourselves perplexed by floating point behaviour. This is due to the very nature of floating point in binary.


### Binary Representation of Float

In our day-to-day mathematics we work with the decimal representation of numbers, which is to the base ten. Computer systems, however, perform arithmetic on the binary representation, which is to the base two. One reason for choosing binary is that binary arithmetic is faster in hardware. The downside is that floating point representation and arithmetic in binary are inexact, and hence vulnerable to some unexpected results and behaviours.

Let us understand how a floating point number is converted from decimal (base 10) into binary (base 2).

An example:

```
inDecimal = 3.625
```

In binary as well, the representation includes a ‘dot’ separating the whole number part, i.e. ‘3’, from the fractional part, i.e. ‘.625’. The conversion to binary is done separately for the two parts. For the whole number part, how it is represented in binary needs no explanation:

```
( 3 ) base 10 = ( 011 ) base 2
```

However, to determine the binary representation of the fractional part, these are the steps to follow:

- STEP 1: Take the fractional part and multiply it by 2.

```
fractional = .625
.625 * 2 = 1.250
result = 1.25
```

- STEP 2: Keep the whole part of the ‘result’ of the above step. It is either ‘0’ or ‘1’. This is the first bit of the binary representation of the fractional part.

```
.1......................
```

- STEP 3: Next, take just the fractional part of the ‘result’ of STEP 1 and multiply it by 2.

```
fractional = .25
.25 * 2 = 0.50
result = 0.5
```

- STEP 4: Again, keep the whole part of the ‘result’ as the next bit.

```
.10...................
```

- STEP 5: Next, take just the fractional part of the ‘result’ of STEP 3 and multiply it by 2.

```
fractional = .5
.5 * 2 = 1.0
result = 1.0
```

- STEP 6: Again, keep the whole part of the ‘result’ as the next bit.

```
.101
```

- STEP 7: Since the fractional part is now zero, the conversion is done, and the fractional part in binary comes out to be

```
.101
```

Therefore, combining the whole part and the fractional part:

```
( 3.625 ) base 10 = ( 011.101 ) base 2
```

Here in our example we got lucky, as the binary conversion of the fractional part was finite. There are cases, however, where it goes on infinitely: try converting the decimal number 0.3 into its binary form. Hence the number of bits in the fractional part depends solely on the value being converted, rather than anything else. This is one of the reasons binary floating point arithmetic is inexact. We shall discuss this further in the following sections.

### Floating Point Representation in Memory

Storing floating point numbers in memory is a different matter. Earlier, every architecture had its own way of handling floating points, which made any program involving floating point architecture specific. With IEEE Std 754-1985, however, there is a standard representation of a floating point number:

```
(-1)^S * M * (2 ^ E)
```

where

- ^ is the raise-to-the-power operator
- * is the multiplication operator
- M is the mantissa, or significand, of the form “i.f”, where ‘i’ is the integer part and ‘f’ is the fractional part. Some authors avoid the term mantissa, but I am okay with it.
- E is the exponent
- S is the sign bit of the mantissa

IEEE defines 32 bit floating point (single precision) and 64 bit floating point (double precision).

Keeping in mind the IEEE standard of floating point, a 32-bit floating point number (single precision) is stored in memory as follows:

- Convert the number into its binary representation.
- Represent it in the IEEE standard form:

```
(-1)^Sign * Mantissa * (2 ^ Exponent)
```

- Add 127 to the exponent to make it unsigned (it would otherwise be signed). This is called biasing by 127.
- The sign bit takes 1 bit of memory, the mantissa 23 bits, and the exponent 8 bits (stored in its biased, unsigned form).

In memory the fields are laid out as the sign bit first, then the 8 exponent bits, then the 23 mantissa bits (the exact byte order may vary depending upon the underlying hardware).

It will be clearer with the help of an example.

In decimal, the number is 3.625.

The binary representation: 11.101

The IEEE standard identifies it as:

```
11.101 = 1.1101 * (2^1)
       = (-1)^0 * 1.1101 * (2^1)
```

Therefore the exponent is 1, and after biasing by 127:

```
Exponent = 1 + 127 = 128
```

This gives

```
Sign = 0
Exponent = 10000000
Mantissa = 11010000000000000000000
```

Therefore, the 32-bit number in memory is:

```
0 10000000 11010000000000000000000
```

### Float Arithmetic - Unexpected Results

As already mentioned, the number of bits needed by a floating point value can vary greatly with the value. Hence, at times, to fit into a 32-bit float or a 64-bit double, the value has to be approximated. This makes floating point values inexact and gives rise to unexpected arithmetic behaviour.

Now that we understand how the underlying memory holds the value, it is useful to understand these unexpected results, so that we can anticipate and handle them wisely while programming.

**Comparisons**

Have a look at the following code:

```c
#include <stdio.h>

int main()
{
    float fval = 0.3;

    if (fval == 0.3) {
        printf("fval is 0.3\n");
    } else {
        printf("fval is NOT 0.3\n");
    }
    return 0;
}
```

What do you think the output will be? It looks pretty obvious, right? Let us compile and run:

```
rupali@home-OptiPlex-745:~/programs/g4e$ gcc comparision.c -Wall -o comparision
rupali@home-OptiPlex-745:~/programs/g4e$ ./comparision
fval is NOT 0.3
```

Surprised? Check the binary representation of the value 0.3: its mantissa does not fit in the 23 available bits, so the value has to be approximated. The constant 0.3 is a double and gets a closer approximation than the float fval, and hence the two values do not match.

To get the expected result, typecast the constant to float, or put it in a float variable, before comparing.

```c
#include <stdio.h>

int main()
{
    float fval = 0.3;

    if (fval == (float)0.3) {
        printf("fval is 0.3\n");
    } else {
        printf("fval is NOT 0.3\n");
    }
    return 0;
}
```

Now, checking the output:

```
rupali@home-OptiPlex-745:~/programs/g4e$ gcc comparision.c -Wall -o comparision
rupali@home-OptiPlex-745:~/programs/g4e$ ./comparision
fval is 0.3
```

**Assignment**

Next, assigning a value to a float variable; here is the code:

```c
#include <stdio.h>

int main()
{
    float fval = 0.0;
    int ival1 = 12;
    int ival2 = 24;

    fval = ival1 / ival2;
    printf("fval is %f\n", fval);
    return 0;
}
```

Mathematically it should give 0.5, especially as the lvalue is capable of storing float values. Let us compile and check the result:

```
rupali@home-OptiPlex-745:~/programs/g4e$ gcc assign.c -o assign -Wall
rupali@home-OptiPlex-745:~/programs/g4e$ ./assign
fval is 0.000000
```

It computes to 0.0, but why?

Well, the arithmetic here happens between two ints:

```
ival1 / ival2
```

Because both operands are integers, the division is carried out in integer arithmetic and yields an integer result: zero. That zero is then stored in the float variable, making it 0.0.

The correct way to get a floating point result is to make one of the operands a float through typecasting:

```c
#include <stdio.h>

int main()
{
    float fval = 0.0;
    int ival1 = 12;
    int ival2 = 24;

    fval = (float)ival1 / ival2;
    printf("fval is %f\n", fval);
    return 0;
}
```

Now, confirming our solution and understanding, let us check the output:

```
rupali@home-OptiPlex-745:~/programs/g4e$ gcc assign.c -o assign -Wall
rupali@home-OptiPlex-745:~/programs/g4e$ ./assign
fval is 0.500000
```

### Interesting Use

Check out an interesting trick to control the number of decimal digits of precision when printing a floating point value:

```c
#include <stdio.h>

int main()
{
    int np = 5;
    float fval = 3.27123457823;

    printf("%.*f\n", np, fval);
    return 0;
}
```

The output is:

```
3.27123
```

Note that the floating point value is printed with precision up to the number of digits given by the ‘np’ variable.