15 March 2011

Topics in Computer Engineering - Floating Point Numbers (Single) Conversion

In this post, I am going to talk about how to convert a base 10 number into a single-precision floating point number. You may have used these if you have worked with variable types such as Real, Single or Float.

The single-precision floating point number uses 32 bits to store the value of a number. The first bit is the sign, bits 2-9 hold the exponent (stored with a "127 bias") and bits 10-32 hold the significand. I will cover these terms and how they are used in the example.

Let's use the decimal (base 10) number 1972.1130 in our example. The first step will be to convert this to a binary (base 2) number. I will use the remainder method for the whole number portion and the multiplicative method for the fractional portion.

When using the remainder method on the whole number, you divide the number by the base. The remainder is the digit to record, starting from the least significant position. The whole number quotient from that division then becomes the number that gets divided by the base in the next step. Repeat this until the quotient is 0.

1972/2 = 986 r 0
986/2 = 493 r 0
493/2 = 246 r 1
246/2 = 123 r 0
123/2 = 61 r 1
61/2 = 30 r 1
30/2 = 15 r 0
15/2 = 7 r 1
7/2 = 3 r 1
3/2 = 1 r 1
1/2 = 0 r 1

Whole number portion = 11110110100 (base 2)
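
Here is a minimal Python sketch of the remainder method, in case you want to check the steps above by machine. The function name whole_to_binary is my own choice for illustration, and it assumes a non-negative integer input.

```python
def whole_to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 2)    # quotient feeds the next step
        digits.append(str(remainder))  # remainders arrive least significant first
    return "".join(reversed(digits))   # so reverse them to read most significant first


print(whole_to_binary(1972))  # 11110110100
```

The reversal at the end is the code version of "record starting from the least significant position".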

For the fractional number, take the fractional portion and multiply it by the base. Record the whole number portion of the result as the next bit. Remove the whole number portion from the result, multiply again, and continue. You will want to continue until your fractional part becomes zero, or until the total number of digits in the whole number plus the fractional side is 24 (the 23 bits that get stored plus the leading 1, which is not stored). Since the whole number has 11 digits, we will have at most 13 digits for the fraction.

0.1130 x 2 = 0.226 0
0.226 x 2 = 0.452 0
0.452 x 2 = 0.904 0
0.904 x 2 = 1.808 1
0.808 x 2 = 1.616 1
0.616 x 2 = 1.232 1
0.232 x 2 = 0.464 0
0.464 x 2 = 0.928 0
0.928 x 2 = 1.856 1
0.856 x 2 = 1.712 1
0.712 x 2 = 1.424 1
0.424 x 2 = 0.848 0
0.848 x 2 = 1.696 1

The fractional portion is .0001110011101 (base 2).
Whether or not you round changes this part slightly. If you are rounding and the fractional part left over from the last multiplication is greater than .5, you add 1 to the last bit. Here the leftover is .696, so, assuming we round, our new fractional part is .0001110011110 (base 2).
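
The multiplicative method and the rounding step can be sketched in Python as follows. The function name fraction_to_binary and its round_up parameter are assumptions of mine for illustration. I use Fraction so the decimal value .113 is held exactly, and the sketch simply raises an error in the case where rounding would carry into the whole number part.

```python
from fractions import Fraction


def fraction_to_binary(frac, bits, round_up=True):
    """Return `bits` binary digits of a fraction 0 <= frac < 1 by repeated doubling."""
    digits = ""
    for _ in range(bits):
        frac *= 2
        whole = int(frac)    # 0 or 1: the bit to record
        digits += str(whole)
        frac -= whole        # drop the whole number portion and continue
    if round_up and frac > Fraction(1, 2):
        # Leftover is more than half of the last bit, so add 1 to the last bit.
        value = int(digits, 2) + 1
        if value >= 2 ** bits:
            raise OverflowError("rounding carries into the whole number part")
        digits = format(value, "0{}b".format(bits))
    return digits


print(fraction_to_binary(Fraction(113, 1000), 13, round_up=False))  # 0001110011101
print(fraction_to_binary(Fraction(113, 1000), 13))                  # 0001110011110
```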

Now we have the number 11110110100.0001110011110 (base 2).

From here, we must get this into an exponential form with only a 1 in the whole number field. This is like using scientific notation in decimal. For instance, the decimal number 123 is also 1.23 x 10^2. A binary number works the same way, except the "x 10^n" becomes "x 2^n".

Our number is now represented as 1.11101101000001110011110 x 2^10, since the binary point moved 10 places to the left.
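
As a rough sketch, the normalization step looks like this in Python when the two halves are kept as bit strings. The function name normalize is hypothetical, and it assumes the whole number part starts with a 1 (that is, the value is at least 1).

```python
def normalize(whole_bits, fraction_bits):
    """Move the binary point to just after the leading 1.

    Returns the bits after the binary point and the power of 2.
    Assumes whole_bits begins with '1'.
    """
    if not whole_bits.startswith("1"):
        raise ValueError("this sketch only handles values >= 1")
    power = len(whole_bits) - 1                      # places the point moves left
    after_point = whole_bits[1:] + fraction_bits     # everything after the leading 1
    return after_point, power


after_point, power = normalize("11110110100", "0001110011110")
print("1." + after_point, "x 2^" + str(power))
# 1.11101101000001110011110 x 2^10
```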

Now we can start filling in the fields of the single-precision floating point number.

The first bit represents the sign of the number. That bit is 0 for positive numbers and 1 for negative numbers. Our number is positive, so the bit is 0.

Floating point value:
0

The next 8 bits hold the exponent, stored with a bias of 127. The bias lets the same field represent either very large numbers (up to "x 2^127" for normal numbers) or very small numbers (down to "x 2^-126") without a separate sign bit for the exponent. The stored value is "biased exponent = 127 + power". For this problem, the power to look at is the "x 2^10". So, the biased exponent is 127 + 10 = 137, which is 10001001 in binary.

Floating point value: (spaces are for viewing purposes only; it is actually one continuous number)
0 10001001
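
The bias arithmetic is small enough to sketch in a few lines of Python. The names BIAS, encode_exponent and decode_exponent are my own, and the range check assumes we only care about normal numbers (stored values 1 through 254).

```python
BIAS = 127


def encode_exponent(power):
    """Return the 8-bit exponent field for a given power of 2."""
    stored = power + BIAS
    if not 1 <= stored <= 254:
        raise ValueError("power is outside the normal single-precision range")
    return format(stored, "08b")


def decode_exponent(field):
    """Recover the power of 2 from an 8-bit exponent field."""
    return int(field, 2) - BIAS


print(encode_exponent(10))          # 10001001
print(decode_exponent("10001001"))  # 10
```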

Now, to get the significand field, take everything that follows the binary point in the normalized binary number. The leading whole number 1 is assumed to be there and does not take up space in the floating point number.

Floating point value:
0 10001001 11101101000001110011110

So, the value 1972.1130 (base 10) has the floating point representation 01000100111101101000001110011110 (base 2), or 44F6839E (base 16).
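
If you want to check the hand conversion, one way is to let Python's standard struct module pack the value as a 32-bit float and then look at the raw bits. This is only a verification sketch, and the helper name float_to_bits is my own.

```python
import struct


def float_to_bits(value):
    """Return the 32 bits of `value` stored as a big-endian single-precision float."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", value))
    return format(as_int, "032b")


bits = float_to_bits(1972.113)
sign, exponent, fraction = bits[0], bits[1:9], bits[9:]
print(sign, exponent, fraction)      # 0 10001001 11101101000001110011110
print(format(int(bits, 2), "08X"))   # 44F6839E
```

The three slices line up with the fields filled in above: 1 sign bit, 8 exponent bits and 23 significand bits.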

That concludes this post, but if you have any comments or questions, I will try to address them.
