Overclock.net › Forums › Software, Programming and Coding › Coding and Programming › floating point numbers!

floating point numbers!

Oh the beloved floating point numbers, how awesome you are but how mysterious as well.

Because I am bored and have spent the last two days researching, doing the math behind it, and documenting (because frankly everything out there is too complicated), I will document in this thread what floating point numbers really are and how a computer handles them.

Before I start I am going to cover some quick lingo.

-Bits: a single bit is a 1 or a 0; a string of bits is binary.
-Bytes: a single byte is 8 bits long. 0000 0000 is a single byte in binary, while 0x00 is a single byte in hex.
-Half byte (nibble): half of a byte, i.e. 4 bits.

Floating point numbers are numbers that carry a fractional part with a certain precision. Floats in most languages are referred to as single precision, which have 23 stored bits of mantissa (24 counting the hidden bit), or roughly 7 decimal digits of precision. In most languages there is another type called double, which carries more precision and is calculated the same way, but for the sake of time I will only describe single precision floating point numbers.
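As a quick illustration of that precision limit (using Python's `struct` module to round-trip values through the single precision format; `as_float32` is just a name I made up):

```python
import struct

def as_float32(x):
    """Round-trip a Python float (a double) through single precision."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

# 3.125 fits exactly in 23 mantissa bits, so nothing is lost
print(as_float32(3.125) == 3.125)   # True

# 0.1 is not exactly representable; single precision keeps fewer of its
# digits than double precision does, so the round trip changes the value
print(as_float32(0.1) == 0.1)       # False
```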

When programming in a higher level language we are mostly blinded to what happens in the background when our application runs: how it accesses memory, how it writes to and from the stack, and, what I will be describing here, how the processor handles floating point numbers.

Naturally CPUs enjoy reading in binary and hex (or any other base that is a power of 2; most CPUs only use base 2 and base 16, binary and hex accordingly, but other bases such as 8 would not be hard to support). With this in mind, a CPU's integer registers can only hold whole numbers; there is no direct way to store a fractional number in them. This is where the people who created floating point numbers came in.

Since a plain binary or hex integer cannot hold less than a whole number, these mathematicians came up with a way to convert a fractional number from base 10 into a base 2 format that can easily be reversed back into its original form. This format is commonly referred to as exponent and mantissa.

For a single precision float on a 32-bit machine this format takes the form of three binary parts. The first bit is the sign, which tells us if the finished number is positive (0) or negative (1). The next segment is called the exponent; it is 8 bits long, and (as a long while of research taught me) unless you are familiar with base 2 math it doesn't quite behave like the exponents you are used to. The third part is the mantissa: the guts of the format, the record holding the original number, 23 bits long.

From this explanation the following form can be deduced:
Code:
0 00000000 00000000000000000000000

For the following examples we will use the floating point number 3.125

To start the conversion to a floating point number we must first convert the fractional part into our mantissa record. The rule is: multiply the fraction by 2; if the result reaches 1.XXX, append a 1 to the mantissa and subtract 1, otherwise append a 0. Keep going until you either hit 23 bits of precision or the fraction reaches 0.0. If you finish early and don't use the full 23 bits, pad the remainder with 0's.

According to this, our math and record would look like so.
Code:
0.125 * 2 = 0.25
mantissa: 0
0.25 * 2 = 0.5
mantissa: 00
0.5 * 2 = 1.0
mantissa: 001

Subtracting 1 from 1.0 leaves us with 0.0, meaning we are finished; pad the rest of the mantissa with zeros.

mantissa: 00100000000000000000000
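The multiply-by-2 loop can be sketched in Python (`frac_bits` is a name I made up for this illustration):

```python
def frac_bits(frac, width=23):
    """Multiply-by-2 loop: turn a fraction in [0, 1) into mantissa bits."""
    bits = ''
    while frac != 0.0 and len(bits) < width:
        frac *= 2
        if frac >= 1.0:            # hit 1.XXX: record a 1 and subtract 1
            bits += '1'
            frac -= 1.0
        else:
            bits += '0'
    return bits.ljust(width, '0')  # finished early: pad with zeros

print(frac_bits(0.125))   # 00100000000000000000000
```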

Next we must convert our whole number into binary. 3 converts to 0011 in binary, so we prefix this to our record.
Code:
0011.00100000000000000000000

Now it is time to calculate the exponent for our exponent and mantissa combo. This is done by shifting the entire record to the right until we reach a 1.XXX binary number. The number of bits shifted is our exponent. In this example we only have to shift 1 bit, so our exponent is 1.
Code:
001.10010000000000000000000

Now we can knock the leading 001 off because the leading 1 is not part of our mantissa record; when we reverse this process it is assumed to be there and automatically re-added.

Next we must calculate our exponent offset. The stored exponent starts at 127 (01111111) rather than 0 (00000000); why, I couldn't tell you for certain. There is most likely a good reason, but I think of it as a way for the math guys who made this up to make our lives difficult. Anyway, we must add our exponent (the number of bits shifted) to 127 to get the exponent that is stored.

127 (01111111) + 1 = 128 (10000000)

Now with this calculated we can form a full record.
Code:
0 10000000 10010000000000000000000

To convert this to hex, group the bits in fours from left to right and simply convert each group.
Code:
0100 0000 0100 1000 0000 0000 0000 0000
0x40480000
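The whole procedure above can be sketched in Python and cross-checked against the machine's own encoding via the `struct` module. This is a rough illustration, not a real implementation: it only handles normal, nonzero values and truncates instead of rounding, and `encode` is a hypothetical helper name.

```python
import struct

def encode(x, mant_bits=23, bias=127):
    """Hand-rolled single-precision encode (normal, nonzero values only)."""
    sign = 0 if x >= 0 else 1
    x = abs(x)
    whole = int(x)
    bits = bin(whole)[2:] + '.'              # whole part in binary
    frac = x - whole
    for _ in range(mant_bits + 1):           # multiply-by-2 loop
        frac *= 2
        if frac >= 1.0:
            bits += '1'
            frac -= 1.0
        else:
            bits += '0'
    point = bits.index('.')                  # where the binary point sat
    digits = bits.replace('.', '')
    lead = digits.index('1')                 # position of the leading 1
    exp = point - lead - 1 + bias            # shift count plus the 127 offset
    mantissa = digits[lead + 1:lead + 1 + mant_bits].ljust(mant_bits, '0')
    return (sign << 31) | (exp << 23) | int(mantissa, 2)

print(hex(encode(3.125)))                                      # 0x40480000
print(hex(struct.unpack('>I', struct.pack('>f', 3.125))[0]))   # 0x40480000
```

The two printed values agree, so the hand conversion matches what the hardware stores.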

This is how floating point numbers are handled by the machine and why they are so slow. This is the only way the machine knows how to cope with numbers with decimals.

Now, how to reverse this process? You simply reverse the operations that occurred.

Take your record's hex and convert it into exponent and mantissa format, then calculate your exponent.
Code:
0 10000000 10010000000000000000000

128 (10000000) - 127 (01111111) = 1

Now you must shift your record back over to regain your whole number. Re-add the 1 that we took off at the beginning; as I said, this is always assumed and can be left off and re-added when needed. Then shift left by the exponent. In this case we are shifting 1 time.
Code:
1.10010000000000000000000

11.00100000000000000000000

Pop off the whole number and convert. 11 (0011) is 3.

Now you must reverse your mantissa record. Working from right to left and starting from a total of 0.0, for each bit add it to your total and then divide the total by 2.
Code:
(0.0 + 0)/2 = 0
0010000000000000000000
(0.0 + 0)/2 = 0
001000000000000000000
(0.0 + 0)/2 = 0
00100000000000000000
(0.0 + 0)/2 = 0
0010000000000000000
......
(0.0 + 1)/2 = 0.5
00
(0.5 + 0)/2 = 0.25
0
(0.25 + 0)/2 = 0.125

The last step is to combine your whole number and your decimal back into place. The finished result is 3.125.
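The reverse process can likewise be sketched (again an illustration for normal values only; `decode` is my name for it):

```python
import struct

def decode(bits32, bias=127):
    """Hand-rolled single-precision decode of a 32-bit pattern."""
    sign = bits32 >> 31
    exp = ((bits32 >> 23) & 0xFF) - bias        # undo the 127 offset
    mant = bits32 & 0x7FFFFF
    total = 0.0
    for i in range(23):                         # right to left through the record
        total = (total + ((mant >> i) & 1)) / 2
    value = (1 + total) * 2 ** exp              # re-add the assumed leading 1
    return -value if sign else value

print(decode(0x40480000))   # 3.125
```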

This is why you avoid floats when possible. They are slow, needing roughly 64 loop iterations every time you convert to and from this record form.

Let me clear a couple things up.

First, computers ALL operate in binary. NOT hex, octal, or anything else. When you enter a hex, decimal, or octal number, the computer runs a program to convert that number to the binary it can understand. The advantage of hex or octal is that they have a direct conversion; for example, using the following table:

HEX  OCT  BIN
0    0   0000
1    1   0001
2    2   0010
3    3   0011
4    4   0100
5    5   0101
6    6   0110
7    7   0111
8        1000
9        1001
A        1010
B        1011
C        1100
D        1101
E        1110
F        1111

The thing to note here is that each hex number is directly replaceable by
its corresponding 4-digit binary number. To do the same with octal,
just leave off the beginning number of the chart so it's a 3-digit binary
number.

For example (note that the parentheses and subscripts indicate the base of the number; otherwise you couldn't tell if the number was octal or hex):

(A4F8)16 --> 1010 0100 1111 1000  //notice the 4 sets of 4 for hex
(17573)8 --> 001 111 101 111 011  //notice 5 sets of 3 for octal
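If you want to check these by hand, Python can do the same grouping (the field widths here are just what the two examples above need):

```python
# each hex digit expands to exactly 4 bits, each octal digit to exactly 3,
# so the conversion is a direct table lookup with no arithmetic
print(format(int('A4F8', 16), '016b'))   # 1010010011111000
print(format(int('17573', 8), '015b'))   # 001111101111011
```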

Converting from decimal is much harder. In fact, a number such as 0.1 in decimal is an infinitely repeating number in binary (just like 1/3 is 0.333333333... in decimal). The algorithm here is basically repeated additions while accounting for power shifts (the standard base-conversion formula), and it's a lot more compute intensive. It's worth noting that numbers entered from a keyboard are understood by most computers to be strings. Luckily, the guys who made ASCII were smart enough to make the last 4 binary digits of the number symbols correspond to the REAL binary numbers, so no conversion work is necessary for the individual digits (though conversion of the complete number is still necessary).
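Python's `decimal` module can show the exact binary approximation that actually gets stored for 0.1 (a quick illustration of the "infinitely repeating" point):

```python
from decimal import Decimal

# the double nearest to 0.1 is not 0.1: binary can only approximate it,
# just as decimal can only approximate 1/3
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```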

A note about the exponent. For exponents you need to offer both positive and negative values. One way would be to make the first bit of the exponent part a sign bit and make the exponent a small signed integer using standard integer methods (for what it's worth I think this way's better because of the symmetry). Another way is to realize that if you split the total number of possibilities (256) roughly in half, you get a bias of 127. Subtracting 127 from 254 gives you 127 (the highest usable exponent, since the all-ones pattern is reserved for infinities and NaNs), while subtracting 127 from 1 gives you -126 (the lowest, since the all-zeroes pattern is reserved for zero and denormals). Computationally, a single subtraction is VERY simple (addition and subtraction -- which are the same to a computer, btw -- are about the easiest computations it can do). I believe this was the reason for choosing this method (despite the fact that the normal signed int method would work just as well and be just as quick). The reason for doing it this way may be due to the scaling of exponents (to be able to use the same subtraction hardware on all the different sizes rather than needing differing adders for each -- but I don't know and the spec doesn't say).
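As a quick check of the offset, here is the stored exponent field of 3.125 from the earlier example, pulled out with Python's `struct` module:

```python
import struct

# the stored exponent field is the true exponent plus the 127 bias
bits = struct.unpack('>I', struct.pack('>f', 3.125))[0]
stored = (bits >> 23) & 0xFF
print(stored, stored - 127)   # 128 1
```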

Speaking of the spec, it's short and EVERY programmer should be required to read it.

http://kfe.fjfi.cvut.cz/~klimo/nm/ieee754.pdf (1985 edition -- there's a new 2008 edition which adds a couple of things)

A final note on performance. Floating point numbers are NOT slow and are NOT handled as decimals. (Decimal literals are converted to binary when a program is compiled, or when they are entered by keyboard and labeled as non-strings; by the time they execute they are already binary floats, so this doesn't slow down performance in reality, as the speed a person can input the numbers is FAR slower than how fast a computer can convert them.) They are handled as a set of two integer fields (because computers can't actually use numbers with decimal places). You can simulate IEEE754 floating point with regular integer units and that's slow (but still fast enough that micro-controllers use it all the time), but dedicated units are FAST.

What they do is split the pipeline. For example, if you tell it to multiply two floats, it takes the first bit (the sign bit) of each and XORs them: if both are 1 or both are 0 it gives a 0, which makes the number positive (pos x pos or neg x neg gives a pos answer), and if they are opposite signs (one is 1 and the other is 0) it gives a negative answer, like it should. It then takes the next set of bits (the width varies with how many bits long the whole number is) and adds the exponents together (it can add then subtract the extra bias, or subtract 127 from both, add them together, and then add 127 back to the result for storage). Finally, the mantissa parts are put through a standard multiplication unit. All the numbers are checked for the error conditions specified in IEEE754 and then they are put back together and continue their journey.
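That split pipeline can be mimicked in software. Here is a rough Python sketch of my own (a toy, not the real hardware: normal, nonzero finite operands only, truncation instead of IEEE rounding, no special-value checks):

```python
import struct

def fields(x):
    """Split a float32 bit pattern into (sign, biased exponent, mantissa)."""
    b = struct.unpack('>I', struct.pack('>f', x))[0]
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF

def fmul(x, y):
    """Toy float32 multiply following the pipeline described above."""
    sx, ex, mx = fields(x)
    sy, ey, my = fields(y)
    s = sx ^ sy                              # XOR the sign bits
    e = ex + ey - 127                        # add exponents, drop one bias
    m = (mx | 0x800000) * (my | 0x800000)    # multiply 24-bit significands
    if m & (1 << 47):                        # product reached 1x.xxx:
        e += 1                               # renormalize by one place
        m >>= 24
    else:
        m >>= 23
    b = (s << 31) | (e << 23) | (m & 0x7FFFFF)
    return struct.unpack('>f', struct.pack('>I', b))[0]

print(fmul(3.125, 2.0))    # 6.25
print(fmul(-1.5, 2.0))     # -3.0
```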

Thanks for this, it actually answered a few of my unanswered questions that I had to begin with and corrected some of my faults.
Quote:
A final note on performance. Floating point numbers are NOT slow and are NOT handled as decimals.

When I stated they were slow, I meant compared to a normal integer. Converting to and from the float form requires quite a few more instructions than simply moving an integer around, and this was the basis of that statement. Hard-coded floats get precomputed by the assembler, but I am not aware of how they are handled during run time.
Quote:
they are entered by keyboard and labeled as non-strings. By the time they execute, they are already binary floats, so this doesn't slow down performance in reality as the speed a person can input the numbers is FAR slower than how fast a computer can convert them to binary)

Is this part about the keyboard true? The whole reason for this research is to learn how they are created, because I couldn't figure out a way to read from STDIN and convert strings to floats in x86 assembler. The closest I got to an answer from my research is to take the string read in from STDIN and pass it to atof(), which essentially converts the string to its base 10 equivalent and then produces the float, or to code a function that does this yourself. If you could clarify your statement here I would greatly appreciate it, since this was the basis of my entire research on floats.
Quote:
They are handled as a set of 2 integer numbers (because computers can't actually use numbers with decimal places).

I thought this would be it. My general thought for converting string to float from STDIN was to read in the string, split it on the period, push both halves to the stack, and call my function, which would then calculate the float based on the two halves.

If you could clarify/add more information I would appreciate it. It's hard to find information that is sensible, especially for x86 assembly on Linux.
Quote:
Originally Posted by abduct

Thanks for this it actually answered a few of my unanswered questions that I had to begin with and corrected some of my faults. [...]

10001100111000011110101001110110

Is that number a float or an int? You can't tell the difference. Instead, you must predetermine which it is (part of the basis of static typing) and then either use a floating point operation or an integer operation. The computer doesn't convert numbers from integer to float each time it needs to calculate a float. Most of the time it generates the float or integer directly and the type is never changed.

While keyboards have a bit more going on than I'm actually going to describe, they can (for our purposes) be viewed as basically just sending an 8-bit number to the CPU. ASCII characters for the decimal digits (0 - 9) are the binary numbers from 0011 0000 counting up to 0011 1001. To turn these into their numeric equivalents, all you've got to do is zero out the top nibble (4 bits). This can be done several ways, but perhaps the most intuitive is to XOR with 0011 0000. Once this is done, you have the binary form of the digit. Now let's look at a basic, non-optimized algorithm for conversion to integer.

The math for numbers is as follows:

if you have the decimal 34891

it is equal to

3 * 10000 + 4 * 1000 + 8 * 100 + 9 * 10 + 1 * 1

To get a binary number, all you need to do is (in pythonesque pseudocode):

final_number = 0

for digit in the_number:
    final_number = final_number * 10
    final_number = final_number + digit

#this loop repeats until you run out of digits to add
#at which point final_number will be your binary equivalent
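A runnable version of that pseudocode, using the XOR trick on the ASCII digits (`to_int` is just an illustrative name):

```python
def to_int(s):
    """Convert a string of ASCII digits to an integer, one digit at a time."""
    final_number = 0
    for ch in s:
        digit = ord(ch) ^ 0b00110000     # zero out the ASCII 0011 top nibble
        final_number = final_number * 10 + digit
    return final_number

print(to_int("34891"))   # 34891
```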

Converting from char to floating point

There are two possibilities. The first is that you get something like this:

-4.345e-12

And the second is where you get poorly formed numbers like

to_float(-12345) or -41.22e-12

In either case you need to convert the numbers to binary like we did above, but for the first type you can do it in separate steps, as the parts are already separated. For the second type you need to manually count the number of digits to the right of the decimal place and fold that count into the exponent (if one exists, as in the third example). Once this is done, you can convert without issue.
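A naive sketch of that parsing step in Python (illustrative only: `to_float` is a made-up helper with no error handling, and it leans on decimal powers rather than doing the binary conversion directly):

```python
def to_float(s):
    """Naive string-to-float: split off the exponent, count the fractional
    digits, and fold that count into the exponent."""
    mant, _, exp = s.lower().partition('e')
    exp = int(exp) if exp else 0
    sign = -1.0 if mant.startswith('-') else 1.0
    whole, _, frac = mant.lstrip('+-').partition('.')
    exp -= len(frac)              # each fractional digit lowers the exponent
    return sign * int((whole + frac) or '0') * 10.0 ** exp

print(to_float("42"))          # 42.0
print(to_float("-4.345e-12"))
```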

I don't recall if you covered this (it was late when I read your post), but in scientific notation a number MUST have one and ONLY one digit before the decimal place, and that digit must be NON-zero. For example:

2.345 * 10^20

is well formed while

0.2345 * 10^4

11.034 * 10^88

are both NOT well-formed (the decimal place should move one place to the right in the top one and one place to the left in the bottom one, with the exponent adjusted accordingly).

Now think about this:

In binary, ALL well-formed numbers using scientific notation MUST begin with the digit 1 (zero's the only other digit in town and it isn't valid).

Since the IEEE754 guys knew this, they realized that they didn't need to add that digit to the actual stored number and this would give them one more digit of precision. For example, if you have:

1 10000011 10000000000000000000000

it could be re-written in a more human-readable form as

-(1.10000000000000000000000)2 * 2^((10000011)2 - 127)

Notice the added 1 at the beginning
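You can verify that reading with Python's `struct` module, rebuilding the stored pattern and letting the machine decode it (which re-adds the hidden 1):

```python
import struct

# sign | exponent | mantissa, exactly as written above
bits = int('1' + '10000011' + '1' + '0' * 22, 2)

# -(1.1)2 * 2^(131 - 127) = -1.5 * 16
print(struct.unpack('>f', struct.pack('>I', bits))[0])   # -24.0
```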

Typecasting floats to normal integers ranges from difficult to impossible (and conversion to big-ints is impractical due to floats' inaccuracy). If the exponent is low, you can drop the fractional part, but if the exponent is high (e.g. 100) you run out of room (some number followed by 100 zeroes is big in decimal, but positively huge in binary). If it is possible, you must apply the exponent to the mantissa (don't forget the missing 1) and then remove the fractional parts.
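A sketch of the low-exponent case in Python (positive, normal values only; `ftoi` is a made-up name, with no overflow or sign handling):

```python
import struct

def ftoi(x):
    """Truncate a positive, normal float32 toward zero using its bit fields."""
    b = struct.unpack('>I', struct.pack('>f', x))[0]
    exp = ((b >> 23) & 0xFF) - 127       # undo the bias
    if exp < 0:
        return 0                         # magnitude below 1: whole part is 0
    sig = (b & 0x7FFFFF) | 0x800000      # re-add the missing leading 1
    shift = exp - 23                     # apply the exponent to the mantissa
    return sig << shift if shift >= 0 else sig >> -shift

print(ftoi(3.125))   # 3
print(ftoi(100.5))   # 100
```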

When converting the other way, most non-optimized algorithms come down to converting back to decimal and then converting to float. I'm not an expert in this (for the hardware design I've done, I left this to more interested individuals. The design methods are standardized and just reusing those is more pragmatic and less bug prone -- an important point in hardware design). Not to worry though. I took the time to look up a vhdl implementation of floating point that includes their methods of conversion (if you don't know, vhdl is to hardware design what C is to software design).

http://www.eda-stds.org/fphdl/float_pkg_c.vhdl

Edited by hajile - 8/3/13 at 5:25am
Bytes aren't always 8 bits. Bytes are historically defined by character encodings, and there are computers out there which are 6- and 7-bit-byte systems (that sentence reads a bit odd, but I hope you get what I mean).

I think it's fairly safe to assume that the size of a byte is unlikely to change in the future, but since you're interested in the theory here, I just wanted to highlight that the size of a byte isn't fixed at 8 bits.
I see, hajile. The way you first described it, it sounded like you could magically convert input from STDIN to floats, perhaps via a special interrupt, although I was skeptical and knew that this couldn't possibly be what you meant. I know the process it would take to convert the ASCII into floats, as well as premade functions that do this such as atof(), but the way it was originally typed it sounded like there was another "magical" way.

As for the scientific notation part, I did briefly mention it in my original post. When you calculate the exponent you have to shift the binary to the right until you are left with a 1.XX notation, which then of course is cut off from the mantissa because it is assumed to be there, and because if it were left on, the entire binary would be 33 bits long. That's why most papers describe it as the hidden bit: it is not actually included.