Because I am bored and have spent the last two days researching, doing the math, and writing things up (because frankly everything out there is too complicated), I will document in this thread what floating point numbers really are and how a computer handles them.
Before I start I am going to cover some quick lingo.
-Bits: a single bit is a 1 or a 0, in other words one binary digit.
-Bytes: A single byte is 8 bits in length. 0000 0000 is a single byte in binary whilst 0x00 is the same byte in hex.
-Half byte: Half of a byte, also called a nibble. In binary it is 4 bits, which is exactly one hex digit.
Floating point numbers are numbers that can carry a fractional part (the digits after the decimal point) with a limited precision. Floats in most languages are single precision, which stores 23 bits of mantissa (24 bits of precision once you count a hidden leading 1 that is explained further down), or roughly 7 significant decimal digits. Most languages also have doubles, which carry more precision and are calculated the same way, but for the sake of time I will only describe single precision floating point numbers.
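To get a feel for where single precision runs out, here is a small sketch. Python is my own choice for the examples in this thread, and `to_single` is a helper name I made up; it leans on the standard `struct` module to squeeze a value through a 32-bit float and back.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single precision value."""
    # Packing with "f" stores x as a 32-bit float; unpacking reads it back.
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(to_single(16777216.0))  # 2**24 is stored exactly
print(to_single(16777217.0))  # 2**24 + 1 does not fit, so it gets rounded
print(to_single(0.1) == 0.1)  # False: the single is only close to 0.1
```

Whole numbers stop being exact just past 2**24, which is about 16.7 million, and that is where the "roughly 7 digits" rule of thumb comes from.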
When programming in a higher level language we are mostly blind to what happens in the background when our application runs: how it accesses memory, how it writes to and from the stack, and, what I will be describing here, how the processor handles floating point numbers.
CPUs work natively in binary. Hex is just a shorter way for us to write the same bits, as is any other base that is a power of 2 (octal, base 8, would work just as well). With this in mind, a plain integer can only hold whole numbers, because a row of bits has no place to mark where the decimal point goes, but this is where the guys who created floating point numbers came in.
Since a plain binary integer cannot hold less than a whole number, these mathematicians came up with a way to convert a number with decimals from base 10 into a base 2 format that can easily be converted back (exactly, when the fraction is a sum of powers of 2 like our example below, otherwise to the nearest value that fits). This format is commonly referred to as exponent and mantissa.
A single precision float in this format is 32 bits long, whatever machine it runs on, and is made of 3 binary parts. The first bit is the sign. This sign tells us if the finished number is positive (0) or negative (1). The next part is called the exponent and is 8 bits long. It took me a long while of research to see it, but unless you are familiar with base 2 math it doesn't really look like an exponent: it records how many places the binary point was shifted. And finally the third part is the mantissa. The mantissa is the guts of the format; it holds the digits of the original number and is 23 bits long.
From this explanation the following layout can be deduced (sign, exponent, mantissa):
0 00000000 00000000000000000000000
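If you want to see this layout for a real value, the sketch below pulls the three parts out of a float. It assumes Python and its standard `struct` module; `float_fields` is a name I picked for illustration, not something from a library.

```python
import struct

def float_fields(x):
    """Return the (sign, exponent, mantissa) bit fields of x as a 32-bit float."""
    # Pack x as a big-endian single, then reread the same 4 bytes as an unsigned int.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                # top bit
    exponent = (raw >> 23) & 0xFF   # next 8 bits
    mantissa = raw & 0x7FFFFF       # low 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = float_fields(0.0)
print(f"{sign:01b} {exponent:08b} {mantissa:023b}")
```

For 0.0 this prints the all-zero record shown above. Feed it any other number to see which bits light up.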
For the following examples we will use the floating point number 3.125.
To start the conversion we must first convert the decimal part (0.125) into our mantissa record. The rule is this: multiply the decimal by 2. If the result is 1.0 or more, write a 1 into the mantissa and subtract 1 from the result; otherwise write a 0. Repeat with whatever is left until it equals 0.0 or you reach 23 bits of precision. If you finish early and don't use the full 23 bits, you must pad the remaining bits with 0's.
According to this, our math and record would look like so.
0.125 * 2 = 0.25    mantissa: 0
0.25 * 2 = 0.5      mantissa: 00
0.5 * 2 = 1.0       mantissa: 001
Subtract 1 from 1.0 and we are left with 0.0, meaning we are finished. Pad the rest of the mantissa with zeros.
mantissa: 00100000000000000000000
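The multiply-by-2 rule can be sketched in a few lines of Python. The function name `fraction_bits` and the `max_bits` parameter are my own, and this version simply stops at 23 bits without rounding, so treat it as an illustration of the hand method rather than what the hardware does.

```python
def fraction_bits(frac, max_bits=23):
    """Turn a decimal fraction (0 <= frac < 1) into a string of binary digits."""
    bits = ""
    while frac != 0.0 and len(bits) < max_bits:
        frac *= 2
        if frac >= 1.0:
            bits += "1"     # we hit 1.XXX: record a 1 and subtract it
            frac -= 1.0
        else:
            bits += "0"
    return bits.ljust(max_bits, "0")  # pad with zeros if we finished early

print(fraction_bits(0.125))
```

For 0.125 this prints 001 followed by twenty zeros, the same record we worked out by hand. Try 0.1 and you will see the loop run out of bits before it ever reaches 0.0.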
Next we must convert our whole number into binary. 3 converts to 0011 in binary, so we affix this to the front of our record, giving 0011.001.
Now it is time to calculate the exponent for our exponent and mantissa combo. This is done by shifting the bits of the record to the right (in other words, moving the binary point to the left) until we reach a 1.XXX binary number. The number of places shifted is our exponent. In this example 11.001 becomes 1.1001 after only 1 shift, so our exponent is 1.
Now we can knock the leading 1 off, because it is not part of our mantissa record; when we reverse this process it is automatically re-added and assumed to have been there all along. That leaves 1001, which we pad with zeros to 23 bits.
Next we must calculate our exponent offset. The stored exponent starts at 127 (01111111) rather than 0 (00000000). It looks like a way for the math guys to make our lives difficult, but there is a good reason: numbers smaller than 1 need a negative exponent, and the offset lets the 8-bit field cover both negative and positive exponents without a sign bit of its own. From this, we add our exponent (the number of bits shifted) to 127 and store the result.
127 (01111111) + 1 = 128 (10000000)
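Here is a rough sketch of the shift and offset steps working on strings of bits. `normalize` and `BIAS` are names I made up, and the sketch only handles numbers of 1 or more (smaller numbers shift the other way and get a negative exponent, which I leave out here).

```python
BIAS = 127  # the offset added to every stored single precision exponent

def normalize(whole_bits, frac_bits):
    """Return (shift, mantissa) for a binary number written as whole.frac."""
    whole_bits = whole_bits.lstrip("0")   # 0011 -> 11
    if not whole_bits:
        raise ValueError("this sketch only handles numbers >= 1")
    shift = len(whole_bits) - 1           # places the binary point moves left
    # Drop the leading 1, keep everything after it, then cut or pad to 23 bits.
    mantissa = (whole_bits[1:] + frac_bits)[:23].ljust(23, "0")
    return shift, mantissa

shift, mantissa = normalize("0011", "001")
print(shift, format(shift + BIAS, "08b"), mantissa)
```

For 0011.001 this prints a shift of 1, a stored exponent of 10000000, and a mantissa of 1001 padded with zeros.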
Now with this calculated we can form a full record.
0 10000000 10010000000000000000000
To convert this to hex, group the bits in fours from left to right and convert each group.
0100 0000 0100 1000 0000 0000 0000 0000
0x40480000
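You don't have to take my word for the hex value. This short check (assuming Python and the standard `struct` module; the variable names are mine) builds the record by hand and then asks the machine for its own 32 bits for 3.125.

```python
import struct

# Sign, exponent, mantissa glued together exactly as in the record above.
record = "0" + "10000000" + "1001" + "0" * 19
by_hand = int(record, 2)

# Pack 3.125 as a single precision float and reread the bytes as an integer.
(by_machine,) = struct.unpack(">I", struct.pack(">f", 3.125))

print(f"0x{by_hand:08X} 0x{by_machine:08X}")
```

Both values come out as 0x40480000, so the hand method and the hardware agree.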
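This is how floating point numbers are stored by the machine. A CPU with a floating point unit works on this 32-bit form directly; the slow part is doing the conversion step by step like we just did, which is roughly what software has to do whenever it turns a decimal number you typed into a float.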
Now how to reverse this process. You simply reverse the operations that occurred.
Take your record's hex, convert it back into exponent and mantissa format, and calculate your exponent.
0 10000000 10010000000000000000000
128 (10000000) - 127 (01111111) = 1
Now you must shift your record back over to regain your whole number. First re-add the leading 1 that we took off at the beginning, which gives 1.1001. As I said, it is always assumed, so it can be left off and re-added when needed. Then shift to the left as many times as the exponent says. In this case we shift 1 time and get 11.001.
Pop off the whole number and convert. 11 (0011) is 3.
Now you must reverse your mantissa record, meaning the bits after the point, to get the decimal part back. Start with a total of 0.0 and work from right to left. For every bit, add the bit (1 or 0) to your total and then divide the total by 2.
(0.0 + 0)/2 = 0.0       bits left: 0010000000000000000000
(0.0 + 0)/2 = 0.0       bits left: 001000000000000000000
(0.0 + 0)/2 = 0.0       bits left: 00100000000000000000
(0.0 + 0)/2 = 0.0       bits left: 0010000000000000000
......
(0.0 + 1)/2 = 0.5       bits left: 00
(0.5 + 0)/2 = 0.25      bits left: 0
(0.25 + 0)/2 = 0.125    bits left: none
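The right-to-left halving is easy to sketch as a loop. `fraction_value` is a name I chose for illustration; it takes the bits after the point as a string.

```python
def fraction_value(bits):
    """Turn a string of bits after the binary point back into a decimal fraction."""
    total = 0.0
    for bit in reversed(bits):          # work from right to left
        total = (total + int(bit)) / 2  # add the bit, then halve the total
    return total

print(fraction_value("001" + "0" * 20))  # 0.125
```

Each halving pushes the bits seen so far one place further behind the point, which is why the lone 1 in the third position ends up worth 1/8.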
The last step is to combine your whole number and your decimal back together. The finished result is 3.125.
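The whole reverse trip can also be done in one go with a little arithmetic instead of shifting strings around. This is a sketch under my own assumptions: `decode_single` is a made-up name, and it only covers ordinary numbers, not zero, infinity, NaN or the very tiny values that use a stored exponent of 0.

```python
def decode_single(raw):
    """Decode a 32-bit pattern (given as an int) into a Python float."""
    sign = raw >> 31
    stored_exponent = (raw >> 23) & 0xFF
    mantissa = raw & 0x7FFFFF
    if stored_exponent in (0, 255):
        raise ValueError("zero, subnormals, infinity and NaN are not handled here")
    significand = 1 + mantissa / 2**23                    # re-add the assumed leading 1
    value = significand * 2.0 ** (stored_exponent - 127)  # undo the 127 offset
    return -value if sign else value

print(decode_single(0x40480000))  # 3.125
```

Multiplying by 2 to the power of the exponent is the same thing as shifting the point, just written as math.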
This is why you avoid floats when you don't need them. Converted by hand like this, or in software on a chip with no floating point hardware, they are slow, needing roughly 64 loop steps every time you convert to and from this record form.