The Life of a Programmer


Essential facts about floating point calculations

Floating point numbers are everywhere. It’s hard to find software that doesn’t use any. For something so essential to writing software you’d think we take great care in working with them. But generally we don’t. A lot of code treats floating point as real numbers; a lot of code produces invalid results. In this article I show several counterintuitive properties of floating point numbers. These are things you have to know in order to do calculations correctly.

x + y == x

This first rule is one of magnitude. Addition and subtraction only produce meaningful results when the operands are close enough in magnitude. How close is determined by the difference in their exponents: if the gap is wider than the significand can hold, the smaller value is rounded away entirely.

For example, the value 1e-10 has a very small magnitude compared to 1e10. A typical 64-bit float carries roughly 16 significant decimal digits, so we can add and subtract that small number all we want and still be left with 1e10. The float doesn’t have enough precision to register a change 20 orders of magnitude below its own value.

Any algorithm dealing with large and small values must be aware of this limitation.
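As a minimal sketch of this rule, assuming Python's `float` (a 64-bit IEEE 754 double) and Python 3.9 or later for `math.ulp`, the following adds the small value from the example above a thousand times. The variable names are only illustrative.

```python
import math

big = 1e10
small = 1e-10

# Each addition is rounded back to the nearest representable value,
# which is still exactly 1e10, so the small value never registers.
total = big
for _ in range(1000):
    total += small

absorbed = (total == big)          # True: nothing was accumulated
also_absorbed = (big - small == big)

# The gap between 1e10 and the next representable float is far
# larger than 1e-10, which is why the addition has no effect.
spacing = math.ulp(big)

print(absorbed, also_absorbed, spacing > small)
```

Comparing `spacing` to the value being added is one way to check whether an addition can have any effect at all.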

n * x != x + … + x

This is a basic rule about precision that follows from the previous one. Repeatedly adding a number n times will generally not yield the same result as multiplying by n. The results can differ even if x remains significant compared to the running sum. Each addition rounds its result; two floating point numbers rarely line up well enough that adding them yields the exact real number result, and those rounding errors add up over the iterations.

Iterative calculations should be avoided whenever possible in floating point. Finding closed form versions of algorithms usually yields higher precision. In cases where the iteration can’t be avoided, it’s vital to understand you won’t get the expected result. A certain amount of rounding error will always accumulate.
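A small sketch of the difference, assuming Python's 64-bit `float`. The values `x = 0.1` and `n = 10` are chosen only for illustration, and `math.fsum` is shown as one option for cases where the summation can’t be replaced by a closed form.

```python
import math

x = 0.1
n = 10

# Iterative version: every += rounds, and the errors accumulate.
summed = 0.0
for _ in range(n):
    summed += x

# Closed form: a single multiplication, so a single rounding step.
product = n * x

# When a sum is unavoidable, math.fsum tracks the lost low-order
# bits and rounds only once at the end.
compensated = math.fsum([x] * n)

print(summed, product, compensated)
print(summed == product)
```

Here the loop ends just short of 1.0 while the multiplication and `math.fsum` both land on it.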

(x ⊕ y) ⊕ z != x ⊕ (y ⊕ z)

For some binary operators, like + or *, real numbers have an associative property. The expression can be grouped in different ways and have the same result.

This doesn’t hold precisely in floating point. It follows from the previous rules about precision. Changing the order of operations can, and will, alter the results. Even if we’ve taken care that all our values are of similar magnitude the result will still be different.

This is one of the key reasons why we use high precision floating point. We don’t typically need the whole precision in our result, so small accumulated errors are still okay. Consider an example from my current graphics project. If I’m trying to position an object on the screen I am rounding to exact pixels. So the values of 13, 13.1, 13.00000123 are all treated the same, ending up on pixel 13. Accumulated error isn’t a problem unless it accumulates enough to modify the significant part of our result.
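To illustrate both points, here is a minimal sketch assuming Python's 64-bit `float`. The three operands are arbitrary values of similar magnitude, and the pixel rounding uses Python's built-in `round` as a stand-in for whatever a real graphics pipeline would do.

```python
a, b, c = 0.1, 0.2, 0.3

# Same three values, same operator, different grouping.
left = (a + b) + c
right = a + (b + c)

grouping_matters = (left != right)
difference = abs(left - right)

# The discrepancy sits far below the precision we actually need.
# Rounding to whole pixels hides small errors like these.
pixels = [round(13.0), round(13.1), round(13.00000123)]

print(left, right, difference)
print(pixels)
```

The two groupings disagree only in the last bit, which is harmless as long as the final result is consumed at a much coarser precision.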

x != 0

Any formula involving division has to worry about dividing by zero. There is a lot of code with a condition such as:

if (x != 0)
    y = b / x
else
    y = 0

In general this code is broken. If x is the result of a calculation it’s unlikely to actually be zero. Due to rounding errors and precision it may be a number very close to zero instead. The division by zero is avoided, but most formulas I’ve seen equally fail when dividing by a really small number. The problem is one of magnitude again. As the divisor gets smaller the result grows in magnitude. A number with a ridiculously high magnitude may then proceed to produce garbage results in the remainder of the formula.

This rule applies to strict equality checks with any number, not just 0. A range of significance is always required in these operations, such as if ( abs(x) < 1e-6 ).

There’s one exception here. Constants that are exactly representable in the floating point format, such as 0, 1, or other small integers, are stored exactly. This makes it possible to assign a true 0 or 1 to a variable, and even check for it later. A check of x == 0 can succeed if some code explicitly assigned a 0 value.
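The following sketch contrasts the exact check with a range check, assuming Python's 64-bit `float`. `safe_divide` and its `eps` parameter are hypothetical names, and the threshold of 1e-6 is taken from the text rather than being a universally suitable value.

```python
def safe_divide(b, x, eps=1e-6):
    """Return b / x, or 0.0 when x is within eps of zero.

    safe_divide and eps are illustrative names; the fallback of 0.0
    mirrors the pseudocode above and may not suit every formula.
    """
    if abs(x) < eps:
        return 0.0
    return b / x


# Mathematically zero, but rounding leaves a tiny nonzero residue.
x = 0.1 + 0.2 - 0.3

naive = 1.0 / x if x != 0 else 0.0   # exact check lets a huge value through
guarded = safe_divide(1.0, x)        # range check falls back to 0.0

# An explicitly assigned zero is stored exactly, so this check is reliable.
z = 0.0
assigned_zero = (z == 0)

print(x, naive, guarded, assigned_zero)
```

The right threshold depends on the magnitudes a given formula works with, so it is worth choosing per calculation rather than copying a constant.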

x / b = inf

If the previous rule is not heeded we often end up producing an infinity value. If the result of division is too high, out of range of the floating point, it simply becomes a special “infinity” value.

Once a variable is infinity it tends to stay that way. If x = inf then, for any finite positive y, x + y = inf, x / y = inf, x * y = inf, etc. To make it even more troublesome, we also get x / x = nan and x - x = nan. nan is a special “not a number” marker and has rules of its own. It’s more final than infinity since virtually any arithmetic operation with nan results in nan.
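A brief sketch of these rules, assuming Python's 64-bit `float`. One Python-specific difference is worth noting: dividing a float by exactly zero raises `ZeroDivisionError` instead of returning infinity, so the overflow below is produced by dividing by a very small number.

```python
import math

inf = float("inf")

# A result too large for a 64-bit float overflows to infinity.
overflow = 1e308 / 1e-10

# Infinity survives ordinary arithmetic with finite positive values.
still_inf = [inf + 1.0, inf * 2.0, inf / 2.0]

# Combining infinity with itself yields nan.
nan = inf - inf
also_nan = inf / inf

# nan propagates through arithmetic, and is not even equal to itself.
propagated = nan + 1.0
self_equal = (nan == nan)

print(overflow, still_inf, nan, also_nan, propagated, self_equal)
```

Because nan never compares equal to anything, `math.isnan` and `math.isinf` are the reliable way to detect these values.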

x != y

If we can’t compare to a specific number, like 0, then we will certainly have problems comparing two floating point numbers. The precision of calculations makes it highly unlikely that we’ll get the same number out of an algorithm unless we use the exact same inputs. If we are comparing the results of two algorithms it’ll be even less likely.

The first goal should be to avoid comparing numbers, but sometimes it can be necessary. The typical approach is to subtract the numbers and compare against an epsilon, checking that they are close enough, such as
if( abs(x - y) < 1e-6 ).

It follows that if we can’t compare equality directly then operators like <= and >= also pose a problem. Again an epsilon can be used, yielding expressions like if( (x + 1e-6) > y ) instead of if (x >= y), so that values within the epsilon of each other still count as equal.
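A sketch of the epsilon comparison, assuming Python's 64-bit `float`. The relative comparison with `math.isclose` is an addition that the text doesn’t cover; it is included because a fixed absolute epsilon stops working once the values themselves become large.

```python
import math

x = 0.1 + 0.2
y = 0.3

exact = (x == y)                  # fails: the values differ in the last bit
absolute = abs(x - y) < 1e-6      # the epsilon check from the text

# For large values, adjacent floats are further apart than 1e-6,
# so a fixed absolute epsilon rejects numbers that are neighbours.
a = 1e16
b = a + 2.0                       # the next representable float after 1e16
absolute_large = abs(a - b) < 1e-6
relative_large = math.isclose(a, b, rel_tol=1e-9)

print(exact, absolute, absolute_large, relative_large)
```

Which tolerance to use, absolute or relative, depends on the range of values the calculation produces.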

x != 0.1

Floating point numbers are binary, not decimal. Several exact base-10 numbers, like 0.1, cannot be encoded exactly in binary floating point. This holds regardless of the precision. Even a 128-bit binary float cannot encode 0.1 exactly.

Consider the fraction 1/3. Its exact base-10 decimal form is an infinitely repeating fraction: \(0.\overline{3}\). The same thing happens with binary: 0.1 decimal is a repeating binary value \(0.0\overline{0011}\). It’s obvious that no fixed number of bits can encode such infinite sequences.

It’s important to consider this with respect to the rules about precision. Far too often we think about precision in terms of decimal: 0.1 appears to require only 1 decimal digit, but this thinking is clearly wrong. Very quickly the rules about precision creep in and mess with simple decimal operations. Just consider that with 64-bit floats 0.3 + 0.6 = 0.89999999999999991. It really took no effort at all to get a precision problem!
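To see what is actually stored, here is a short sketch assuming Python's 64-bit `float`. It uses the standard `decimal` and `fractions` modules, which can convert a float to its exact value without further rounding.

```python
from decimal import Decimal
from fractions import Fraction

# Converting the float 0.1 exposes the binary value really stored.
stored_tenth = Decimal(0.1)

# As a fraction, the stored value has a power-of-two denominator,
# so it cannot be equal to 1/10.
is_exact = (Fraction(0.1) == Fraction(1, 10))

# The example from the text, printed with 17 significant digits.
total = 0.3 + 0.6
shown = format(total, ".17g")

print(stored_tenth)
print(is_exact, total == 0.9, shown)
```

The long decimal expansion printed for `stored_tenth` is the nearest representable binary value to 0.1, not 0.1 itself.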

x = expr; x != expr

This is a special gotcha arising from extended floating point precision. Several processors, including the x86 and x86-64 families, have an 80-bit floating point unit. This is great for precision, but it has a caveat. Typically the double type, or default float type, is only 64 bits.

The assignment causes a loss of precision, reducing from 80 bits to 64 bits. The comparison however can evaluate the expr again, and leave the result in 80 bits on the CPU. Then it loads the 64-bit value, promotes it to 80 bits, and does the comparison. The values aren’t equal since the extra bits of the promoted value are now all zero.

This is a hard problem to work around. In general we design algorithms so that an exact value is never important (recall all comparisons should involve ranges). But this issue does creep into innocent looking code. Some languages / compilers guarantee that assigning to a variable truncates the precision.

float( str( x ) ) != x

I’m not sure why, but the default string formatting of floating point numbers in most languages, even domain-specific math languages, does not have high enough precision to store the complete number. The string format is truncated such that if parsed again it may not equal the original number.

It’s helpful to look at why the truncation happens. A binary number can be formatted exactly in decimal, but the number of digits grows as the magnitude becomes very large or very small. (Just work through formatting the series 2^n or 1/2^n to see for yourself). It’s also possible that the formatter itself doesn’t have enough precision to produce the exact number. Instead it simply truncates the output.

It is possible to get the same number back though, even when using decimal. If the truncation length is simply extended long enough the parsing will yield the same number back. The actual decimal value may not be strictly equivalent to the floating point value, but the round-trip can work.
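A sketch of the round trip, assuming Python's 64-bit `float`. The 6-digit format stands in for a typical short default; note that Python's own `repr` already emits the shortest string that parses back to the same value, so the problem only shows up here when a shorter format is requested explicitly.

```python
x = 0.1 + 0.2

# A short format drops digits that distinguish x from its neighbours.
short = format(x, ".6g")
short_round_trip = (float(short) == x)

# 17 significant digits are enough to identify any 64-bit float.
full = format(x, ".17g")
full_round_trip = (float(full) == x)

# Python's repr picks the shortest string that still round-trips.
repr_round_trip = (float(repr(x)) == x)

print(short, short_round_trip)
print(full, full_round_trip, repr_round_trip)
```

The 17-digit string is not the exact decimal value of the float, but it is long enough that parsing it returns the same bits.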

Floating point is not real

Floating point numbers are not real numbers. They have a fixed precision and are subject to continuous rounding error. While high precision floats can help minimize the problem, it never goes away completely. Simple calculations, especially those with expected decimal results, can quickly produce the wrong value. Even if we get the calculation right we still have to worry about conversion to/from decimal strings.

Not all processors have high precision floating point available. Consider that common graphics GPUs, especially on mobile, only have 32-bit floats in the vertex shader, and some even resort to 16-bit floats for pixel shader computations! I had this problem in my game, where my lighting model resulted in an all black scene on some phones.
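To get a feel for how much coarser these formats are, the sketch below rounds values through 32-bit and 16-bit encodings using Python's standard `struct` module. This only emulates the storage formats on the CPU; it is not a model of how any particular GPU evaluates shaders, and the helper names are hypothetical.

```python
import struct


def to_float32(value):
    """Round a Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


def to_float16(value):
    """Round a Python float to the nearest 16-bit float and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


# The same decimal constant at three precisions.
tenth64 = 0.1
tenth32 = to_float32(0.1)
tenth16 = to_float16(0.1)

# With 16 bits the x + y == x rule already bites at small integers:
# 2049 is not representable, so adding 1 to 2048 changes nothing.
absorbed = (to_float16(2048.0 + 1.0) == 2048.0)

print(tenth64, tenth32, tenth16, absorbed)
```

Errors that are invisible in 64-bit code can reach the third or fourth decimal digit at 16 bits.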

Certainly some applications have more of a problem than others with precision, but there is no application that doesn’t have to worry at all. You can’t avoid working with it. As a programmer it is essential to know how floating point works and understand its limitations.

