Essential facts about floating point calculations

Floating point numbers are everywhere. It’s hard to find software that doesn’t use any. For something so essential to writing software you’d think we take great care in working with them. But generally we don’t. A lot of code treats floating point as real numbers; a lot of code produces invalid results. In this article I show several counterintuitive properties of floating point numbers. These are things you have to know in order to do calculations correctly.

x + y == x

This first rule is one of magnitude. Addition and subtraction require operands of comparable magnitude to produce meaningful results. How far apart two values are is measured by the difference in their exponents.

For example, the value 1e-10 has a very small magnitude compared to 1e10. With a typical 64-bit float we can add and subtract that small number all we want and still be left with exactly 1e10. The format doesn’t have enough precision to register a change that small.
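A minimal sketch of this absorption effect, assuming Python’s built-in float (a 64-bit IEEE 754 double on common platforms). The variable names are only illustrative.

```python
big = 1e10
small = 1e-10

# Adding the small value once leaves the large value unchanged.
absorbed = (big + small == big)

# Adding it a million times, one step at a time, still changes nothing,
# even though the true sum of the small values is about 1e-4.
total = big
for _ in range(1_000_000):
    total += small
unchanged = (total == big)

# Combining the small values first lets them grow large enough to register.
small_sum = small * 1_000_000
registered = (big + small_sum != big)

print(absorbed, unchanged, registered)
```

The last step hints at one common mitigation: combine values of similar magnitude first, so their sum becomes significant before it meets the large value.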

Any algorithm dealing with large and small values must be aware of this limitation.

n * x != x + … + x

This is a basic rule about precision that follows from the previous one. Repeatedly adding a number n times will generally not yield the same result as multiplying by n. The result can differ even if x remains significant compared to the running sum. There is almost always rounding involved; two floating point numbers rarely line up well enough that adding them yields the exact real number result.

Iterative calculations should be avoided whenever possible in floating point. Finding closed form versions of algorithms usually yields higher precision. In cases where the iteration can’t be avoided, it’s vital to understand you won’t get the expected result. A certain amount of rounding error will always accumulate.
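The difference between repeated addition and multiplication shows up with very small inputs. This sketch assumes Python’s 64-bit floats. The math.fsum call is a standard library function that tracks the lost low-order bits while summing; it is shown only as one possible mitigation when a loop can’t be replaced by a closed form.

```python
import math

x = 0.1
n = 10

# Repeated addition: every step rounds, and the errors accumulate.
added = 0.0
for _ in range(n):
    added += x

# Closed form: a single multiplication, so a single rounding.
multiplied = n * x

# Compensated summation from the standard library.
compensated = math.fsum([x] * n)

print(added)        # 0.9999999999999999
print(multiplied)   # 1.0
print(compensated)  # 1.0
```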

(x ⊕ y) ⊕ z != x ⊕ (y ⊕ z)

For many binary operators, like + or *, real numbers have an associative property. The expression can be grouped in different ways and still have the same result.

This doesn’t hold precisely in floating point. It follows from the previous rules about precision. Changing the grouping or order of operations can, and will, alter the results. Even if we’ve taken care that all our values are of similar magnitude the result can still differ.

This is one of the key reasons why we use high precision floating point. We don’t typically need the whole precision in our result, so small accumulated errors are still okay. Consider an example from my current graphics project. If I’m trying to position an object on the screen I am rounding to exact pixels. So the values of 13, 13.1, 13.00000123 are all treated the same, ending up on pixel 13. Accumulated error isn’t a problem unless it accumulates enough to modify the significant part of our result.
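The loss of associativity is easy to reproduce. The sketch below assumes Python’s 64-bit floats; the pixel values are the ones from the paragraph above, and truncating with int() merely stands in for whatever rounding a renderer really does.

```python
# Same three operands, two groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

differs = left != right
print(left)   # 0.6000000000000001
print(right)  # 0.6

# The difference sits in the last bit, so it vanishes once the result
# is reduced to the precision we actually need.
agree_when_rounded = round(left, 9) == round(right, 9)

# The pixel example: all three positions land on the same pixel.
pixels = {int(v) for v in (13, 13.1, 13.00000123)}
print(differs, agree_when_rounded, pixels)
```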

x != 0

Any formula involving division has to worry about dividing by zero. There is a lot of code with a condition such as:

if (x != 0)
    y = b / x
else
    y = 0

In general this code is broken. If x is the result of a calculation it’s unlikely to actually be zero. Due to rounding errors and precision it may be a number very close to zero instead. The division by zero is avoided, but most formulas I’ve seen equally fail when dividing by a really small number. The problem is one of magnitude again. As the divisor gets smaller the result grows in magnitude. A number with a ridiculously high magnitude may then proceed to produce garbage results in the remainder of the formula.

This rule applies to strict equality checks with any number, not just 0. A range of significance is always required in these operations, such as if ( abs(x) < 1e-6 ).

There’s one exception here. Constant values that the floating point type can represent exactly, such as 0, 1 and other small integers, will be stored exactly. This makes it possible to assign a true 0 or 1 to a variable, and even check for it later. A check of x == 0 can succeed if some code explicitly assigned a 0 value.
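A rough sketch of the guarded division described above, in Python. The helper name safe_divide, its eps threshold and its fallback value are all assumptions made for illustration; the right threshold depends entirely on the formula the result feeds into.

```python
def safe_divide(b, x, eps=1e-6, fallback=0.0):
    """Divide b by x, treating any |x| below eps as zero (hypothetical helper)."""
    if abs(x) < eps:
        return fallback
    return b / x

# Mathematically zero, but rounding leaves a tiny residue.
x = 0.1 + 0.2 - 0.3
print(x)  # 5.551115123125783e-17

naive_nonzero = (x != 0)       # the naive check lets the division through
huge = 1.0 / x                 # a result of enormous magnitude
guarded = safe_divide(1.0, x)  # the range check falls back instead

# The exception: an explicitly assigned constant really is zero.
z = 0.0
exact_zero = (z == 0)
print(naive_nonzero, huge, guarded, exact_zero)
```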

x / b = inf

If the previous rule is not heeded we often end up producing an infinity value. If the result of division is too high, out of range of the floating point, it simply becomes a special “infinity” value.

Once a variable is infinity it tends to stay that way. If x = inf then, for a finite nonzero y, x + y = inf, x / y = inf, x * y = inf (up to sign), etc. To make it even more troublesome, we also get x / x = nan and x - x = nan. nan is a special “not a number” marker and has rules of its own. It’s more final than infinity since nearly any arithmetic operation with nan results in nan.
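These special values can be explored with a short Python sketch. One caveat: Python raises ZeroDivisionError for a float division by exactly zero rather than returning infinity, so the example reaches infinity through overflow instead. Other languages may behave differently on that point.

```python
import math

# Too large for a 64-bit float: the product overflows to infinity.
x = 1e308 * 10.0
is_inf = math.isinf(x)

# Infinity is sticky under ordinary arithmetic.
still_inf = all(math.isinf(v) for v in (x + 1e300, x / 2.0, x * 2.0))

# These combinations have no meaningful value, so they produce nan.
from_sub = x - x
from_div = x / x
both_nan = math.isnan(from_sub) and math.isnan(from_div)

# nan propagates through arithmetic and is not even equal to itself.
propagated = math.isnan(from_sub + 1.0)
not_self_equal = from_sub != from_sub

print(is_inf, still_inf, both_nan, propagated, not_self_equal)
```

Because nan never compares equal to anything, a dedicated test such as math.isnan is needed to detect it.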

x != y

If we can’t compare to a specific number, like 0, then we will certainly have problems comparing two floating point numbers. The precision of calculations makes it highly unlikely that we’ll get the same number out of an algorithm unless we use the exact same inputs. If we are comparing the results of two algorithms it’ll be even less likely.

The first goal should be to avoid comparing numbers, but sometimes it can be necessary. The typical approach is to subtract the numbers and compare against an epsilon, checking that they are close enough, such as
if( abs(x - y) < 1e-6 ).

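It follows that if we can’t compare equality directly then operators like <= and >= also pose a problem. Again an epsilon can be used, yielding expressions like if( (x + 1e-6) > y ) instead of if (x >= y). With the epsilon added on this side the test still passes when x and y are equal, or differ only by rounding error.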
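The equality comparison can be sketched in Python as follows. The fixed epsilon of 1e-6 is the one used in the text; math.isclose is a standard library function that compares with a relative tolerance instead, shown here as an alternative rather than as the approach the text prescribes.

```python
import math

x = 0.1 + 0.2
y = 0.3

exact = (x == y)                 # False: the two differ in the last bit
absolute = abs(x - y) < 1e-6     # True: close enough for a fixed epsilon
relative = math.isclose(x, y)    # True: default relative tolerance is 1e-9

# A fixed epsilon does not scale. Near 1e10 adjacent 64-bit floats are
# roughly 1.9e-6 apart, so neighbours already fail the 1e-6 test.
a = 1e10
b = a + 2e-6
fixed_fails = not (abs(a - b) < 1e-6)
relative_ok = math.isclose(a, b)

print(exact, absolute, relative, fixed_fails, relative_ok)
```

Neither form is universally right: a relative tolerance breaks down when comparing against zero, where an absolute epsilon is the better fit.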

x != 0.1

Floating point numbers are binary, not decimal. Many exact base-10 numbers, like 0.1, cannot be encoded exactly in binary floating point. This holds regardless of the precision. Even a 128-bit binary float cannot encode 0.1 exactly.

Consider the fraction 1/3: its exact base-10 decimal form is an infinitely repeating fraction, 0.333… (the 3 repeats forever). The same thing happens with binary: 0.1 decimal is the repeating binary value 0.0001100110011… (the group 0011 repeats forever). No fixed number of bits can encode such an infinite sequence.

It’s important to consider this with respect to the rules about precision. Far too often we think about precision in terms of decimal: 0.1 appears to require only 1 decimal digit, but this thinking is clearly wrong. Very quickly the rules about precision creep in and mess with simple decimal operations. Just consider that with 64-bit floats 0.3 + 0.6 = 0.89999999999999991. It really took no effort at all to get a precision problem!
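To see what is actually stored for 0.1, the standard decimal module can print the exact value of a float. This is only an inspection sketch, assuming Python’s 64-bit floats.

```python
from decimal import Decimal

# Converting a float to Decimal is exact, so this prints the true stored value.
exact_tenth = Decimal(0.1)
print(exact_tenth)
# 0.1000000000000000055511151231257827021181583404541015625

# The hex form exposes the repeating bit pattern, cut off and rounded.
hex_tenth = (0.1).hex()
print(hex_tenth)  # 0x1.999999999999ap-4

# The sum from the text, shown with 17 significant digits.
total = 0.3 + 0.6
shown = format(total, ".17g")
print(shown)  # 0.89999999999999991
```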

x = expr; x != expr

This is a special gotcha arising from extended floating point precision. Several processors, including the x86 and x86-64 families, have an 80-bit floating point unit (the x87 FPU). This is great for precision, but it has a caveat. Typically the double type, or default float type, is only 64 bits.

The assignment causes a loss of precision, reducing from 80 bits to 64 bits. The comparison however can evaluate the expr again, and leave the result in 80 bits on the CPU. Then it loads the 64-bit value, promotes it to 80 bits, and does the comparison. The values aren’t equal, since the extra bits of the stored value are now just zero.

This is a hard problem to work around. In general we design algorithms so that an exact value is never important (recall all comparisons should involve ranges). But this issue does creep into innocent looking code. Some languages / compilers guarantee that assigning to a variable truncates the precision.

float( str( x ) ) != x

I’m not sure why, but the default string formatting of floating point numbers in most languages, even domain-specific math languages, does not have high enough precision to store the complete number. The string format is truncated such that if parsed again it may not equal the original number.

It’s helpful to look at why the truncation happens. A binary number can be formatted exactly in decimal, but as the exponent grows in either direction the number of digits increases. (Just work through formatting the series 2^n or 1/2^n to see for yourself). It’s also possible that the formatter’s own precision is too low to produce the exact number. Instead, formatters simply truncate the output.

It is possible to get the same number back though, even when using decimal. If the truncation length is simply extended long enough the parsing will yield the same number back. The actual decimal value may not be strictly equivalent to the floating point value, but the round-trip can work.
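A short sketch of the round-trip in Python. The "%.6g" format mimics the six significant digits that C’s printf uses by default for %g; 17 significant digits are always enough to recover a 64-bit float. Python’s own repr is not a good example of the problem, since from version 3.1 on it emits the shortest string that round-trips.

```python
x = 0.1 + 0.2

short = "%.6g" % x    # truncated to 6 significant digits
full = "%.17g" % x    # 17 significant digits

lost = float(short) != x      # the truncated text parses to another number
kept = float(full) == x       # the longer text parses back to the same bits
repr_kept = float(repr(x)) == x

print(short, full, lost, kept, repr_kept)
```

Note that the 17-digit string is still not the exact decimal value of the float; it is just long enough to identify it uniquely.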

Floating point is not real

Floating point numbers are not real numbers. They have a fixed precision and are subject to continuous rounding error. While high precision floats can help minimize the problem it never goes away completely. Simple calculations, especially those with expected decimal results, can quickly produce the wrong value. Even if we get the calculation right we still have to worry about conversion to/from decimal strings.

Not all processors have high precision floating point available. Consider that common GPUs, especially on mobile, only have 32-bit floats in the vertex shader, and some even resort to 16-bit floats for pixel shader computations! I had this problem in my game, where my lighting model resulted in an all black scene on some phones.
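The effect of these narrower formats can be approximated on a desktop by rounding values through the standard struct module, which can pack IEEE 754 32-bit ("f") and 16-bit ("e") floats. The helpers to_f32 and to_f16 are hypothetical names for this sketch; real shader arithmetic may differ in its rounding details.

```python
import struct

def to_f32(value):
    """Round a Python float to the nearest 32-bit float (illustrative helper)."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def to_f16(value):
    """Round a Python float to the nearest 16-bit float (illustrative helper)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# 32-bit floats have a 24-bit significand: 2**24 + 1 cannot be stored.
f32_integer = to_f32(16777217.0)   # 16777216.0

# 16-bit floats have an 11-bit significand: 4097 collapses to 4096.
f16_integer = to_f16(4097.0)       # 4096.0

# Very small values vanish entirely in 16 bits.
f16_tiny = to_f16(1e-8)            # 0.0

print(to_f32(0.1), f32_integer, f16_integer, f16_tiny)
```

A small intermediate value flushing to zero like this is one plausible way for a computation to go wrong at reduced precision, though the sketch makes no claim about any particular shader.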

Certainly some applications have more of a problem than others with precision, but there is no application that doesn’t have to worry at all. You can’t avoid working with it. As a programmer it is essential to know how floating point works and understand its limitations.

14 replies »

  1. Very good writeup, I still see a lot of programmers who aren’t aware of these problems. It should be noted, however, that floating point is not necessarily binary. IEEE-754 2008 contains definitions for decimal floating point numbers and operations on them, and some languages support alternative decimal floating point implementations (such as .NET’s decimal). The problems caused by floating point accuracy and un-representable decimal numbers are generally orthogonal.

  2. Nice article! There’s a little hiccup in section “x != 0.1” where you write: “0.3 + 0.6 = 8.9999999999999991”. The precision problem isn’t THAT bad…

    • My intent wasn’t to show it’s a bad problem, just that it already deviates from a perfect decimal result. That is, just to draw attention to the lack of decimal encoding.

    • Sorry, I didn’t make myself clear enough. There’s an error of factor ten in that equation. 0.3 + 0.6 isn’t supposed to be 9.

  3. I believe that in the “x != y” section, last line, the correct expression should be: if ( (x + 1e-6) > y )

    • This depends very much on what precisely the algorithm is trying to do. I have code that does both + and – comparison with an epsilon in the same line even!

      The full subtleties of how to deal with the problems I present here are beyond the scope of this article. I just wished to give a general overview of them.

    • I’m afraid I would disagree. As shown, it fails when x and y are the same number. It must see this case as true.

  4. “Most floating point numbers would require an infinite repeating sequence in decimal to represent precisely.”

    I believe this is incorrect. 1/3 can’t be represented as a decimal (base 10) number, because 3 is not a factor of 10. Some base 10 numbers cannot be represented as binary (base 2) numbers, because one of the factors of 10 (5) is not a factor of 2. But every binary number can be precisely represented as a decimal number with finite digits, because 2 is a factor of 10. See

    If anyone’s interested in further reading, Bruce Dawson has several in-depth articles on comparing floats, floating point precision, determinism, and more.

  5. I’d note that at least some C++ compilers (MSVC 2015 comes to mind) are starting to enforce the distinction between abs() and fabs(). Using the former for floating point values gets you a warning, so it’s a good habit to break.

    • I wasn’t referring to a particular `abs` function, just the one that you should be using for your language/compiler. Though I don’t see a problem with using `std::abs` in C++ for floating point. It accepts floating point values and should return the correct value.

  6. We used floating point numbers instead of real numbers in The NICE programming language. Why?

    • Perhaps because it’s not possible to actually implement real numbers. Requiring infinite precision is something a computer can’t do. The closest one could come is a symbolic library combined with rationals that retains expressions whenever an exact answer cannot be obtained. Of course that would be extremely slow, as every operation you added would increase the size of the expression.
