Saltar a contenido

Floating-point: why 0.1 + 0.2 != 0.3

>>> 0.1 + 0.2
0.30000000000000004

Every programmer hits this once. The usual reaction is "JavaScript is broken" or "Python is broken." Neither is. Almost every language follows IEEE 754, the international standard for floating-point arithmetic, and IEEE 754 cannot represent 0.1 exactly in binary - the same way decimal cannot represent 1/3 exactly. The result of 0.1 + 0.2 is the exact answer to a slightly-different question, and that is what the screen prints.

This page is the standard, the consequences, and the patterns engineers actually use to write correct code in the face of it.

1. The representation

A 32-bit "float" (float32, C's float, Java's float) is laid out as:

  bit:  31    30..23      22..0
        S    E (8 bits)   M (23 bits)

A 64-bit "double" (float64, C's double, Java's double) is laid out as:

  bit:  63    62..52      51..0
        S    E (11 bits)  M (52 bits)
  • S (sign): 1 bit. 0 means positive, 1 means negative.
  • E (exponent): an unsigned integer with a bias (127 for float32, 1023 for float64). Stored value E represents an actual exponent of E - bias.
  • M (mantissa, or significand): the fraction part of a normalized binary number 1.xxxxx. The leading 1. is implicit (and not stored), so a 23-bit mantissa gives 24 bits of precision.

The value of a normal (non-zero, non-infinite, non-NaN) float is:

  (-1)^S × 1.M × 2^(E - bias)

For 1.5 as a float32:

  • Sign: 0 (positive).
  • 1.5 in binary is 1.1 (one one, then a binary point, then one one - "one and a half").
  • Normalized: 1.1 × 2^0. So actual exponent is 0, stored exponent is 0 + 127 = 127 = 01111111.
  • Mantissa: the bits after the implicit leading 1, which is 1 followed by 22 zeros.

Bit pattern: 0 01111111 10000000000000000000000 = 0x3FC00000.

import "math"

bits := math.Float32bits(1.5)
fmt.Printf("%032b\n", bits)   // 00111111110000000000000000000000
fmt.Printf("%#x\n",   bits)   // 0x3fc00000

The bit layout is exactly what the standard says. Everything that follows is consequences of the layout.

2. Why 0.1 cannot be represented exactly

0.1 in decimal is 1/10. The denominator is 10, which factors as 2 × 5. Binary representations can exactly express only fractions whose denominator is a power of 2. Because 10 has a factor of 5, 0.1 in binary is a repeating fraction:

  0.1 in binary = 0.0001100110011001100110011...

The pattern 0011 repeats forever. The IEEE 754 representation rounds it to the nearest representable value with 52 bits of mantissa. So when you write 0.1 in Go or Python or JavaScript, the value stored is not one-tenth - it is 0.1000000000000000055511151231257827021181583404541015625, which is the closest float64 to one-tenth.

Same for 0.2. Same for 0.3. Each is approximated by the nearest representable value, and the approximation errors do not cancel. When you compute 0.1 + 0.2 you add the two approximations and get a number whose closest decimal representation is 0.30000000000000004. The result is correct for the question the computer is answering (the sum of those two specific representable numbers). It is just not the question you thought you were asking.

The fractions a binary float can represent exactly: 0.5, 0.25, 0.125, 0.0625, 0.75, 1.5, 2.0, anything you can write as a / 2^b with a and b small. Add 0.5 + 0.25 and you get exactly 0.75 - no surprise. The surprise is unique to fractions with non-power-of-2 denominators.

3. Precision: how many digits actually mean anything?

float32 has 24 bits of mantissa (23 stored + 1 implicit), which is roughly log10(2^24) ≈ 7.2 decimal digits of precision. float64 has 53 bits of mantissa, roughly log10(2^53) ≈ 15.95 decimal digits.

So when you print a float32, only the first ~7 digits are meaningful; everything after is the binary representation leaking through. When you print a float64, only the first ~16 are meaningful.

fmt.Printf("%.20f\n", float32(0.1))   // 0.10000000149011611938
fmt.Printf("%.20f\n", float64(0.1))   // 0.10000000000000000555

The "9011611938" and "5511151231" are the bits past the precision limit being interpreted as decimals. They mean nothing about the user's input; they are an artifact of how the binary approximation translates back to decimal.

The standard %g format in Go and %g in C choose just enough digits to round-trip the value (so reading and writing preserves it exactly), which is usually 9 for float32 and 17 for float64. For human-readable output you probably want %.6f or %.2f and to accept that you are showing a rounded representation.

4. The five operations IEEE 754 defines exactly

The IEEE 754 standard requires that the basic operations (+, -, *, /, sqrt) return the correctly-rounded result: as if the operation were computed with infinite precision and then rounded to the nearest representable value. This is a strong guarantee - it means cross-platform reproducibility for these five operations, given the same inputs and the same rounding mode (the default is "round to nearest, ties to even").

Transcendental functions (sin, log, exp, pow) are not required to be correctly rounded by IEEE 754. Different implementations (libm on Linux vs the math library on Windows vs Java's StrictMath) can return slightly different results for the same input. This is "table maker's dilemma" territory. If you need bit-exact reproducibility for transcendentals, use a library that guarantees it (Java's StrictMath, the CRlibm project) or stick to operations the standard does cover.

5. The special values

IEEE 754 reserves three categories of bit patterns for non-finite values:

5.1 Zero (positive and negative)

The exponent is all zeros and the mantissa is all zeros. Sign bit can be either - so +0.0 and -0.0 are distinct bit patterns that compare equal.

a, b := 0.0, -0.0
fmt.Println(a == b)                       // true
fmt.Println(math.Float64bits(a))          // 0
fmt.Println(math.Float64bits(b))          // 9223372036854775808 (0x8000000000000000)

-0.0 matters in some contexts: 1 / +0.0 == +Inf, 1 / -0.0 == -Inf. You can detect it with math.Signbit(x).

5.2 Infinity (positive and negative)

The exponent is all ones and the mantissa is all zeros. Sign bit gives positive or negative infinity.

fmt.Println(1.0 / 0.0)        // +Inf
fmt.Println(-1.0 / 0.0)       // -Inf
fmt.Println(math.Inf(1))      // +Inf
fmt.Println(math.IsInf(x, 1))

Infinities propagate through arithmetic in reasonable ways: Inf + 1 == Inf, Inf * 2 == Inf, 1 / Inf == 0. The cases that produce NaN are explicitly invalid operations (see next).

5.3 NaN (Not a Number)

The exponent is all ones and the mantissa is anything except all zeros. This means there are billions of NaN bit patterns; they all behave the same to arithmetic. Operations that produce NaN: 0/0, Inf - Inf, Inf * 0, sqrt(-1), any operation with NaN as an operand.

NaN has one bizarre and load-bearing property: NaN is not equal to anything, including itself.

x := math.NaN()
fmt.Println(x == x)           // false!  (only value in the language with this property)
fmt.Println(math.IsNaN(x))    // true    (the correct way to test)

This is intentional. NaN means "I do not know what value this should be," and "I do not know" cannot equal "I do not know." The consequence: if x != x is the portable C/C++/Java idiom for "is x a NaN." In Go and Python, use the named helper (math.IsNaN(x), math.isnan(x)).

A second consequence: NaN poisons your data. One NaN anywhere in a sum, average, or product makes the entire result NaN. Production data pipelines that touch floats need to either filter NaN out at the edge or use NaN-aware operations (np.nansum, np.nanmean in NumPy; explicit checks in Go).

5.4 Subnormals (denormals)

When the exponent reaches its minimum, IEEE 754 has a second representation mode: subnormal numbers, with the implicit-leading-1 dropped and the exponent fixed. This extends the range smoothly down to zero, avoiding a sudden "underflow gap." Subnormals are why 1e-300 * 1e-50 gives a tiny but nonzero result instead of jumping to zero.

The cost: subnormal arithmetic is often dramatically slower (sometimes by 100×) than normal arithmetic, because CPUs handle them via microcode instead of native hardware. Audio synthesis code in particular tunes itself to avoid subnormals - a denormal in a delay-line filter can spike CPU usage so badly the audio glitches. The standard workaround is "flush to zero" (FTZ) mode, which trades accuracy near zero for predictable performance.

6. The patterns engineers use to write correct float code

6.1 Compare with a tolerance, never ==

// Wrong - depends on bit-level equality
if a == b { ... }

// Right - within epsilon
const eps = 1e-9
if math.Abs(a - b) < eps { ... }

The right eps depends on the scale of a and b. For numbers near 1.0, 1e-9 is fine. For numbers near 1e10, 1e-9 is meaningless (the gap between adjacent representable floats up there is already ~1e-6). The relative-error form scales:

// Relative tolerance
if math.Abs(a - b) <= eps * math.Max(math.Abs(a), math.Abs(b)) { ... }

Go's math.Float64bits and an "ULP" (units in the last place) comparison is the strict version used in floating-point library testing. For application code, the relative-error form is usually enough.

6.2 Sum many small numbers carefully

Naive sum += x over a million values accumulates rounding error. For high precision, use Kahan summation:

func KahanSum(xs []float64) float64 {
    sum := 0.0
    c   := 0.0  // running compensation
    for _, x := range xs {
        y := x - c
        t := sum + y
        c = (t - sum) - y
        sum = t
    }
    return sum
}

Kahan summation keeps the lost low-order bits in c and adds them back. The cost is four extra additions per element; the gain is hundreds of times better precision on large sums. NumPy, pandas, and SciPy all use this (or its sister, pairwise summation) internally.

6.3 Use integers for money

Money is decimal. Floats are binary. They do not mix. The standard fix:

  • Store money as an integer count of the smallest unit (cents, not dollars; satoshis, not bitcoins).
  • Do all arithmetic in integers.
  • Convert to a display string at the edge.

Languages with a decimal type (Java's BigDecimal, Python's Decimal, C#'s decimal) are the alternative when you need actual decimal arithmetic with rounding rules a regulator will accept. Never use float or double for monetary amounts - the rounding errors compound and a customer somewhere will be charged $0.01 too much.

6.4 Beware of catastrophic cancellation

When you subtract two nearly-equal floats, the leading digits cancel and what is left is dominated by the noise in the low bits:

a := 1.0000001
b := 1.0000000
fmt.Println(a - b)  // 9.999999997324457e-08, not 1e-07

Most of the precision is gone. The fix is to rearrange the math to avoid the subtraction. For the quadratic formula:

x = (-b ± sqrt(b² - 4ac)) / 2a

When b is large and positive, the -b + sqrt(...) branch suffers catastrophic cancellation. The numerically stable form computes the other root first and uses Vieta's relations to get the small one: x_small = c / (a * x_large). This is the kind of trick scientific computing libraries (LAPACK, Eigen, BLAS) are built on - hundreds of operations rearranged to preserve precision.

7. Advanced: things that surprise even experienced programmers

7.1 Associativity is gone

For real numbers, (a + b) + c == a + (b + c). For floats, not always:

a, b, c := 1e20, -1e20, 1.0
fmt.Println((a + b) + c)   // 1.0
fmt.Println(a + (b + c))   // 0.0

In the second case, b + c is -1e20 + 1.0 which rounds back to -1e20 (the 1 is below the precision threshold at that scale). Then a + (-1e20) is exactly 0. The order in which a compiler decides to associate additions can change the result. This is why -ffast-math (which allows the compiler to reassociate) can dramatically change numerical results.

7.2 The intermediate-precision trap

On x86 in the past, the FPU computed in 80-bit extended precision and the compiler chose when to round to 64-bit double. If the rounding happened in different places between two builds, the bit-exact result of a calculation could differ. SSE2 (which uses 64-bit registers for double) made this less of a problem; the -mfpmath=sse flag is the modern default for exactly this reason. Java's strictfp modifier is the language-level promise to round consistently across platforms.

7.3 Signed zero matters for branch cuts

atan2(+0, -0) == +π, but atan2(-0, -0) == -π. The sign of zero is the only thing that distinguishes "approaching from above" vs "approaching from below" for transcendental functions with branch cuts. If you ever wonder why complex-analysis code is fussy about signs of zero, this is why.

7.4 Reproducibility across CPUs

A float64 computation done on x86 with the standard libm and the same computation on ARM with the platform libm can produce slightly different results for transcendental functions (sin, exp, log, ...) because the libm implementations are different. The IEEE 754 basic operations are reproducible; the higher-level math library is not. This is a real problem for cross-platform games (replay determinism), distributed training (gradient consistency), and any spec that requires bit-identical results across servers.

The fixes: pin a specific libm (musl's is commonly used for reproducibility), use a fixed-point representation for the deterministic parts, or accept the discrepancy and budget for "this many ULPs of difference are okay."

7.5 GPU floats are sometimes lying

Many GPU shading languages allow "fast math" modes by default, which sacrifice IEEE 754 conformance for speed. x * (1 / y) might be substituted for x / y; multiply-add might fuse into a single instruction with a different rounding behavior. NVIDIA's CUDA has --fmad=false to disable; GLSL has precise qualifiers. If a graphics or ML application gives slightly different results between CPU and GPU runs of the same calculation, the GPU's relaxed math is the first suspect.

8. The mental model to keep

  • A float is a sign, an exponent, and a normalized mantissa - (-1)^S × 1.M × 2^(E - bias).
  • Only fractions with power-of-2 denominators are representable exactly. 0.1 is not one of them.
  • float32 ≈ 7 decimal digits of precision. float64 ≈ 16. Anything you print past that is artifact.
  • +, -, *, /, sqrt are correctly rounded per IEEE 754. Transcendentals are not.
  • NaN != NaN. Test with IsNaN, never with ==.
  • Compare floats with a tolerance, never ==. The right tolerance depends on the scale.
  • Money is integers (or Decimal), never floats.
  • Catastrophic cancellation, lost associativity, and subnormal slowdowns are real production failure modes.

Floating-point is not broken; it is precisely specified, and the specification has consequences. Understanding the layout makes the "weird" behaviors predictable.

9. Further reading

  • David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991) - the canonical reference. Long but worth it.
  • IEEE 754-2008 standard - the spec itself. Dry but precise.
  • William Kahan's papers - he designed much of IEEE 754 and writes prolifically about its details.
  • Numerical Recipes (any edition) - the practical handbook of "rearrange the math to preserve precision."
  • Bit operations - useful for understanding the bit-layout discussion.
  • Two's complement - the integer counterpart to this page.