Worked example - Week 8: a JMH benchmark, line by line¶
Companion to Java Mastery → Month 02 → Week 8: JMH and Microbenchmarking. JMH (Java Microbenchmark Harness) is the only sane way to benchmark JVM code; everything else lies because of JIT, dead-code elimination, escape analysis, and GC variance. This page walks one tiny benchmark from setup to interpreting output.
The question we're measuring¶
"Is String.format("%d", n) slower than Integer.toString(n), and by how much?" A plausible-sounding question that's nearly impossible to answer with System.nanoTime() wrappers.
The naive approach (don't do this)¶
long t0 = System.nanoTime();
for (int i = 0; i < 10_000_000; i++) {
String s = String.format("%d", i);
}
long t1 = System.nanoTime();
System.out.println((t1 - t0) / 10_000_000.0 + " ns/op");
Four ways this lies:
- JIT warmup. The first ~10,000 iterations run interpreted; the average is dominated by interpreter time.
- Dead-code elimination. s is never used. C2 may prove the entire loop body has no side effects and delete it. You'd measure 0.
- Constant folding. If the loop bound is provably constant, C2 may unroll and precompute.
- No statistical anything. One run, one number, no error bars.
JMH fixes all four.
The same question, with JMH¶
// FormatBench.java
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 2)
public class FormatBench {
@Param({"42", "1234567"})
public int n;
@Benchmark
public String stringFormat() {
return String.format("%d", n);
}
@Benchmark
public String integerToString() {
return Integer.toString(n);
}
}
Walk the annotations:
- @BenchmarkMode(Mode.AverageTime) - measure mean time per operation. Alternatives: Throughput, SampleTime, SingleShotTime. AverageTime is what you usually want for "how slow is this call."
- @OutputTimeUnit(TimeUnit.NANOSECONDS) - report numbers in ns. Without this, microseconds is the default, which buries small differences.
- @State(Scope.Benchmark) - the benchmark instance is reused across iterations within a fork; one instance per JVM. Scope.Thread gives one instance per benchmarking thread (use it when state shouldn't be shared).
- @Warmup(iterations = 3, time = 1) - run 3 warmup iterations of 1 second each before measuring. Lets the JIT compile, profile, recompile, and reach steady state.
- @Measurement(iterations = 5, time = 1) - 5 measured iterations of 1 second each. Used to compute mean + standard deviation.
- @Fork(value = 2) - run the entire benchmark in 2 separate JVMs. This is critical: a single JVM's JIT decisions are path-dependent. Two forks reveal variance you'd otherwise miss.
- @Param({"42", "1234567"}) - JMH will run each benchmark twice, once with n = 42 and once with n = 1234567. Lets you see whether input size matters.
- Returning String from each @Benchmark method - returning the result is how you prevent dead-code elimination. JMH has special infrastructure (Blackhole) that consumes return values so the compiler can't prove they're dead.
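The returned-value trick generalizes: when a benchmark method produces more than one value per invocation, you can accept a Blackhole parameter and consume each value explicitly. A minimal sketch, assuming the JMH dependency (org.openjdk.jmh) is on the classpath:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class BlackholeExample {
    @Benchmark
    public void bothConversions(Blackhole bh) {
        // Each consume() tells the JIT the value escapes, so
        // neither conversion can be eliminated as dead code.
        bh.consume(String.format("%d", 42));
        bh.consume(Integer.toString(42));
    }
}
```

Returning the value is simpler when there's only one result; Blackhole is the explicit form of the same mechanism.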
That's the whole correct benchmark. Run it:
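One common way to run it, assuming the project was generated from the standard JMH Maven archetype (the jar name below is the archetype's default, not something this page defined):

```shell
# Build the self-contained benchmark jar
mvn clean verify
# Run only this benchmark class; JMH treats the argument as a regex
java -jar target/benchmarks.jar FormatBench
```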
The output, narrated¶
After a couple of minutes:
Benchmark (n) Mode Cnt Score Error Units
FormatBench.integerToString 42 avgt 10 7.823 ± 0.214 ns/op
FormatBench.integerToString 1234567 avgt 10 12.404 ± 0.301 ns/op
FormatBench.stringFormat 42 avgt 10 421.302 ± 18.547 ns/op
FormatBench.stringFormat 1234567 avgt 10 428.119 ± 15.221 ns/op
Read it row by row:
- Score is the mean time per operation.
- Error is the 99.9% confidence half-width: the value after ± is the margin where you'd expect the true mean to fall.
- Cnt is the total measured-iteration count (5 iterations × 2 forks = 10).
What it tells you:
- Integer.toString(42) takes ~8 ns. Integer.toString(1234567) takes ~12 ns. The difference is the extra digit-conversion work.
- String.format("%d", n) takes ~420 ns regardless of n. Input size doesn't matter because the formatter's overhead dominates: parser, allocator, internal StringBuilder, locale lookup.
- String.format is ~50× slower than Integer.toString. The error bars don't overlap (8±0.2 vs 421±19), so the difference is real, not noise.
That's a real answer. Production code: don't use String.format for trivial integer conversions in hot paths. Use it for the cases where its formatting power earns its overhead.
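That conclusion assumes the two calls are interchangeable for this job, which they are: both produce the same string for every int. A quick sanity check (illustrative, not part of the benchmark):

```java
public class ConvertCheck {
    // Both conversion paths must agree, so the benchmark
    // compares cost only, never correctness.
    static boolean sameResult(int n) {
        String viaFormat = String.format("%d", n);
        String viaToString = Integer.toString(n);
        return viaFormat.equals(viaToString);
    }

    public static void main(String[] args) {
        System.out.println(sameResult(42));       // prints true
        System.out.println(sameResult(1234567));  // prints true
    }
}
```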
What can still trick you¶
Even JMH can lie if you don't think about:
- Coordinated omission. If the code under test blocks (your JMH thread waits on something else), JMH still reports time per completed operation but hides the queueing delay that builds up behind the stall. Real production latency is higher.
- Scope.Benchmark sharing. Two threads on Scope.Benchmark state will produce contention not visible to single-threaded benchmarks.
- Power-saving / thermal throttling. Use a dedicated benchmark machine; disable turbo boost or accept the noise; don't benchmark on a laptop on battery.
- JVM flags. Default flags are not production flags. Pass the same -XX:+UseG1GC -Xmx4g etc. you'd use in prod.
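JMH lets you pin per-fork JVM flags directly in the annotation, so the forked JVMs run under production-like settings. A sketch with illustrative flags (adjust to your actual prod config; requires the JMH dependency):

```java
import org.openjdk.jmh.annotations.*;

// jvmArgs replaces the forked JVM's flags entirely;
// jvmArgsAppend exists if you only want to add to the defaults.
@Fork(value = 2, jvmArgs = {"-XX:+UseG1GC", "-Xmx4g"})
@State(Scope.Benchmark)
public class ProdFlagsBench {
    @Benchmark
    public String integerToString() {
        return Integer.toString(42);
    }
}
```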
The trap¶
Reading the output as gospel. A 50× difference between two operations that each take only nanoseconds may be a real algorithmic gap or may be JIT measurement noise. Always check:
- Confidence intervals don't overlap.
- Multiple forks agree.
- The result is stable across machines.
If any of those don't hold, you have a benchmark problem, not a code problem.
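The interval-overlap check is simple arithmetic; a throwaway helper (illustrative, not part of JMH) makes it explicit:

```java
public class IntervalCheck {
    // Two Score ± Error intervals overlap iff each lower bound
    // sits at or below the other's upper bound. Non-overlap is
    // necessary (not sufficient) evidence the difference is real.
    static boolean overlaps(double mean1, double err1,
                            double mean2, double err2) {
        return (mean1 - err1) <= (mean2 + err2)
            && (mean2 - err2) <= (mean1 + err1);
    }

    public static void main(String[] args) {
        // Numbers from the output above: 7.823 ± 0.214 vs 421.302 ± 18.547
        System.out.println(overlaps(7.823, 0.214, 421.302, 18.547)); // prints false
    }
}
```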
Exercise¶
- Run the benchmark above. Confirm the order-of-magnitude difference.
- Add a third benchmark: String.valueOf(n). Predict its result first. Then run. Were you right?
- Add a fourth using new StringBuilder().append(n).toString(). Predict, then measure.
- For each, look at -prof gc (java -jar benchmarks.jar -prof gc FormatBench). Compare allocation rates. The String.format allocation overhead should be visible.
- (Hard) Add -prof perfasm (Linux only). Inspect the assembly emitted for the hot path of integerToString. How many instructions per call?
Related reading¶
- The main Week 8 chapter covers JMH's design and standard pitfalls.
- The Performance methodology cross-topic page places JMH in the broader context of measuring without lying.
- Glossary: Microbenchmark, Dead code elimination, JIT in the main glossary.