Worked example - Week 8: a JMH benchmark, line by line¶
Companion to Java Mastery → Month 02 → Week 8: JMH and Microbenchmarking. JMH (Java Microbenchmark Harness) is the only sane way to benchmark JVM code; everything else lies because of JIT, dead-code elimination, escape analysis, and GC variance. This page walks one tiny benchmark from setup to interpreting output.
The question we're measuring¶
"Is String.format("%d", n) slower than Integer.toString(n), and by how much?" A plausible-sounding question that's nearly impossible to answer with System.nanoTime() wrappers.
The naive approach (don't do this)¶
long t0 = System.nanoTime();
for (int i = 0; i < 10_000_000; i++) {
String s = String.format("%d", i);
}
long t1 = System.nanoTime();
System.out.println((t1 - t0) / 10_000_000.0 + " ns/op");
Four ways this lies:
- JIT warmup. The first ~10,000 iterations run interpreted; the average is dominated by interpreter time.
- Dead-code elimination. s is never used. C2 may prove the entire loop body has no side effects and delete it. You'd measure 0.
- Constant folding. If the loop bound is provably constant, C2 may unroll and precompute.
- No statistical anything. One run, one number, no error bars.
JMH fixes all four.
The same question, with JMH¶
// FormatBench.java
import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 2)
public class FormatBench {
@Param({"42", "1234567"})
public int n;
@Benchmark
public String stringFormat() {
return String.format("%d", n);
}
@Benchmark
public String integerToString() {
return Integer.toString(n);
}
}
Walk the annotations:
- @BenchmarkMode(Mode.AverageTime) - measure mean time per operation. Alternatives: Throughput, SampleTime, SingleShotTime. AverageTime is what you usually want for "how slow is this call."
- @OutputTimeUnit(TimeUnit.NANOSECONDS) - report numbers in ns. Without this, microseconds is the default, which buries small differences.
- @State(Scope.Benchmark) - the benchmark instance is reused across iterations within a fork; one instance per JVM. Scope.Thread gives one instance per benchmarking thread (use it when state shouldn't be shared).
- @Warmup(iterations = 3, time = 1) - run 3 warmup iterations of 1 second each before measuring. Lets the JIT compile, profile, recompile, and reach steady state.
- @Measurement(iterations = 5, time = 1) - 5 measured iterations of 1 second each. Used to compute mean + standard deviation.
- @Fork(value = 2) - run the entire benchmark in 2 separate JVMs. This is critical: a single JVM's JIT decisions are path-dependent. Two forks reveal variance you'd otherwise miss.
- @Param({"42", "1234567"}) - JMH will run each benchmark twice, once with n = 42 and once with n = 1234567. Lets you see whether input size matters.
- Returning String from each @Benchmark method - returning the result is how you prevent dead-code elimination. JMH has special infrastructure (Blackhole) that consumes return values so the compiler can't prove they're dead.
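The returned-value trick generalizes: when a benchmark method produces more than one value per invocation, you can accept a Blackhole parameter and consume each value explicitly. A minimal sketch, assuming the JMH dependency (org.openjdk.jmh) is on the classpath:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class BlackholeExample {
    @Benchmark
    public void bothConversions(Blackhole bh) {
        // Each consume() tells the JIT the value escapes, so
        // neither conversion can be eliminated as dead code.
        bh.consume(String.format("%d", 42));
        bh.consume(Integer.toString(42));
    }
}
```

Returning the value is simpler when there's only one result; Blackhole is the explicit form of the same mechanism.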
That's the whole correct benchmark. Run it:
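One common way to run it, assuming the project was generated from the standard JMH Maven archetype (the jar name below is the archetype's default, not something this page defined):

```shell
# Build the self-contained benchmark jar
mvn clean verify
# Run only this benchmark class; JMH treats the argument as a regex
java -jar target/benchmarks.jar FormatBench
```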
The output, narrated¶
After a couple of minutes:
Benchmark (n) Mode Cnt Score Error Units
FormatBench.integerToString 42 avgt 10 7.823 ± 0.214 ns/op
FormatBench.integerToString 1234567 avgt 10 12.404 ± 0.301 ns/op
FormatBench.stringFormat 42 avgt 10 421.302 ± 18.547 ns/op
FormatBench.stringFormat 1234567 avgt 10 428.119 ± 15.221 ns/op
Read it row by row:
- Score is the mean time per operation.
- Error is the 99.9% confidence half-width: the value after ± is the margin where you'd expect the true mean to fall.
- Cnt is the total measured-iteration count (5 iterations × 2 forks = 10).
What it tells you:
- Integer.toString(42) takes ~8 ns. Integer.toString(1234567) takes ~12 ns. The difference is the extra digit-conversion work.
- String.format("%d", n) takes ~420 ns regardless of n. Input size doesn't matter because the formatter's overhead dominates: parser, allocator, internal StringBuilder, locale lookup.
- String.format is ~50× slower than Integer.toString. The error bars don't overlap (8±0.2 vs 421±19), so the difference is real, not noise.
That's a real answer. Production code: don't use String.format for trivial integer conversions in hot paths. Use it for the cases where its formatting power earns its overhead.
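That conclusion assumes the two calls are interchangeable for this job, which they are: both produce the same string for every int. A quick sanity check (illustrative, not part of the benchmark):

```java
public class ConvertCheck {
    // Both conversion paths must agree, so the benchmark
    // compares cost only, never correctness.
    static boolean sameResult(int n) {
        String viaFormat = String.format("%d", n);
        String viaToString = Integer.toString(n);
        return viaFormat.equals(viaToString);
    }

    public static void main(String[] args) {
        System.out.println(sameResult(42));       // prints true
        System.out.println(sameResult(1234567));  // prints true
    }
}
```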
What can still trick you¶
Even JMH can lie if you don't think about:
- Coordinated omission. If the code under test blocks (your JMH thread waits on something else), JMH still reports time per completed operation but hides the queueing delay that builds up behind the stall. Real production latency is higher.
- Scope.Benchmark sharing. Two threads on Scope.Benchmark state will produce contention not visible to single-threaded benchmarks.
- Power-saving / thermal throttling. Use a dedicated benchmark machine; disable turbo boost or accept the noise; don't benchmark on a laptop on battery.
- JVM flags. Default flags are not production flags. Pass the same -XX:+UseG1GC -Xmx4g etc. you'd use in prod.
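JMH lets you pin per-fork JVM flags directly in the annotation, so the forked JVMs run under production-like settings. A sketch with illustrative flags (adjust to your actual prod config; requires the JMH dependency):

```java
import org.openjdk.jmh.annotations.*;

// jvmArgs replaces the forked JVM's flags entirely;
// jvmArgsAppend exists if you only want to add to the defaults.
@Fork(value = 2, jvmArgs = {"-XX:+UseG1GC", "-Xmx4g"})
@State(Scope.Benchmark)
public class ProdFlagsBench {
    @Benchmark
    public String integerToString() {
        return Integer.toString(42);
    }
}
```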
The trap¶
Reading the output as gospel. A 50× difference between two operations that each take only nanoseconds may be a real algorithmic gap or may be JIT measurement noise. Always check:
- Confidence intervals don't overlap.
- Multiple forks agree.
- The result is stable across machines.
If any of those don't hold, you have a benchmark problem, not a code problem.
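The interval-overlap check is simple arithmetic; a throwaway helper (illustrative, not part of JMH) makes it explicit:

```java
public class IntervalCheck {
    // Two Score ± Error intervals overlap iff each lower bound
    // sits at or below the other's upper bound. Non-overlap is
    // necessary (not sufficient) evidence the difference is real.
    static boolean overlaps(double mean1, double err1,
                            double mean2, double err2) {
        return (mean1 - err1) <= (mean2 + err2)
            && (mean2 - err2) <= (mean1 + err1);
    }

    public static void main(String[] args) {
        // Numbers from the output above: 7.823 ± 0.214 vs 421.302 ± 18.547
        System.out.println(overlaps(7.823, 0.214, 421.302, 18.547)); // prints false
    }
}
```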
Exercise¶
- Run the benchmark above. Confirm the order-of-magnitude difference.
- Add a third benchmark: String.valueOf(n). Predict its result first. Then run. Were you right?
- Add a fourth using new StringBuilder().append(n).toString(). Predict, then measure.
- For each, look at -prof gc (java -jar benchmarks.jar -prof gc FormatBench). Compare allocation rates. The String.format allocation overhead should be visible.
- (Hard) Add -prof perfasm (Linux only). Inspect the assembly emitted for the hot path of integerToString. How many instructions per call?
Related reading¶
- The main Week 8 chapter covers JMH's design and standard pitfalls.
- The Performance methodology cross-topic page places JMH in the broader context of measuring without lying.
- Glossary: Microbenchmark, Dead code elimination, JIT in the main glossary.