13 - Profiling basics

What this session is

About an hour. Chapter 12 said "measure, don't guess" - this is how you measure. You'll learn to find where a program actually spends its time and memory using JFR (the profiler built into the JDK), how to write a benchmark that doesn't lie (JMH), and how to read a thread dump to diagnose a hang or deadlock. By the end, "profile it" is something you can actually do instead of a thing experts say.

The golden rule, restated

You cannot find a performance problem by reading code. Human intuition about hot spots is wrong far more often than right - the bottleneck is routinely in a place nobody suspected (a logging call, a regex recompiled per request, an accidental O(n²)). The only reliable method is to measure the running program and let the data point at the problem. Everything in this chapter is a way to get that data.

Quick-and-dirty timing (and why it's not enough)

The crudest measurement, useful for huge differences (chapter 12's boxing demo):

long start = System.nanoTime();
doWork();
long elapsedMs = (System.nanoTime() - start) / 1_000_000;
System.out.println("took " + elapsedMs + " ms");

This is fine for spotting order-of-magnitude differences. But for anything subtle it lies, because of how the JVM runs:

  • JIT warmup. Java code starts interpreted and gets compiled to optimized machine code only after running enough times. Your first measurement is of slow, un-compiled code - not representative of steady state.
  • Dead-code elimination. If the JIT proves doWork()'s result is unused and the call has no side effects, it may delete the call entirely - you measure nothing.
  • GC pauses land randomly in your timing window, adding noise.

So nanoTime timing is a smoke detector, not a diagnostic. For real measurement you need a profiler (to find where time goes) and JMH (to benchmark specific code correctly).
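If you do reach for nanoTime, two of the three problems above can at least be softened. The sketch below is one way to do that, not a replacement for JMH: it runs the task untimed first (warmup), writes every result to a volatile field so the work cannot be discarded as dead code, and reports the best of several runs to reduce GC noise. The names CrudeTimer, bestNanos and sumTo are made up for this example, and sumTo is only a stand-in for doWork().

```java
import java.util.function.LongSupplier;

final class CrudeTimer {

    // Volatile sink: storing each result here keeps the JIT from treating the work as dead code.
    static volatile long sink;

    /** Runs the task untimed warmupRuns times, then returns the fastest of timedRuns, in nanoseconds. */
    static long bestNanos(LongSupplier task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            sink = task.getAsLong();               // warmup: gives the JIT a chance to compile the code
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            sink = task.getAsLong();               // the result is used, so the call cannot be removed
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    /** Hypothetical workload standing in for doWork(): sums 0..n-1. */
    static long sumTo(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) total += i;
        return total;
    }

    public static void main(String[] args) {
        long ns = bestNanos(() -> sumTo(1_000_000), 20, 10);
        System.out.println("best of 10 runs: " + ns / 1_000 + " us");
    }
}
```

The warmup count (20) and run count (10) are arbitrary. This is still a smoke detector: it has no forks and no error bars, and the JIT can still optimize a toy loop in ways real code would not allow.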

JFR: the profiler in your JDK

Java Flight Recorder (JFR) is a low-overhead profiler built into the JDK - no install, typically around 1% overhead with the default settings, and designed to be safe to run in production. It records what the JVM is doing (which methods run, what's allocated, GC activity, locks) into a file you analyze afterward.

Start a recording when launching your app:

java -XX:StartFlightRecording=duration=60s,filename=app.jfr -jar app.jar

Or attach to a running process with jcmd:

jcmd <pid> JFR.start duration=60s filename=app.jfr
jcmd <pid> JFR.dump filename=app.jfr        # dump what's recorded so far

Then open app.jfr in JDK Mission Control (JMC) - a free GUI (or VisualVM, or IntelliJ's profiler which uses JFR underneath). What you look at:

  • The hot-methods / flame graph view - which methods consumed the most CPU. This is the "where does time go" answer. The widest bars are your hot spots.
  • The allocation view - which methods allocated the most memory (the chapter 08/12 boxing and garbage culprits). "Who's creating all this garbage?"
  • The GC view - how often GC ran, how long pauses were, whether you're under memory pressure.
  • The lock/contention view - threads waiting on locks (chapter 10 contention).
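A recording can also be started from code through the jdk.jfr API, which is handy when you want to record one specific operation instead of the whole run. The following is a minimal sketch under a few assumptions: the class name JfrFromCode, the busyWork workload and the file name are invented for the example, and the 10 ms sampling period is an arbitrary choice. The two event names are built-in JFR events.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordingFile;

final class JfrFromCode {

    static volatile double sink;   // keeps busyWork's result alive

    /** Records while the task runs, then writes the recording to target. */
    static Path record(Runnable task, Path target) {
        try (Recording recording = new Recording()) {
            recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(10)); // CPU samples
            recording.enable("jdk.GarbageCollection");                                 // GC activity
            recording.start();
            task.run();
            recording.stop();
            recording.dump(target);            // same file format as -XX:StartFlightRecording produces
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file back and counts the events in it. */
    static int eventCount(Path file) {
        try {
            return RecordingFile.readAllEvents(file).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical workload, only here so the sampler has something to see. */
    static double busyWork(int rounds) {
        double acc = 0;
        for (int i = 1; i <= rounds; i++) acc += Math.sqrt(i);
        sink = acc;
        return acc;
    }

    public static void main(String[] args) {
        Path out = record(() -> busyWork(50_000_000), Path.of("from-code.jfr"));
        System.out.println("wrote " + out.toAbsolutePath() + " with " + eventCount(out) + " events");
    }
}
```

The resulting file opens in JDK Mission Control like any other recording. How many events it contains depends on how long the task ran, so a very short task may produce few or none.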

Reading a flame graph

A flame graph is the key skill. Each box is a method; its width is how much total time was spent in it (and everything it called). Boxes stack to show the call hierarchy - a method sits on top of its caller.

How to read it: scan for the widest boxes. A wide box near the top (a "plateau") is a method burning CPU directly - your hot spot. Click it, see who calls it, decide if it can be made faster or called less. You're not reading every box; you're finding the few wide ones that dominate. Often a single surprising method is 60% of the width - that's your 3% from chapter 12, found.
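To make "width" and "plateau" concrete, here is a small sketch of the arithmetic behind a flame graph. A sampling profiler collects stack snapshots; a box's width is the number of samples in which the method appears anywhere on the stack, and a plateau is a method that is often the topmost frame. The class FlameWidths and the sample data are invented for illustration, and real tools build a full call tree rather than flat counts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FlameWidths {

    /** Samples in which the method appears anywhere on the stack: the width of its box. */
    static Map<String, Integer> totalSamples(List<List<String>> stacks) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> stack : stacks) {
            // Count each method once per sample, even if recursion puts it on the stack twice.
            for (String method : new LinkedHashSet<>(stack)) {
                counts.merge(method, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Samples in which the method is the topmost frame: time spent in its own code. */
    static Map<String, Integer> selfSamples(List<List<String>> stacks) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> stack : stacks) {
            if (!stack.isEmpty()) {
                counts.merge(stack.get(stack.size() - 1), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Ten made-up samples, each listed from the outermost caller to the running method. */
    static List<List<String>> demoSamples() {
        List<List<String>> stacks = new ArrayList<>();
        stacks.addAll(Collections.nCopies(6, List.of("main", "handle", "parse")));
        stacks.addAll(Collections.nCopies(3, List.of("main", "handle", "log")));
        stacks.add(List.of("main", "handle"));
        return stacks;
    }

    public static void main(String[] args) {
        List<List<String>> stacks = demoSamples();
        System.out.println("total: " + totalSamples(stacks));
        System.out.println("self:  " + selfSamples(stacks));
    }
}
```

With these samples, main and handle are as wide as the whole graph (10 of 10) but almost never on top, so they are not the problem. parse is on top in 6 of 10 samples: that is the plateau, and the sentence you would write down is "60% of CPU is in parse, called from handle."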

Heap profiling: finding leaks and allocation

For memory problems (chapter 08), two moves:

Allocation profiling (who creates garbage) - JFR's allocation view, or async-profiler in alloc mode. Shows which methods allocate the most. A method allocating millions of Integers (boxing) or temporary strings lights up here.

Heap dumps (what's retained) - for leaks, capture a snapshot of every live object:

jcmd <pid> GC.heap_dump heap.hprof
# or automatically on OOM:
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=. -jar app.jar

Open heap.hprof in Eclipse MAT (Memory Analyzer Tool) or VisualVM. The questions it answers (chapter 08's leak diagnosis):

  • "What's using the most memory?" - the dominator tree shows the biggest retainers. "2 GB of byte[]."
  • "What's keeping it alive?" - the path to GC root. "These byte[]s are retained by Cache.entries, a HashMap, held by a static field." That sentence is the leak, solved. (Eclipse MAT's "Leak Suspects" report often points right at it.)

The path-to-GC-root is the single most valuable thing a heap dump gives you. A leak is a still-reachable object (chapter 08); the path shows you exactly which reference chain to break.
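As an illustration of the reference chain described above (static field -> HashMap -> byte[]), the sketch below shows the leaking shape next to one possible way of breaking the chain: bounding the map so old entries become unreachable. The names CacheLeakSketch, LEAKY and boundedCache are hypothetical, the 1 KB payload is arbitrary, and an LRU bound is only one of several fixes. Neither map is thread-safe.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class CacheLeakSketch {

    /** The leak shape: a static field holds a HashMap whose byte[] values are never removed. */
    static final Map<String, byte[]> LEAKY = new HashMap<>();

    /** One possible fix: an access-ordered map that evicts its eldest entry past maxEntries. */
    static Map<String, byte[]> boundedCache(int maxEntries) {
        return new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;    // evicted entries lose their path to a GC root
            }
        };
    }

    /** Puts `entries` 1 KB values into the cache under distinct keys. */
    static void fill(Map<String, byte[]> cache, int entries) {
        for (int i = 0; i < entries; i++) {
            cache.put("key-" + i, new byte[1024]);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> bounded = boundedCache(100);
        fill(LEAKY, 10_000);
        fill(bounded, 10_000);
        System.out.println("leaky size:   " + LEAKY.size());
        System.out.println("bounded size: " + bounded.size());
    }
}
```

In a heap dump, LEAKY would show up as the retainer of every byte[] ever inserted. The bounded map retains at most 100 of them, and the rest can be collected.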

JMH: benchmarking that doesn't lie

When you need to compare two implementations precisely - is this optimization actually faster? - use JMH (Java Microbenchmark Harness), the standard tool. It handles JIT warmup, prevents dead-code elimination, runs multiple forks for statistical validity, and reports results with error bars. Naive nanoTime benchmarking gives wrong answers; JMH gives trustworthy ones.

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 1)        // run 3 warmup rounds so the JIT compiles the code
@Measurement(iterations = 5, time = 1)   // then 5 measured rounds
@Fork(2)                                  // in 2 separate JVMs (catches JVM-specific flukes)
public class StringBench {

    @Param({"100", "10000"})
    int n;

    @Benchmark
    public String concat() {              // the slow way
        String s = "";
        for (int i = 0; i < n; i++) s += i;
        return s;                         // RETURN it - prevents dead-code elimination
    }

    @Benchmark
    public String builder() {             // the fast way
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(i);
        return sb.toString();
    }
}

Key annotations (the things naive timing misses):

  • @Warmup - runs the code untimed first, so the JIT compiles it before you measure steady-state performance.
  • @Measurement - the actual timed runs.
  • @Fork - runs in separate JVMs to catch JVM-specific anomalies.
  • Returning the result - JMH consumes returned values so the JIT can't delete "useless" computation.

Run it (in a project set up from the JMH Maven archetype: mvn package, then java -jar target/benchmarks.jar StringBench) and you get a table with scores and error bars:

Benchmark              (n)  Mode  Cnt     Score    Error  Units
StringBench.builder    100  avgt   10     0.412 ±   0.02  us/op
StringBench.concat     100  avgt   10     2.140 ±   0.11  us/op
StringBench.builder  10000  avgt   10    38.5   ±   1.2   us/op
StringBench.concat   10000  avgt   10  9821.0   ± 210     us/op   <- O(n^2) blowup, proven

The error bars matter: if two scores' ranges overlap, you can't conclude the difference is real - it may just be noise. JMH's discipline is what makes a benchmark trustworthy instead of a number you fooled yourself with.
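The overlap check is plain interval arithmetic, sketched here with a hypothetical helper (ErrorBars.overlaps is not part of JMH). It treats each result as the closed range from Score - Error to Score + Error and asks whether the two ranges share any point.

```java
final class ErrorBars {

    /** True when [scoreA - errA, scoreA + errA] and [scoreB - errB, scoreB + errB] share a point. */
    static boolean overlaps(double scoreA, double errA, double scoreB, double errB) {
        return scoreA - errA <= scoreB + errB && scoreB - errB <= scoreA + errA;
    }

    public static void main(String[] args) {
        // builder vs concat at n=100, values taken from the table above
        System.out.println(overlaps(0.412, 0.02, 2.140, 0.11));   // prints false
        // two made-up results that are too close to call
        System.out.println(overlaps(10.0, 1.0, 11.5, 1.0));       // prints true
    }
}
```

For the n=100 rows the ranges are 0.392..0.432 and 2.03..2.25, which are far apart, so the builder's advantage is not noise. In the second call the ranges are 9..11 and 10.5..12.5, which overlap, so that comparison would need more iterations or forks before drawing a conclusion.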

Thread dumps: diagnosing hangs and deadlocks

When a program hangs (not slow - frozen), you need a thread dump - a snapshot of what every thread is doing right now. This is the tool for chapter 10's deadlocks and for "why is my app stuck."

jstack <pid>                      # print all thread stacks
jcmd <pid> Thread.print           # same thing, modern command
# or press Ctrl-\ (SIGQUIT on Linux/macOS; Ctrl-Break on Windows) in the terminal running the JVM

The dump lists every thread and its current stack. What you look for:

  • Deadlock - the JVM detects and explicitly reports it: Found one Java-level deadlock: followed by the threads and the locks they're each holding/waiting for. The exact cycle from chapter 10, named for you. This is the fastest way to confirm and locate a deadlock.
  • A thread stuck in BLOCKED state - waiting on a lock another thread holds (contention or deadlock).
  • Many threads in the same method - a hot spot or a bottleneck where everything queues up.
  • Threads in WAITING/TIMED_WAITING - parked (often normal for pool threads idle between tasks; suspicious if a thread you expect to be working is parked).

Take two or three dumps a few seconds apart: if a thread is on the same line in all of them, it's stuck there (a hang); if threads move between dumps, work is progressing (maybe just slow).
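The same information is available from inside the JVM, which can be useful for a health-check endpoint or a watchdog. The sketch below is an assumption-laden miniature, not what jstack does internally: ThreadDumpSketch is an invented name, the output format is simplified, and it omits the lock details a real dump includes. It uses Thread.getAllStackTraces() for the stacks and ThreadMXBean.findDeadlockedThreads() for the deadlock check.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Map;

final class ThreadDumpSketch {

    /** Builds a plain-text dump: name and state, then the stack, for every live thread. */
    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append("\" ")
               .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    /** Number of threads stuck in a lock cycle; 0 when the JVM finds none. */
    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads();   // null means no deadlock was detected
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.print(dump());
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```

Calling dump() twice a few seconds apart and comparing the output is the in-process version of the "take two or three dumps" advice: a thread whose stack is identical both times is a candidate for being stuck.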

The profiling workflow

Putting it together - the loop you run when something is slow or broken:

  1. Reproduce it under realistic load (a slow path needs traffic to show up; profiling an idle app shows nothing).
  2. Pick the tool for the symptom:
     • Slow (high CPU) -> JFR CPU profile, read the flame graph for hot methods.
     • Growing memory / OOM -> heap dump, find the dominator and path-to-GC-root.
     • High GC / churn -> JFR allocation profile, find who allocates.
     • Hung / frozen -> thread dump, look for deadlock or blocked threads.
     • "Is change X faster?" -> JMH benchmark of X vs the original.
  3. Find the one dominant cause - usually one method or one reference chain accounts for most of the problem.
  4. Fix that (using chapters 05, 08, 10, 12 - the right data structure, breaking the leak chain, reducing contention, killing the allocation).
  5. Re-measure to confirm it actually helped and the bottleneck moved (it always moves to the next thing - stop when you're at "fast enough").

Notice this is chapter 12's mindset made concrete. The hard part of performance work isn't fixing - it's finding, and these tools are how you find.

Try it

  1. Crude timing first. Take chapter 12's boxing example (Long vs long sum). Time both with nanoTime. The difference is huge enough that even crude timing shows it. This is the case where quick timing is legitimately enough.

  2. Record with JFR. Run any non-trivial program (or one with a deliberate hot loop) with -XX:StartFlightRecording=duration=30s,filename=run.jfr. Open run.jfr in JDK Mission Control. Find the hot-methods view. Identify the widest method. Did you guess right beforehand? (Usually not - that's the lesson.)

  3. Read a flame graph. In JMC (or IntelliJ's profiler), open the flame graph for a CPU-heavy run. Find the widest plateau. Click it, trace its callers. Write one sentence: "X% of CPU is in method M, called from C." That sentence is a profiling result.

  4. Capture a heap leak. Run chapter 08's unbounded-cache leak with -Xmx128m -XX:+HeapDumpOnOutOfMemoryError. When it OOMs, open the .hprof in Eclipse MAT. Run "Leak Suspects." Confirm it points at the HashMap/byte[] and names the static field retaining it. You just diagnosed a leak the way professionals do.

  5. Write a JMH benchmark. Set up the StringBench above (add the JMH Maven dependencies: jmh-core plus the jmh-generator-annprocess annotation processor). Run it. Confirm concat is dramatically slower at n=10000 and that the error bars don't overlap. Then benchmark ArrayList vs HashSet contains (chapter 12). Real numbers, properly measured.

  6. Catch a deadlock in a dump. Run chapter 10's deadlocking transfer. While it's hung, run jstack <pid>. Find the "Found one Java-level deadlock" section. Read which thread holds which lock and waits for which. Then apply the ordered-lock fix and confirm the dump no longer reports a deadlock.

What you might wonder

"JFR vs async-profiler vs VisualVM vs IntelliJ profiler - which?" They overlap. JFR is built into the JDK, low-overhead, production-safe - the default. JDK Mission Control is the GUI for JFR files. async-profiler is a popular open-source sampling profiler with excellent flame graphs (and it samples native code JFR can miss). VisualVM is a free all-rounder (profiling + heap dumps). IntelliJ's profiler wraps JFR/async-profiler in the IDE - the most convenient for development. Start with whatever's in front of you; they answer the same questions.

"Is it safe to profile in production?" JFR yes - it's designed for it (~1% overhead, always-on is a legitimate strategy). Heap dumps briefly pause the app (they stop the world to snapshot) and produce large files, so do them deliberately, not casually. Thread dumps are cheap and safe. Heavy instrumenting profilers (older ones) can have high overhead - prefer sampling profilers (JFR, async-profiler) in production.

"Do I need JMH for everyday performance checks?" Only when comparing implementations precisely (is A faster than B?) and the difference might be subtle. For "is this whole feature fast enough," profile the running app instead. JMH is for microbenchmarks - small, isolated pieces of code where naive timing would lie. Don't JMH a whole application; profile it.

"The flame graph shows my hot method is in a library I can't change. Now what?" Then the fix is to call it less, not make it faster. If 60% of time is in regex.compile, the fix is "compile the pattern once and reuse it" (it was being recompiled per call), not optimizing the regex engine. Profiling tells you where time goes; the fix is often "do this expensive thing fewer times," which is in your code (caching, pre-computing, batching).

"How do I profile a problem I can't reproduce locally?" This is why production-safe profiling matters. Enable always-on JFR in production; when the problem happens, you have the recording. For OOMs, -XX:+HeapDumpOnOutOfMemoryError captures the dump automatically when it crashes. The discipline is "have the recorder running before the problem happens," because intermittent production issues won't reproduce on demand.

"What's a 'sampling' vs 'instrumenting' profiler?" A sampling profiler periodically checks what each thread is doing (cheap, low-overhead, statistically accurate for hot spots - JFR, async-profiler). An instrumenting profiler injects timing code into every method (precise per-method counts but high overhead, can distort results and slow the app a lot). For finding hot spots, sampling is almost always the right choice.

Done

  • You know why crude nanoTime timing lies (JIT warmup, dead-code elimination, GC noise) and where it's still useful.
  • You can capture and read a JFR recording: hot-methods/flame graph for CPU, allocation view for garbage, GC view for pressure.
  • You can read a flame graph - scan for the widest boxes, trace callers.
  • You can capture a heap dump and find a leak via the dominator tree and path-to-GC-root.
  • You can write a trustworthy JMH benchmark (warmup, forks, return values, error bars).
  • You can take a thread dump and spot deadlocks and blocked threads.
  • You have the profiling workflow: reproduce, pick the tool for the symptom, find the one cause, fix, re-measure.

Next: testing at the next level - mocking, parameterized tests, and testing the concurrent code you now write.
