Appendix A - Production Hardening

The tools and recipes that turn a working JVM service into one you can debug at 3 AM. This appendix is referenced from every month's "Production Hardening Slice." By the end of Month 6, the hardening/ directory in your repo should contain runnable versions of everything below.


A.1 The Diagnostic Toolbox

Tool, where it lives, and when to reach for it:

  • jcmd (JDK bin/) - Always-on Swiss-army knife: heap dump, JFR start/stop, thread dump, GC commands, flag inspection.
  • jstack (JDK bin/) - One-shot thread dump. Usually jcmd <pid> Thread.print is cleaner.
  • jmap (JDK bin/) - Heap dumps. Largely subsumed by jcmd <pid> GC.heap_dump.
  • jhsdb (JDK bin/) - Serviceability agent for post-mortem core dump analysis. Last resort but invaluable.
  • jfr CLI (JDK bin/) - Process JFR recordings without JMC: jfr print --events ... file.jfr.
  • JDK Mission Control (JMC) (separate download) - JFR GUI. Read recordings produced by jcmd JFR.dump.
  • async-profiler (github.com/async-profiler) - CPU, alloc, lock, wall-clock sampling. Flame graphs. The single most useful add-on tool for the JVM.
  • Eclipse MAT (eclipse.org/mat) - Heap-dump analysis. Dominator tree + leak suspects.
  • JMH (openjdk.org/projects/code-tools/jmh) - Microbenchmarks. The only correct way.
  • jcstress (openjdk.org/projects/code-tools/jcstress) - JMM stress tests for concurrent code.
  • GCViewer / gceasy.io (third-party) - GC log visualization.
  • VisualVM (visualvm.github.io) - Older, lighter GUI for live monitoring. Useful for dev.
  • hsdis (OpenJDK extension) - Disassembly plugin for -XX:+PrintAssembly.

Install them all. Know which to reach for in 30 seconds.


A.2 The Always-On JVM Flags

A defensible default set for a containerized service:

# Memory
-XX:MaxRAMPercentage=75.0
-XX:InitialRAMPercentage=75.0
-XX:MaxMetaspaceSize=256m
-XX:MaxDirectMemorySize=256m
-XX:ReservedCodeCacheSize=256m

# GC (pick one)
-XX:+UseG1GC
# or for latency-sensitive multi-GB heaps:
# -XX:+UseZGC

# Diagnostics
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/heap.hprof
-XX:+ExitOnOutOfMemoryError
-Xlog:gc*=info,safepoint=info:file=/var/log/gc.log:time,uptime,level,tags:filecount=10,filesize=10M

# JFR - continuous recording
-XX:StartFlightRecording=filename=/var/log/jfr/app.jfr,maxsize=200M,maxage=24h,settings=profile,dumponexit=true

# Compact object headers (JDK 24+, when stable in your JDK)
# -XX:+UseCompactObjectHeaders

# Modern JIT
# -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler   (Graal as C2)

Every flag has a justification comment in your hardening/ template. No copy-paste-and-forget.


A.3 The Continuous-JFR Pattern

JFR has negligible overhead (~1% by default) and is the gold-standard production profiler. Two ways to run it:

  1. Boot-time: the -XX:StartFlightRecording=... flag above.
  2. On-demand: jcmd <pid> JFR.start name=adhoc duration=120s filename=/tmp/adhoc.jfr settings=profile.

Pattern: keep a rotating buffer always on; when an alert fires, snapshot the buffer (jcmd <pid> JFR.dump filename=/tmp/snapshot.jfr, adding name=... if you only want one specific recording) and ship it to S3 (or equivalent). Now you can diagnose a transient incident from the recording, not from speculation.

Events worth alerting on:

  • jdk.GCPauseTime > 200ms (G1) or > 10ms (ZGC).
  • jdk.OldObjectSample growth (slow leak).
  • jdk.VirtualThreadPinned > 20ms (Loom pinning).
  • jdk.CPULoad sustained > 0.9.
  • jdk.SocketRead duration p99 above your SLO.
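
The same events can also be watched in-process with JFR event streaming (jdk.jfr.consumer.RecordingStream, JDK 14+; jdk.VirtualThreadPinned needs JDK 21+). A minimal sketch, with thresholds taken from the list above and the class name purely illustrative - wire the handlers into your metrics or alerting instead of printing:

import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

// Minimal in-process JFR event streaming sketch. Thresholds mirror the alerting
// list above; replace System.err with your metrics or alerting hook.
public final class JfrWatcher {
    public static void main(String[] args) {
        try (RecordingStream rs = new RecordingStream()) {
            rs.enable("jdk.VirtualThreadPinned").withThreshold(Duration.ofMillis(20));
            rs.onEvent("jdk.VirtualThreadPinned", event ->
                    System.err.println("Virtual thread pinned for " + event.getDuration()));

            rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
            rs.onEvent("jdk.CPULoad", event -> {
                float machineTotal = event.getFloat("machineTotal");
                if (machineTotal > 0.9f) {
                    System.err.println("Machine CPU load: " + machineTotal);
                }
            });

            rs.start(); // blocks; use startAsync() inside a real service
        }
    }
}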


A.4 Heap-Dump Triage in Five Steps

When you have an .hprof file:

  1. Open in MAT. Choose "Leak Suspects" report. Read it first - it's right surprisingly often.
  2. Open the Dominator Tree, sorted by retained heap. The top five entries usually explain 80%+ of retained memory.
  3. For each suspicious dominator, right-click → "Path to GC Roots → exclude weak/soft references". This tells you why it's alive.
  4. If the retainer is a framework collection (HashMap, ConcurrentHashMap), open it and inspect the entries - usually the keys reveal the leak (e.g., per-request keys that never expire; a contrived sketch of this pattern follows the list).
  5. Fix, redeploy, re-dump after the same load to confirm.
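
For step 4, here is what that retainer often looks like in code - a contrived sketch, with the class name and sizes purely illustrative: a static, unbounded map keyed per request, which the dominator tree will point at directly.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Contrived leak pattern from step 4: a static, unbounded cache with per-request
// keys and no eviction. Every request adds an entry that is never removed, so the
// map's retained heap grows until the dump makes it obvious.
final class RequestCache {
    private static final Map<String, byte[]> CACHE = new ConcurrentHashMap<>();

    static void handleRequest() {
        String requestId = UUID.randomUUID().toString(); // unique key per request
        CACHE.put(requestId, new byte[64 * 1024]);       // never evicted: slow leak
    }
}

The fix is usually a bounded cache with eviction (or keying by something that genuinely recurs), then re-dumping under the same load to confirm, as in step 5.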

A.5 The CPU-Flame-Graph Pattern (async-profiler)

# Attach for 30s, sample CPU, emit interactive flamegraph
asprof -e cpu -d 30 -f cpu.html <pid>

# Allocation profile (TLAB + outside-TLAB)
asprof -e alloc -d 30 -f alloc.html <pid>

# Wall-clock (great for "where is my service waiting")
asprof -e wall -d 30 -f wall.html <pid>

# Lock contention
asprof -e lock -d 30 -f lock.html <pid>

Wall-clock flame graphs are underrated - they show you blocked time, which CPU profiles miss entirely. Run them whenever a service is "slow but the CPU is fine."


A.6 JMH Conventions

A defensible JMH suite (a skeleton class follows the list):

  • Separate Maven/Gradle module so the JMH annotation processor doesn't pollute your main artifact.
  • One class per benchmarked scenario, with @State(Scope.Benchmark) or Scope.Thread chosen deliberately.
  • @Fork(value = 3, jvmArgsAppend = {"-XX:+UseG1GC"}) minimum - three forks give you an estimate of run-to-run noise.
  • @Warmup(iterations = 5, time = 1), @Measurement(iterations = 10, time = 1) as baseline.
  • Always emit Mode.Throughput and Mode.AverageTime for the same benchmark - they reveal different things.
  • Profile with -prof gc to see allocation rate; -prof async:output=flamegraph for flame graphs.
  • CI gates on regression vs baseline, never absolute numbers (CI is too noisy for absolutes).
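
A minimal skeleton that follows these conventions; the scenario (String.join) and all names are illustrative, not a prescription:

package bench;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

// Skeleton benchmark class following the conventions above.
@BenchmarkMode({Mode.Throughput, Mode.AverageTime})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(value = 3, jvmArgsAppend = {"-XX:+UseG1GC"})
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
public class StringJoinBenchmark {

    @Param({"10", "1000"})
    int size;

    String[] parts;

    @Setup(Level.Trial)
    public void setUp() {
        parts = new String[size];
        for (int i = 0; i < size; i++) {
            parts[i] = "part-" + i;
        }
    }

    @Benchmark
    public void join(Blackhole bh) {
        bh.consume(String.join(",", parts)); // Blackhole prevents dead-code elimination
    }
}

The @Param axis doubles as a cheap sanity check: if average time does not scale roughly with size, something other than the join is dominating the measurement.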

A.7 The Pre-Production Checklist

Before any new service goes live, walk this list:

  • MaxRAMPercentage set; container request/limit aligned.
  • GC chosen with a documented reason.
  • Heap dump on OOM enabled, write path writable.
  • JFR continuous recording enabled, rotation configured.
  • GC logs to disk with rotation.
  • All ExecutorServices are bounded, named, and gracefully shut down (see the sketch after this list).
  • All ThreadLocals justified or replaced with ScopedValue.
  • OpenTelemetry traces flowing to the platform's collector.
  • Micrometer Prometheus endpoint exposed.
  • Logs JSON-structured, with trace-id correlation.
  • Liveness and readiness probes distinct, with sane timeouts.
  • SIGTERM triggers graceful shutdown; tested.
  • Dependencies pinned; SBOM generated (CycloneDX).
  • Security scan passes (Trivy/Grype on the container).
  • Runbook exists with top-5 alerts and mitigation steps.
  • One synthetic load test recorded as a baseline.

The list is the artifact. Put it in hardening/CHECKLIST.md and tick boxes by hand for every release until you automate it.
