Appendix A - Production Hardening
The tools and recipes that turn a working JVM service into one you can debug at 3 AM. This appendix is referenced from every month's "Production Hardening Slice." By the end of Month 6, the hardening/ directory in your repo should contain runnable versions of everything below.
A.1 The Diagnostic Toolbox
| Tool | Where it lives | When to reach for it |
|---|---|---|
| `jcmd` | JDK `bin/` | Always-on Swiss-army knife: heap dumps, JFR start/stop, thread dumps, GC commands, flag inspection. |
| `jstack` | JDK `bin/` | One-shot thread dump. Usually `jcmd <pid> Thread.print` is cleaner. |
| `jmap` | JDK `bin/` | Heap dumps. Largely subsumed by `jcmd <pid> GC.heap_dump`. |
| `jhsdb` | JDK `bin/` | Serviceability agent - post-mortem core dump analysis. Last resort but invaluable. |
| `jfr` (CLI) | JDK `bin/` | Process JFR recordings without JMC: `jfr print --events ... file.jfr`. |
| JDK Mission Control (JMC) | Separate download | JFR GUI. Read recordings produced by `jcmd JFR.dump`. |
| async-profiler | github.com/async-profiler | CPU, alloc, lock, wall-clock sampling. Flame graphs. The single most useful add-on tool for the JVM. |
| Eclipse MAT | eclipse.org/mat | Heap-dump analysis. Dominator tree + leak suspects. |
| JMH | openjdk.org/projects/code-tools/jmh | Microbenchmarks. The only correct way. |
| `jcstress` | openjdk.org/projects/code-tools/jcstress | JMM stress tests for concurrent code. |
| GCViewer / gceasy.io | Third-party | GC log visualization. |
| VisualVM | visualvm.github.io | Older, lighter GUI for live monitoring. Useful for dev. |
| `hsdis` | OpenJDK extension | Disassembly plugin for `-XX:+PrintAssembly`. |
Install them all. Know which to reach for in 30 seconds.
A.2 The Always-On JVM Flags
A defensible default set for a containerized service:
```
# Memory
-XX:MaxRAMPercentage=75.0
-XX:InitialRAMPercentage=75.0
-XX:MaxMetaspaceSize=256m
-XX:MaxDirectMemorySize=256m
-XX:ReservedCodeCacheSize=256m

# GC (pick one)
-XX:+UseG1GC
# or for latency-sensitive multi-GB heaps:
# -XX:+UseZGC

# Diagnostics
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/heap.hprof
-XX:+ExitOnOutOfMemoryError
-Xlog:gc*=info,safepoint=info:file=/var/log/gc.log:time,uptime,level,tags:filecount=10,filesize=10M

# JFR - continuous recording
-XX:StartFlightRecording=filename=/var/log/jfr/app.jfr,maxsize=200M,maxage=24h,settings=profile,dumponexit=true

# Compact object headers (JDK 24+, when stable in your JDK)
# -XX:+UseCompactObjectHeaders

# Modern JIT
# -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler (Graal as C2)
```
Every flag has a justification comment in your hardening/ template. No copy-paste-and-forget.
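One way to keep "no copy-paste-and-forget" honest is a startup self-check that fails fast when a required flag is missing. A minimal sketch, assuming the standard `com.sun.management` and `java.lang.management` APIs; the `RequiredFlags` class name is made up for illustration:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical startup check: fail fast if the always-on flags are missing. */
public final class RequiredFlags {

    public static void verify() {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        // Boolean -XX flags can be read back by name.
        for (String flag : List.of("HeapDumpOnOutOfMemoryError", "ExitOnOutOfMemoryError")) {
            if (!Boolean.parseBoolean(hotspot.getVMOption(flag).getValue())) {
                throw new IllegalStateException("Missing required JVM flag: -XX:+" + flag);
            }
        }

        // Options like -Xlog or -XX:StartFlightRecording only show up in the raw argument list.
        List<String> args = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (args.stream().noneMatch(a -> a.startsWith("-XX:StartFlightRecording"))) {
            throw new IllegalStateException("Continuous JFR recording is not enabled");
        }
    }

    public static void main(String[] args) {
        verify();
        System.out.println("All required JVM flags present.");
    }
}
```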
A.3 The Continuous-JFR Pattern
JFR has negligible overhead (~1% by default) and is the gold-standard production profiler. Two ways to run it:
- Boot-time: the `-XX:StartFlightRecording=...` flag above.
- On-demand: `jcmd <pid> JFR.start name=adhoc duration=120s filename=/tmp/adhoc.jfr settings=profile`.
Pattern: keep a rotating buffer always on; when an alert fires, snapshot the buffer (jcmd <pid> JFR.dump name=adhoc filename=/tmp/snapshot.jfr) and ship it to S3 (or equivalent). Now you can diagnose a transient incident from the recording, not from speculation.
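The same snapshot can also be taken from inside the process, which is convenient when the alert handler lives in the service itself. A minimal sketch, assuming the `jdk.jfr` API; the `JfrSnapshot` helper name is made up:

```java
import jdk.jfr.FlightRecorder;
import jdk.jfr.Recording;
import java.nio.file.Path;

/** Hypothetical in-process equivalent of `jcmd <pid> JFR.dump`. */
public final class JfrSnapshot {

    /** Dumps a copy of every running recording (e.g. the continuous one) to disk. */
    public static void dump(Path dir) throws Exception {
        for (Recording recording : FlightRecorder.getFlightRecorder().getRecordings()) {
            Path file = dir.resolve("snapshot-" + recording.getId() + ".jfr");
            recording.dump(file);   // writes the current buffer contents without stopping the recording
        }
        // Shipping the files to S3 (or equivalent) is left to your storage client.
    }
}
```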
Events worth alerting on:
- jdk.GCPauseTime > 200ms (G1) or > 10ms (ZGC).
- jdk.OldObjectSample growth (slow leak).
- jdk.VirtualThreadPinned > 20ms (Loom pinning).
- jdk.CPULoad sustained > 0.9.
- jdk.SocketRead duration p99 above your SLO.
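If you want those alerts raised in-process rather than from parsed recordings, JFR event streaming (`jdk.jfr.consumer.RecordingStream`, JDK 14+) is one option. A minimal sketch with placeholder thresholds and a made-up `JfrAlerts` class name:

```java
import jdk.jfr.consumer.RecordingStream;
import java.time.Duration;

/** Hypothetical watcher that reports when selected JFR events cross a threshold. */
public final class JfrAlerts {

    public static RecordingStream start() {
        RecordingStream rs = new RecordingStream();

        rs.enable("jdk.VirtualThreadPinned").withThreshold(Duration.ofMillis(20));
        rs.onEvent("jdk.VirtualThreadPinned",
                e -> System.err.println("Virtual thread pinned for " + e.getDuration()));

        rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
        rs.onEvent("jdk.CPULoad", e -> {
            if (e.getFloat("machineTotal") > 0.9f) {
                System.err.println("High machine CPU load: " + e.getFloat("machineTotal"));
            }
        });

        rs.startAsync();   // runs on a background thread
        return rs;         // keep the reference and close() it on shutdown
    }
}
```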
A.4 Heap-Dump Triage in Five Steps
When you have an `.hprof` file:

1. Open it in MAT and choose the "Leak Suspects" report. Read it first - it's right surprisingly often.
2. Open the Dominator Tree, sorted by retained heap. The top 5 entries explain 80%+ of memory.
3. For each suspicious dominator, right-click → "Path to GC Roots → exclude weak/soft references". This tells you why it's alive.
4. If the retainer is a framework collection (`HashMap`, `ConcurrentHashMap`), open it and inspect the entries - usually the keys reveal the leak (e.g., per-request keys that never expire).
5. Fix, redeploy, re-dump after the same load to confirm.
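Producing the dump is usually `jcmd <pid> GC.heap_dump`, but a service can also write one programmatically, e.g. behind an admin-only endpoint. A minimal sketch using the HotSpot diagnostic MXBean; the `HeapDumps` class name and output path are placeholders:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: write a heap dump from inside the process. */
public final class HeapDumps {

    public static void dump(String path) throws java.io.IOException {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live=true dumps only reachable objects, which is what MAT triage wants.
        hotspot.dumpHeap(path, true);
    }

    public static void main(String[] args) throws Exception {
        dump("/tmp/triage.hprof");   // the file must not already exist
    }
}
```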
A.5 The CPU-Flame-Graph Pattern (async-profiler)
```
# Attach for 30s, sample CPU, emit interactive flamegraph
asprof -e cpu -d 30 -f cpu.html <pid>

# Allocation profile (TLAB + outside-TLAB)
asprof -e alloc -d 30 -f alloc.html <pid>

# Wall-clock (great for "where is my service waiting")
asprof -e wall -d 30 -f wall.html <pid>

# Lock contention
asprof -e lock -d 30 -f lock.html <pid>
```
Wall-clock flame graphs are underrated - they show you blocked time, which CPU profiles miss entirely. Run them whenever a service is "slow but the CPU is fine."
A.6 JMH Conventions
A defensible JMH suite (a minimal skeleton follows this list):

- Separate Maven/Gradle module so the JMH annotation processor doesn't pollute your main artifact.
- One class per benchmarked scenario, with `@State(Scope.Benchmark)` or `Scope.Thread` chosen deliberately.
- `@Fork(value = 3, jvmArgsAppend = {"-XX:+UseG1GC"})` minimum - three forks give a noise estimate.
- `@Warmup(iterations = 5, time = 1)`, `@Measurement(iterations = 10, time = 1)` as a baseline.
- Always emit `Mode.Throughput` and `Mode.AverageTime` for the same benchmark - they reveal different things.
- Profile with `-prof gc` to see allocation rate; `-prof async:output=flamegraph` for flame graphs.
- CI gates on regression vs. a baseline, never on absolute numbers (CI is too noisy for absolutes).
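A skeleton that follows these conventions might look like the sketch below; the benchmarked scenario, package, and class name are placeholders:

```java
package bench;   // placeholder package

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@Fork(value = 3, jvmArgsAppend = {"-XX:+UseG1GC"})
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@BenchmarkMode({Mode.Throughput, Mode.AverageTime})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class StringJoinBenchmark {   // hypothetical scenario

    private String[] parts;

    @Setup(Level.Trial)
    public void setUp() {
        parts = new String[64];
        for (int i = 0; i < parts.length; i++) {
            parts[i] = Long.toString(ThreadLocalRandom.current().nextLong());
        }
    }

    @Benchmark
    public String join() {
        return String.join(",", parts);   // returned value is consumed by JMH's blackhole
    }
}
```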
A.7 The Pre-Production Checklist
Before any new service goes live, walk this list:
- `MaxRAMPercentage` set; container request/limit aligned.
- GC chosen with a documented reason.
- Heap dump on OOM enabled, write path writable.
- JFR continuous recording enabled, rotation configured.
- GC logs to disk with rotation.
- All `ExecutorService`s are bounded, named, and gracefully shut down.
- All `ThreadLocal`s justified or replaced with `ScopedValue`.
- OpenTelemetry traces flowing to the platform's collector.
- Micrometer Prometheus endpoint exposed.
- Logs JSON-structured, with trace-id correlation.
- Liveness and readiness probes distinct, with sane timeouts.
- `SIGTERM` triggers graceful shutdown; tested.
- Dependencies pinned; SBOM generated (CycloneDX).
- Security scan passes (Trivy/Grype on the container).
- Runbook exists with top-5 alerts and mitigation steps.
- One synthetic load test recorded as a baseline.
The list is the artifact. Put it in hardening/CHECKLIST.md and tick boxes by hand for every release until you automate it.
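Two of the items above - bounded, named executors and a tested `SIGTERM` path - are worth keeping around as reference code. A minimal sketch assuming a plain JDK 21+ service; the class and thread names are made up:

```java
import java.util.concurrent.*;

/** Hypothetical skeleton showing a bounded, named pool and a graceful-shutdown hook. */
public final class ServiceBootstrap {

    public static void main(String[] args) {
        // Bounded queue + named threads: the pool pushes back under overload instead of hiding it.
        ThreadFactory factory = Thread.ofPlatform().name("worker-", 0).factory();
        ExecutorService workers = new ThreadPoolExecutor(
                8, 8, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1_000),
                factory,
                new ThreadPoolExecutor.CallerRunsPolicy());

        // SIGTERM arrives as a shutdown hook: stop intake, then drain in-flight work.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            workers.shutdown();
            try {
                if (!workers.awaitTermination(30, TimeUnit.SECONDS)) {
                    workers.shutdownNow();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                workers.shutdownNow();
            }
        }, "shutdown-hook"));

        // ... start the server and submit work to `workers` ...
    }
}
```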