Appendix A - Production Hardening

The tools and recipes that turn a working JVM service into one you can debug at 3 AM. This appendix is referenced from every month's "Production Hardening Slice." By the end of Month 6, the hardening/ directory in your repo should contain runnable versions of everything below.


A.1 The Diagnostic Toolbox

Tool, where it lives, and when to reach for it:

  • jcmd (JDK bin/) - Always-on Swiss-army knife: heap dump, JFR start/stop, thread dump, GC commands, flag inspection.
  • jstack (JDK bin/) - One-shot thread dump. Usually jcmd <pid> Thread.print is cleaner.
  • jmap (JDK bin/) - Heap dumps. Largely subsumed by jcmd <pid> GC.heap_dump.
  • jhsdb (JDK bin/) - Serviceability agent for post-mortem core dump analysis. Last resort but invaluable.
  • jfr CLI (JDK bin/) - Process JFR recordings without JMC: jfr print --events ... file.jfr.
  • JDK Mission Control (JMC) (separate download) - JFR GUI. Read recordings produced by jcmd JFR.dump.
  • async-profiler (github.com/async-profiler) - CPU, alloc, lock, wall-clock sampling. Flame graphs. The single most useful add-on tool for the JVM.
  • Eclipse MAT (eclipse.org/mat) - Heap-dump analysis. Dominator tree + leak suspects.
  • JMH (openjdk.org/projects/code-tools/jmh) - Microbenchmarks. The only correct way.
  • jcstress (openjdk.org/projects/code-tools/jcstress) - JMM stress tests for concurrent code.
  • GCViewer / gceasy.io (third-party) - GC log visualization.
  • VisualVM (visualvm.github.io) - Older, lighter GUI for live monitoring. Useful for dev.
  • hsdis (OpenJDK extension) - Disassembly plugin for -XX:+PrintAssembly.

Install them all. Know which to reach for in 30 seconds.


A.2 The Always-On JVM Flags

A defensible default set for a containerized service:

# Memory
-XX:MaxRAMPercentage=75.0
-XX:InitialRAMPercentage=75.0
-XX:MaxMetaspaceSize=256m
-XX:MaxDirectMemorySize=256m
-XX:ReservedCodeCacheSize=256m

# GC (pick one)
-XX:+UseG1GC
# or for latency-sensitive multi-GB heaps:
# -XX:+UseZGC

# Diagnostics
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/heap.hprof
-XX:+ExitOnOutOfMemoryError
-Xlog:gc*=info,safepoint=info:file=/var/log/gc.log:time,uptime,level,tags:filecount=10,filesize=10M

# JFR - continuous recording
-XX:StartFlightRecording=filename=/var/log/jfr/app.jfr,maxsize=200M,maxage=24h,settings=profile,dumponexit=true

# Compact object headers (JDK 24+, when stable in your JDK)
# -XX:+UseCompactObjectHeaders

# Modern JIT
# -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler   (Graal as C2)

Every flag has a justification comment in your hardening/ template. No copy-paste-and-forget.


A.3 The Continuous-JFR Pattern

JFR has negligible overhead (~1% by default) and is the gold-standard production profiler. Two ways to run it:

  1. Boot-time: the -XX:StartFlightRecording=... flag above.
  2. On-demand: jcmd <pid> JFR.start name=adhoc duration=120s filename=/tmp/adhoc.jfr settings=profile.

Pattern: keep a rotating buffer always on; when an alert fires, snapshot the buffer (jcmd <pid> JFR.dump filename=/tmp/snapshot.jfr, adding name=... if you only want one specific recording) and ship it to S3 (or equivalent). Now you can diagnose a transient incident from the recording, not from speculation.

Events worth alerting on:

  • jdk.GCPauseTime > 200ms (G1) or > 10ms (ZGC).
  • jdk.OldObjectSample growth (slow leak).
  • jdk.VirtualThreadPinned > 20ms (Loom pinning).
  • jdk.CPULoad sustained > 0.9.
  • jdk.SocketRead duration p99 above your SLO.
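
The same events can also be watched in-process with JFR event streaming (jdk.jfr.consumer.RecordingStream, JDK 14+; jdk.VirtualThreadPinned needs JDK 21+). A minimal sketch, with thresholds taken from the list above and the class name purely illustrative - wire the handlers into your metrics or alerting instead of printing:

import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

// Minimal in-process JFR event streaming sketch. Thresholds mirror the alerting
// list above; replace System.err with your metrics or alerting hook.
public final class JfrWatcher {
    public static void main(String[] args) {
        try (RecordingStream rs = new RecordingStream()) {
            rs.enable("jdk.VirtualThreadPinned").withThreshold(Duration.ofMillis(20));
            rs.onEvent("jdk.VirtualThreadPinned", event ->
                    System.err.println("Virtual thread pinned for " + event.getDuration()));

            rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
            rs.onEvent("jdk.CPULoad", event -> {
                float machineTotal = event.getFloat("machineTotal");
                if (machineTotal > 0.9f) {
                    System.err.println("Machine CPU load: " + machineTotal);
                }
            });

            rs.start(); // blocks; use startAsync() inside a real service
        }
    }
}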


A.4 Heap-Dump Triage in Five Steps

When you have an .hprof file:

  1. Open in MAT. Choose "Leak Suspects" report. Read it first - it's right surprisingly often.
  2. Open the Dominator Tree, sorted by retained heap. The top five entries usually explain 80%+ of retained memory.
  3. For each suspicious dominator, right-click → "Path to GC Roots → exclude weak/soft references". This tells you why it's alive.
  4. If the retainer is a framework collection (HashMap, ConcurrentHashMap), open it and inspect the entries - usually the keys reveal the leak (e.g., per-request keys that never expire; a contrived sketch of this pattern follows the list).
  5. Fix, redeploy, re-dump after the same load to confirm.
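
For step 4, here is what that retainer often looks like in code - a contrived sketch, with the class name and sizes purely illustrative: a static, unbounded map keyed per request, which the dominator tree will point at directly.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Contrived leak pattern from step 4: a static, unbounded cache with per-request
// keys and no eviction. Every request adds an entry that is never removed, so the
// map's retained heap grows until the dump makes it obvious.
final class RequestCache {
    private static final Map<String, byte[]> CACHE = new ConcurrentHashMap<>();

    static void handleRequest() {
        String requestId = UUID.randomUUID().toString(); // unique key per request
        CACHE.put(requestId, new byte[64 * 1024]);       // never evicted: slow leak
    }
}

The fix is usually a bounded cache with eviction (or keying by something that genuinely recurs), then re-dumping under the same load to confirm, as in step 5.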

A.5 The CPU-Flame-Graph Pattern (async-profiler)

# Attach for 30s, sample CPU, emit interactive flamegraph
asprof -e cpu -d 30 -f cpu.html <pid>

# Allocation profile (TLAB + outside-TLAB)
asprof -e alloc -d 30 -f alloc.html <pid>

# Wall-clock (great for "where is my service waiting")
asprof -e wall -d 30 -f wall.html <pid>

# Lock contention
asprof -e lock -d 30 -f lock.html <pid>

Wall-clock flame graphs are underrated - they show you blocked time, which CPU profiles miss entirely. Run them whenever a service is "slow but the CPU is fine."


A.6 JMH Conventions

A defensible JMH suite (a skeleton class follows the list):

  • Separate Maven/Gradle module so the JMH annotation processor doesn't pollute your main artifact.
  • One class per benchmarked scenario, with @State(Scope.Benchmark) or Scope.Thread chosen deliberately.
  • @Fork(value = 3, jvmArgsAppend = {"-XX:+UseG1GC"}) minimum - three forks give you an estimate of run-to-run noise.
  • @Warmup(iterations = 5, time = 1), @Measurement(iterations = 10, time = 1) as baseline.
  • Always emit Mode.Throughput and Mode.AverageTime for the same benchmark - they reveal different things.
  • Profile with -prof gc to see allocation rate; -prof async:output=flamegraph for flame graphs.
  • CI gates on regression vs baseline, never absolute numbers (CI is too noisy for absolutes).
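
A minimal skeleton that follows these conventions; the scenario (String.join) and all names are illustrative, not a prescription:

package bench;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

// Skeleton benchmark class following the conventions above.
@BenchmarkMode({Mode.Throughput, Mode.AverageTime})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(value = 3, jvmArgsAppend = {"-XX:+UseG1GC"})
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
public class StringJoinBenchmark {

    @Param({"10", "1000"})
    int size;

    String[] parts;

    @Setup(Level.Trial)
    public void setUp() {
        parts = new String[size];
        for (int i = 0; i < size; i++) {
            parts[i] = "part-" + i;
        }
    }

    @Benchmark
    public void join(Blackhole bh) {
        bh.consume(String.join(",", parts)); // Blackhole prevents dead-code elimination
    }
}

The @Param axis doubles as a cheap sanity check: if average time does not scale roughly with size, something other than the join is dominating the measurement.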

A.7 The Pre-Production Checklist

Before any new service goes live, walk this list:

  • MaxRAMPercentage set; container request/limit aligned.
  • GC chosen with a documented reason.
  • Heap dump on OOM enabled, write path writable.
  • JFR continuous recording enabled, rotation configured.
  • GC logs to disk with rotation.
  • All ExecutorServices are bounded, named, and gracefully shut down (see the sketch after this list).
  • All ThreadLocals justified or replaced with ScopedValue.
  • OpenTelemetry traces flowing to the platform's collector.
  • Micrometer Prometheus endpoint exposed.
  • Logs JSON-structured, with trace-id correlation.
  • Liveness and readiness probes distinct, with sane timeouts.
  • SIGTERM triggers graceful shutdown; tested.
  • Dependencies pinned; SBOM generated (CycloneDX).
  • Security scan passes (Trivy/Grype on the container).
  • Runbook exists with top-5 alerts and mitigation steps.
  • One synthetic load test recorded as a baseline.

The list is the artifact. Put it in hardening/CHECKLIST.md and tick boxes by hand for every release until you automate it.
