
Performance methodology

Why it matters

There's an important distinction between performance tools (covered in Observability) and performance methodology - the systematic process of answering "this is slow - why, and what do I change?"

The tools are well-documented per path. The methodology is what distinguishes engineers who fix performance from those who guess at it. This page is about the methodology.


The three rules of performance work

Before any tool:

  1. Measure first, hypothesize second, change third. The cardinal sin: changing code based on "I think this is slow." More often than not, you're wrong about which line matters.

  2. Locality is performance. Cache misses, page faults, branch mispredictions, GC pauses, network round-trips - the cost of all of these comes from "the thing you need isn't where you are right now." Most optimization is about improving locality.

  3. There's always a bottleneck. Removing a bottleneck reveals the next one. Knowing when to stop matters - at some point the bottleneck is "this is what the work fundamentally takes."


The USE method (Brendan Gregg)

For diagnosing system-level performance problems:

  • Utilization - how busy is each resource?
  • Saturation - how much work is queued waiting on each resource?
  • Errors - is each resource producing error events?

For every resource (CPU, memory, disk, network, GPU), check these three numbers. The common patterns:

  • High utilization + high saturation + no errors - the resource is genuinely maxed out: scale up.
  • Low utilization + high saturation - suggests serialization (one CPU pegged while 31 sit idle).
  • Low utilization + low saturation, but still slow - the work isn't bottlenecking on these resources; look elsewhere.

top, vmstat, iostat, mpstat, netstat give you the numbers on Linux. The harder skill is interpreting them.
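
As a tiny illustration of interpreting the saturation signal for the CPU resource, here is a minimal Go sketch (Linux-specific) that compares the 1-minute load average against the CPU count; the 1.5× threshold is an illustrative assumption, not a canonical value:

    // A rough CPU saturation check in the USE spirit (Linux-specific):
    // compare the 1-minute load average - roughly, runnable tasks - against
    // the number of CPUs. The 1.5x threshold is an assumption for illustration.
    package main

    import (
        "fmt"
        "os"
        "runtime"
        "strconv"
        "strings"
    )

    func main() {
        data, err := os.ReadFile("/proc/loadavg")
        if err != nil {
            panic(err)
        }
        load1, _ := strconv.ParseFloat(strings.Fields(string(data))[0], 64)
        cpus := float64(runtime.NumCPU())

        fmt.Printf("1-min load: %.2f, CPUs: %.0f\n", load1, cpus)
        switch {
        case load1 > cpus*1.5:
            fmt.Println("saturated: runnable work exceeds CPU capacity")
        case load1 > cpus:
            fmt.Println("borderline: work is starting to queue")
        default:
            fmt.Println("CPU is not the bottleneck by this signal; check the other resources")
        }
    }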


The RED method (for services)

For request-handling services:

  • Rate - requests per second.
  • Errors - failing requests per second.
  • Duration - request latency distribution (p50, p95, p99).

Three dashboards for every service. If any of these is bad, you have a problem. If all three are good but the system feels slow, look at the layers underneath (USE method on the underlying machines).

Most modern observability stacks (Prometheus + Grafana) ship default RED dashboards. See Observability for the per-language instrumentation.
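
For a sense of what the instrumentation looks like in code, here is a minimal Go sketch using the Prometheus client library; the metric names, the /work path, and the ports are illustrative choices, not requirements:

    // A minimal RED sketch for a Go HTTP service using the Prometheus client
    // (github.com/prometheus/client_golang).
    package main

    import (
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var (
        // Rate and Errors: one counter labeled by status code.
        requests = prometheus.NewCounterVec(
            prometheus.CounterOpts{Name: "http_requests_total", Help: "Requests by status code."},
            []string{"code"},
        )
        // Duration: a histogram, so p50/p95/p99 can be computed at query time.
        latency = prometheus.NewHistogram(prometheus.HistogramOpts{
            Name: "http_request_duration_seconds", Help: "Request latency.",
            Buckets: prometheus.DefBuckets,
        })
    )

    func instrumented(next http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()
            next(w, r)
            latency.Observe(time.Since(start).Seconds())
            requests.WithLabelValues("200").Inc() // a real middleware would capture the written status
        }
    }

    func main() {
        prometheus.MustRegister(requests, latency)
        http.HandleFunc("/work", instrumented(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok"))
        }))
        http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus; dashboards plot R, E, D from it
        http.ListenAndServe(":8080", nil)
    }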


Latency vs throughput

The two metrics, often confused:

  • Latency - time for a single request. Measured in milliseconds or microseconds.
  • Throughput - requests served per unit time. Measured in requests (or queries) per second.

They trade against each other. A queue between client and server can increase throughput (batch + amortize fixed costs) while increasing latency (waiting in the queue).
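
A toy model makes the trade-off visible. This Go sketch assumes a fixed per-flush overhead, a per-item cost, and a steady arrival rate (all three numbers are assumptions, not measurements); bigger batches amortize the fixed cost, raising throughput while the first item in each batch waits longer:

    // A toy model (assumed costs, not a benchmark): every flush pays a fixed
    // overhead plus a per-item cost, and items arrive at a steady rate.
    package main

    import "fmt"

    func main() {
        const (
            fixedPerCall = 1.0 // ms of overhead per flush (assumption)
            perItem      = 0.1 // ms of real work per item (assumption)
            arrivalGap   = 0.2 // ms between item arrivals (assumption)
        )
        for _, batch := range []float64{1, 10, 100} {
            callTime := fixedPerCall + perItem*batch // ms spent per flush
            throughput := batch / callTime * 1000    // items per second
            queueWait := (batch - 1) * arrivalGap    // worst case: first item waits for the batch to fill
            fmt.Printf("batch=%3.0f  throughput=%6.0f items/s  worst-case latency=%5.1f ms\n",
                batch, throughput, queueWait+callTime)
        }
    }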

Most performance work targets one or the other, not both. A latency-sensitive service (interactive UI, trading system) needs ms-level p99. A throughput-sensitive service (batch processing) can have seconds of latency if the work-per-second is high.

Be explicit about which you're optimizing. The trade-offs differ.


Latency at percentiles

Means are misleading. Always look at percentiles (p50, p95, p99, p99.9, p99.99).

A service with p50=10ms and p99=10s is broken for 1% of users. The mean might be ~110ms - sounds tolerable until you remember that 1-in-100 means a user gets a bad experience every ~100 requests.
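
A small Go sketch of the same effect on a synthetic distribution (95 fast requests, 5 slow ones - the numbers are illustrative); the naive sort-and-index percentile is enough to show the shape, while production code should use a histogram:

    // Synthetic distribution: 95 requests at 10 ms, 5 at 1 s.
    package main

    import (
        "fmt"
        "sort"
    )

    func percentile(sorted []float64, p float64) float64 {
        return sorted[int(p/100*float64(len(sorted)-1))]
    }

    func main() {
        var latencies []float64
        for i := 0; i < 95; i++ {
            latencies = append(latencies, 10) // fast requests
        }
        for i := 0; i < 5; i++ {
            latencies = append(latencies, 1000) // the slow tail
        }
        sort.Float64s(latencies)

        sum := 0.0
        for _, v := range latencies {
            sum += v
        }
        // Prints: mean=60 ms  p50=10 ms  p99=1000 ms - the mean hides the tail.
        fmt.Printf("mean=%.0f ms  p50=%.0f ms  p99=%.0f ms\n",
            sum/float64(len(latencies)), percentile(latencies, 50), percentile(latencies, 99))
    }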

The coordinated omission problem (Gil Tene): naive percentile measurement misses the latency of requests that were never sent because the previous one was still in flight. Real percentile measurement (HdrHistogram-style) accounts for this.

Look at p99 at a minimum, and p99.9 if your business cares about long tails (most do, even if they think they don't).


Amdahl's law and Gustafson's law

Two laws that bound parallelization speedup:

  • Amdahl's law: if a fraction s of a task is inherently sequential, parallelizing the rest gives at most a 1/s speedup, no matter how many cores you throw at it. A task that's 10% sequential maxes out at 10× speedup.

  • Gustafson's law: as problem size grows, the sequential portion's fraction shrinks. For workloads that scale with available resources (more data, more cores), speedups can exceed Amdahl's bound.

Practical takeaway: identify the serial bottleneck before parallelizing. Removing serialization (locks, single-coordinator nodes, shared state) often beats adding cores.
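
To make the bound concrete, a minimal Go sketch of Amdahl's formula, assuming a 10% serial fraction for illustration:

    // Amdahl's bound: speedup(N) = 1 / (s + (1-s)/N), where s is the serial
    // fraction. With s = 0.10 the curve flattens toward the 10x ceiling.
    package main

    import "fmt"

    func amdahl(s, cores float64) float64 {
        return 1 / (s + (1-s)/cores)
    }

    func main() {
        const s = 0.10 // 10% of the task is inherently serial (assumption for illustration)
        for _, n := range []float64{1, 2, 8, 32, 128, 1024} {
            fmt.Printf("cores=%5.0f  speedup=%5.2f  (ceiling %.0fx)\n", n, amdahl(s, n), 1/s)
        }
    }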


Profiling workflow

The pattern that works:

  1. Reproduce the slow scenario under realistic load. Synthetic benchmarks lie; production-shape load is the only truth.
  2. Profile under that load with a sampling profiler. CPU first (pprof/JFR/async-profiler/py-spy/perf - see Observability).
  3. Read the flame graph. Wide bars = where time is spent.
  4. Identify the top 3 hot functions. That's where to look.
  5. Drill in. Are those functions necessary? Could they be cached? Could they be batched? Could the caller call less often?
  6. Change one thing. Re-measure. If it didn't help, revert and try the next idea.

The single most common mistake: optimizing without re-measuring. Run the same benchmark before and after; compare.
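
One way to make "same benchmark before and after" concrete, in Go, is the built-in harness plus benchstat: run go test -bench=. -count=10 before and after the change, and compare the two outputs. The function below is a hypothetical stand-in for whatever the flame graph pointed at:

    // A before/after measurement loop with Go's built-in harness, in a
    // *_test.go file. buildKey is a hypothetical stand-in for the hot function.
    package hotpath

    import (
        "strings"
        "testing"
    )

    func buildKey(parts []string) string {
        var b strings.Builder
        for i, p := range parts {
            if i > 0 {
                b.WriteByte(':')
            }
            b.WriteString(p)
        }
        return b.String()
    }

    func BenchmarkBuildKey(b *testing.B) {
        parts := []string{"tenant", "user", "session", "request"}
        for i := 0; i < b.N; i++ {
            _ = buildKey(parts)
        }
    }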


When to stop

Three signals that you're at the natural limit:

  1. The flame graph shows the actual work, not overhead. If 70% of CPU is "your business logic" and 30% is "framework + GC + syscalls," there's not much fat left.
  2. The wall-clock time matches the work's physical or information-theoretic minimum. A 1GB transfer over a 1Gbps link takes at least 8 seconds; you can't beat that.
  3. Further optimization no longer pays for the engineering effort. A 5% improvement that takes 2 weeks of engineering rarely returns the investment.

Stop. Move on. Most systems have plenty of higher-leverage problems.


The lens, per path

Go - pprof + execution traces

Appendix A - Production Hardening. net/http/pprof for live profiling. runtime/trace for scheduler events. Read flame graphs with go tool pprof -http :8080.

Distinguishing feature: the standard library ships the profiler. Every Go service can be profiled in production with zero code changes.
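
Exposing the endpoint is conventionally just a blank import and a side port - a minimal sketch (port 6060 is a convention, not a requirement):

    // The blank import registers /debug/pprof/* handlers on the default mux.
    package main

    import (
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ endpoints
    )

    func main() {
        // Then, from a workstation:
        //   go tool pprof -http :8080 http://localhost:6060/debug/pprof/profile
        // collects a 30-second CPU profile and opens the flame graph UI.
        go http.ListenAndServe("localhost:6060", nil)

        // ... the rest of the service ...
        select {}
    }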

Java - JFR + JMC + async-profiler

Appendix A - Production Hardening. JFR for always-on, low-overhead recording. JMC for analysis. async-profiler for flame graphs. JMH for microbenchmarks.

Distinguishing feature: the deepest production profiling story of any path. Continuous JFR + post-incident snapshot analysis.

Rust - perf + flamegraph + criterion + cargo-bench

Appendix A - Production Hardening. Linux perf for system-level; flamegraph for CPU flames; criterion for benchmarks.

Distinguishing feature: the JIT-free model. Steady-state performance is reachable in microseconds; no warmup. The cost is hand-tuning when the compiler doesn't optimize the way you hoped.

Python - py-spy + scalene + cProfile + Pyinstrument

Appendix A - Production Hardening. py-spy for no-instrumentation sampling. scalene for line-level CPU + memory.

Distinguishing feature: the "attach to running process" story. py-spy connects to a live PID without restarting; invaluable in production.

Linux - perf, ftrace, eBPF, bpftrace

Appendix A - Hardening & Tuning. perf record + perf report. ftrace for kernel function tracing. eBPF for dynamic instrumentation across runtime boundaries.

Distinguishing feature: the cross-runtime view. perf and eBPF see everything - kernel, userspace, JIT-compiled, native code.

AI Systems - Nsight Compute, DCGM, PyTorch Profiler

Appendix A - Hardening & Observability. NVIDIA Nsight Compute for GPU kernel-level. DCGM for fleet-level GPU metrics. PyTorch Profiler for ML-framework-aware traces.

Distinguishing feature: the GPU dimension. "Why is my training slow?" needs memory-bandwidth analysis (chapter 12 of the deep dives - kernel fusion) on top of CPU profiling.


What to read first

  • You've never measured anything → start with the RED method on whatever service you operate. Three dashboards. Build the baseline before anything else.
  • You're optimizing a specific service → its language's Appendix A on hardening + the per-language profiler walkthrough.
  • You want to think about performance the right way → Brendan Gregg's Systems Performance book. Brendan's blog. Read Gil Tene's "Latency, the lies we tell ourselves" talk.
  • You operate at scale → the Observability cross-topic page for instrumentation, plus this page's methodology.