Week 23 - Performance Tuning: Profile, Tune, Re-Profile¶
23.1 Conceptual Core¶
- The discipline: measure, then change. Never optimize from intuition. The profilers (`pprof`, `trace`, `runtime/metrics`) are the source of truth.
- A working flow: capture a baseline profile, propose a hypothesis, change one thing, re-profile, accept or reject. Commit each accepted change with the before/after profile linked.
23.2 Mechanical Detail - The Tuning Toolkit¶
- CPU profile: `go tool pprof http://host/debug/pprof/profile?seconds=60`. Top-heavy stacks identify the hot path. Look for `runtime.mallocgc`, `runtime.scanobject`, `runtime.gcDrain` - these mean GC is the bottleneck.
- Heap profile: identify retention. `pprof -inuse_space` shows what's live; `-alloc_objects` shows churn.
- Block profile: `runtime.SetBlockProfileRate(1)`, then `pprof /debug/pprof/block`. Identifies channel/syscall waits.
- Mutex profile: `runtime.SetMutexProfileFraction(1)`, then `pprof /debug/pprof/mutex`. Identifies contended locks.
- Goroutine profile: stack distribution. Sudden growth = leak.
- Execution trace: `go tool trace`. The most expensive but most informative tool. Identifies scheduler latencies, GC pauses, and goroutine-state transitions.
- PGO (Profile-Guided Optimization): stable since Go 1.21. Capture a representative CPU profile in production, place it as `default.pgo`, rebuild. ~5-10% throughput win on hot paths.
- `benchstat`: compare two `go test -bench` runs statistically. Reports geomean and significance.
23.3 Mechanical Detail - Common Wins¶
- Replace `interface{}` boxing in hot paths with concrete types or generics.
- Reuse allocations via `sync.Pool` (with the discipline from Week 8).
- Pre-size slices and maps when capacity is known: `make([]T, 0, n)`, `make(map[K]V, n)`.
- Avoid `defer` in hot loops (Go 1.14+ made `defer` ~zero-cost in most cases, but the loop variant still has overhead).
- `strings.Builder` over `+=` for building strings.
- Slice instead of map for small collections (<~50 entries): linear scan beats hash on modern caches.
- Goroutine cost: launching is cheap, but the aggregate of millions of goroutines on a long-tail-blocked path is not. Bound concurrency.
23.4 Lab - "Profile-Tune-Profile"¶
Take your capstone (whatever track) and:
1. Capture a CPU profile under representative load. Identify the top 5 functions.
2. Pick one and propose a fix. Estimate the win in advance.
3. Implement, re-profile, compare with `benchstat`. Document each change in `PERF_LOG.md`.
4. Capture a `runtime/trace` and identify any GC or scheduler stalls. Fix one.
5. Apply PGO. Confirm the win.
23.5 Idiomatic & golangci-lint Drill¶
`prealloc`; `gosimple` S1024 (`time.Until(x)` instead of `x.Sub(time.Now())`); `gocritic`: `rangeValCopy`. Final lint pass - your codebase should be near-zero findings.
23.6 Production Hardening Slice¶
- Wire PGO into your release pipeline: a "canary" deploy collects a profile, and the next "stable" build uses it. Document the procedure in `RELEASE.md`.