Week 23 - Performance Tuning at Scale
23.1 Conceptual Core
- Tuning at scale is systematic, not "tweak `vm.swappiness`." It is: measure, hypothesize, change one variable, re-measure, document.
- Brendan Gregg's USE method: for every resource, characterize Utilization, Saturation, and Errors. Apply it to CPU, memory, disk, and network.
- The common bottlenecks, in rough rank order: I/O (latency or throughput), memory pressure (PSI), syscall-frequency or context-switch storms, lock contention, NIC drops.
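The USE mapping can be kept at hand as a cheat sheet. A sketch (the tool choices follow this week's material; the error-column sources are suggestions, and `use_table` is a name invented here):

```shell
#!/bin/sh
# USE-method cheat sheet: where to read Utilization, Saturation, and
# Errors for each resource. A sketch only - substitute your own tools.
use_table() {
  fmt='%-8s | %-22s | %-24s | %s\n'
  printf "$fmt" RESOURCE UTILIZATION SATURATION ERRORS
  printf "$fmt" CPU     'mpstat -P ALL 1'      'vmstat 1 (r column)'    'dmesg / mcelog'
  printf "$fmt" Memory  'free -m'              '/proc/pressure/memory'  'dmesg | grep -i oom'
  printf "$fmt" Disk    'iostat -xz 1 (%util)' 'iostat -xz 1 (await)'   'dmesg, SMART counters'
  printf "$fmt" Network 'sar -n DEV 1'         'drops / overruns'       'ethtool -S <iface>'
}
use_table
```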
23.2 Mechanical Detail
- CPU: `mpstat -P ALL 1`, `pidstat 1`. Look for a single pegged CPU. Check for IRQ imbalance (`/proc/interrupts`).
- Memory: `vmstat 1`, `/proc/pressure/memory`. Sustained PSI above 0% is a problem.
- Disk: `iostat -xz 1`. Look at `await` and `%util`; `%util` above 70% with `await` above 10 ms means the device is saturated.
- Network: `sar -n DEV 1`, `ethtool -S <iface>`. Watch for drops, errors, and frame-too-long counts.
- Kernel: `perf top`. If `__do_softirq` or `_raw_spin_lock` is hot, dig further.
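The memory check above can be scripted. A hedged helper (the function name is invented here) that extracts the `some avg10` figure from a PSI file, whose `some avg10=N avg60=N avg300=N total=N` line format has been stable since Linux 4.20:

```shell
#!/bin/sh
# Print the "some" avg10 value from a PSI file (default: memory PSI).
# Anything persistently above 0 means tasks are stalling on that resource.
psi_some_avg10() {
  awk '$1 == "some" {
    for (i = 2; i <= NF; i++)
      if (split($i, kv, "=") == 2 && kv[1] == "avg10") print kv[2]
  }' "${1:-/proc/pressure/memory}"
}
```

Usage: `psi_some_avg10` on a host with PSI enabled, or pass any file in the same format for testing.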
23.3 Lab - "Triage Drill"
A scripted "broken host" is provided (or build one): a VM with one of {disk-bound, memory-bound, network-bound, lock-contended, scheduler-thrashing} pathologies. Diagnose using only the tools above. Document the inference chain. Then introduce a fix and verify.
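If no scripted broken host is provided, one can be improvised. A sketch covering the five pathologies (assumes `stress-ng` and `iperf3` are installed for those cases; the `pathology` name, the `<receiver>` placeholder, and the `DRY_RUN` flag are all inventions of this sketch):

```shell
#!/bin/sh
# Induce one pathology on a *disposable* VM. DRY_RUN=1 prints the
# command instead of running it.
pathology() {
  case "$1" in
    disk)      cmd='dd if=/dev/zero of=/var/tmp/junk bs=1M count=4096 oflag=direct' ;;
    memory)    cmd='stress-ng --vm 4 --vm-bytes 90% --timeout 300s' ;;
    network)   cmd='iperf3 -c <receiver> -P 8 -t 300' ;;
    lock)      cmd='stress-ng --mutex 8 --timeout 300s' ;;
    scheduler) cmd='stress-ng --switch 64 --timeout 300s' ;;
    *) echo 'usage: pathology {disk|memory|network|lock|scheduler}' >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "$cmd"; else sh -c "$cmd"; fi
}
```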
23.4 Hardening Drill
- Codify the USE-method dashboard for your environment. Wire to Prometheus/Grafana. Alert on saturation > 80% for any resource for 5 min.
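Before wiring the dashboard, the alert condition itself ("above 80% for 5 minutes") is worth testing in isolation. A sketch over timestamped samples (`alert_check` is a name invented here; it assumes the samples actually span the window):

```shell
#!/bin/sh
# Read "epoch value" pairs from stdin; print FIRING if every sample in
# the trailing 300 s window exceeds 80, else OK.
alert_check() {
  awk -v win=300 -v thr=80 '
    { t[NR] = $1; v[NR] = $2 }
    END {
      if (NR == 0) { print "OK"; exit }
      start = t[NR] - win
      for (i = NR; i >= 1 && t[i] >= start; i--)
        if (v[i] <= thr) { print "OK"; exit }
      print "FIRING"
    }'
}
```

In Prometheus terms this is roughly an alert rule with a `for: 5m` clause; the awk version just makes the semantics explicit.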
23.5 Performance Tuning Slice
- Sysctl baseline for high-throughput servers (review and adapt; never copy blindly). Pair each line with a justification in your runbook.
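A commonly seen starting set, written as `sysctl -w` commands. The values are illustrative assumptions, not recommendations; each comment is the kind of one-line justification the runbook should carry:

```shell
#!/bin/sh
# Illustrative baseline only - validate every value against your workload.
sysctl -w net.core.somaxconn=4096             # deeper accept queue for busy listeners
sysctl -w net.ipv4.tcp_max_syn_backlog=8192   # absorb SYN bursts without drops
sysctl -w net.core.netdev_max_backlog=8192    # larger NIC ingress backlog per CPU
sysctl -w net.core.rmem_max=16777216          # permit large socket receive buffers
sysctl -w net.core.wmem_max=16777216          # permit large socket send buffers
sysctl -w vm.swappiness=10                    # prefer reclaiming page cache over swapping
```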