Week 23 - Performance Tuning at Scale
23.1 Conceptual Core
- Tuning at scale is systematic, not "tweak `vm.swappiness`." It is: measure, hypothesize, change one variable, re-measure, document.
- Brendan Gregg's USE method: for every resource, characterize Utilization, Saturation, and Errors. Apply it to CPU, memory, disk, and network.
- The common bottlenecks, in rough rank order: I/O (latency or throughput), memory pressure (PSI), syscall-frequency or context-switch storms, lock contention, NIC drops.
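The USE mapping can be kept at hand as a cheat sheet. A sketch (the tool choices follow this week's material; the error-column sources are suggestions, and `use_table` is a name invented here):

```shell
#!/bin/sh
# USE-method cheat sheet: where to read Utilization, Saturation, and
# Errors for each resource. A sketch only - substitute your own tools.
use_table() {
  fmt='%-8s | %-22s | %-24s | %s\n'
  printf "$fmt" RESOURCE UTILIZATION SATURATION ERRORS
  printf "$fmt" CPU     'mpstat -P ALL 1'      'vmstat 1 (r column)'    'dmesg / mcelog'
  printf "$fmt" Memory  'free -m'              '/proc/pressure/memory'  'dmesg | grep -i oom'
  printf "$fmt" Disk    'iostat -xz 1 (%util)' 'iostat -xz 1 (await)'   'dmesg, SMART counters'
  printf "$fmt" Network 'sar -n DEV 1'         'drops / overruns'       'ethtool -S <iface>'
}
use_table
```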
23.2 Mechanical Detail
- CPU: `mpstat -P ALL 1`, `pidstat 1`. Look for a single pegged CPU. Check for IRQ imbalance (`/proc/interrupts`).
- Memory: `vmstat 1`, `/proc/pressure/memory`. Sustained PSI above 0% is a problem.
- Disk: `iostat -xz 1`. Look at `await` and `%util`; `%util` above 70% with `await` above 10 ms means the device is saturated.
- Network: `sar -n DEV 1`, `ethtool -S <iface>`. Watch for drops, errors, and frame-too-long counts.
- Kernel: `perf top`. If `__do_softirq` or `_raw_spin_lock` is hot, dig further.
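The memory check above can be scripted. A hedged helper (the function name is invented here) that extracts the `some avg10` figure from a PSI file, whose `some avg10=N avg60=N avg300=N total=N` line format has been stable since Linux 4.20:

```shell
#!/bin/sh
# Print the "some" avg10 value from a PSI file (default: memory PSI).
# Anything persistently above 0 means tasks are stalling on that resource.
psi_some_avg10() {
  awk '$1 == "some" {
    for (i = 2; i <= NF; i++)
      if (split($i, kv, "=") == 2 && kv[1] == "avg10") print kv[2]
  }' "${1:-/proc/pressure/memory}"
}
```

Usage: `psi_some_avg10` on a host with PSI enabled, or pass any file in the same format for testing.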
23.3 Lab - "Triage Drill"
A scripted "broken host" is provided (or build one): a VM with one of {disk-bound, memory-bound, network-bound, lock-contended, scheduler-thrashing} pathologies. Diagnose using only the tools above. Document the inference chain. Then introduce a fix and verify.
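If no scripted broken host is provided, one can be improvised. A sketch covering the five pathologies (assumes `stress-ng` and `iperf3` are installed for those cases; the `pathology` name, the `<receiver>` placeholder, and the `DRY_RUN` flag are all inventions of this sketch):

```shell
#!/bin/sh
# Induce one pathology on a *disposable* VM. DRY_RUN=1 prints the
# command instead of running it.
pathology() {
  case "$1" in
    disk)      cmd='dd if=/dev/zero of=/var/tmp/junk bs=1M count=4096 oflag=direct' ;;
    memory)    cmd='stress-ng --vm 4 --vm-bytes 90% --timeout 300s' ;;
    network)   cmd='iperf3 -c <receiver> -P 8 -t 300' ;;
    lock)      cmd='stress-ng --mutex 8 --timeout 300s' ;;
    scheduler) cmd='stress-ng --switch 64 --timeout 300s' ;;
    *) echo 'usage: pathology {disk|memory|network|lock|scheduler}' >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "$cmd"; else sh -c "$cmd"; fi
}
```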
23.4 Hardening Drill
- Codify the USE-method dashboard for your environment. Wire to Prometheus/Grafana. Alert on saturation > 80% for any resource for 5 min.
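Before wiring the dashboard, the alert condition itself ("above 80% for 5 minutes") is worth testing in isolation. A sketch over timestamped samples (`alert_check` is a name invented here; it assumes the samples actually span the window):

```shell
#!/bin/sh
# Read "epoch value" pairs from stdin; print FIRING if every sample in
# the trailing 300 s window exceeds 80, else OK.
alert_check() {
  awk -v win=300 -v thr=80 '
    { t[NR] = $1; v[NR] = $2 }
    END {
      if (NR == 0) { print "OK"; exit }
      start = t[NR] - win
      for (i = NR; i >= 1 && t[i] >= start; i--)
        if (v[i] <= thr) { print "OK"; exit }
      print "FIRING"
    }'
}
```

In Prometheus terms this is roughly an alert rule with a `for: 5m` clause; the awk version just makes the semantics explicit.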
23.5 Performance Tuning Slice
- Sysctl baseline for high-throughput servers (review and adapt; never copy blindly). Pair each line with a justification in your runbook.
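A commonly seen starting set, written as `sysctl -w` commands. The values are illustrative assumptions, not recommendations; each comment is the kind of one-line justification the runbook should carry:

```shell
#!/bin/sh
# Illustrative baseline only - validate every value against your workload.
sysctl -w net.core.somaxconn=4096             # deeper accept queue for busy listeners
sysctl -w net.ipv4.tcp_max_syn_backlog=8192   # absorb SYN bursts without drops
sysctl -w net.core.netdev_max_backlog=8192    # larger NIC ingress backlog per CPU
sysctl -w net.core.rmem_max=16777216          # permit large socket receive buffers
sysctl -w net.core.wmem_max=16777216          # permit large socket send buffers
sysctl -w vm.swappiness=10                    # prefer reclaiming page cache over swapping
```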