
Week 23 - Performance Tuning at Scale

23.1 Conceptual Core

  • Tuning at scale is systematic, not "tweak vm.swappiness." It is: measure, hypothesize, change one variable, re-measure, document.
  • Brendan Gregg's USE method: for every resource, characterize Utilization, Saturation, and Errors. Apply it to CPU, memory, disk, and network; a scripted first pass is sketched after this list.
  • The common bottlenecks, in rough rank order: I/O (latency or throughput), memory pressure (PSI, Pressure Stall Information), excessive syscall or context-switch rates, lock contention, and NIC drops.
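
A first pass is easier to repeat when it is scripted. A minimal sketch of a one-shot USE sweep, assuming sysstat is installed and the kernel exposes PSI (the script name and sample intervals are illustrative):

    #!/bin/sh
    # use-sweep.sh - one point-in-time USE pass per resource.
    echo "== CPU: utilization (per core) and saturation (run queue) =="
    mpstat -P ALL 1 1
    awk '{print "running/total threads:", $4}' /proc/loadavg
    echo "== Memory: saturation via PSI =="
    cat /proc/pressure/memory      # sustained 'some' avg10 > 0 means stalls
    echo "== Disk: utilization and latency =="
    iostat -xz 1 1
    echo "== Network: errors and drops =="
    sar -n EDEV 1 1                # per-interface error/drop rates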

23.2 Mechanical Detail

  • CPU: mpstat -P ALL 1, pidstat 1. Look for a single pegged CPU. Check for IRQ imbalance (/proc/interrupts).
  • Memory: vmstat 1, /proc/pressure/memory. Sustained PSI above 0% means tasks are stalling on memory; brief spikes are normal.
  • Disk: iostat -xz 1. Look at await and %util. %util > 70% with await > 10 ms suggests saturation; on parallel devices like NVMe, %util alone misleads, so weight await more heavily.
  • Network: sar -n DEV 1, ethtool -S <iface>. Drops, errors, frame-too-long counts.
  • Kernel: perf top. If __do_softirq or _raw_spin_lock is hot, dig further. A script bundling these checks appears after this list.
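
It helps to capture all of these signals in one pass so every incident starts from the same evidence. A minimal sketch of such a bundle, assuming sysstat and ethtool are installed (the interface default and sample counts are placeholders):

    #!/bin/sh
    # triage.sh - collect the signal set above into one timestamped log.
    IFACE=${1:-eth0}                # placeholder; pass your real interface
    LOG="triage-$(date +%Y%m%dT%H%M%S).log"
    {
      echo "== CPU per core ==";     mpstat -P ALL 1 3
      echo "== Per-process CPU =="; pidstat 1 3
      echo "== IRQ distribution =="; cat /proc/interrupts
      echo "== Memory ==";           vmstat 1 3
      echo "== Memory PSI ==";       cat /proc/pressure/memory
      echo "== Disk ==";             iostat -xz 1 3
      echo "== Network ==";          sar -n DEV 1 3
      echo "== NIC counters ==";     ethtool -S "$IFACE"
    } > "$LOG" 2>&1
    echo "wrote $LOG; run perf top interactively for kernel hot spots"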

23.3 Lab - "Triage Drill"

A scripted "broken host" is provided (or build one): a VM with one of {disk-bound, memory-bound, network-bound, lock-contended, scheduler-thrashing} pathologies. Diagnose using only the tools above. Document the inference chain. Then introduce a fix and verify.
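
If no scripted host is available, stress-ng can induce most of the listed pathologies. A minimal sketch, assuming stress-ng is installed: worker counts and durations are illustrative, --lockbus is x86-only, the socket stressor stays on loopback (use iperf3 between two hosts for a true NIC-bound case), and scheduler thrashing is approximated by heavy CPU oversubscription:

    # Pick ONE pathology per drill; diagnose blind, then compare notes.
    stress-ng --hdd 4 --timeout 10m                      # disk-bound
    stress-ng --vm 2 --vm-bytes 90% --timeout 10m        # memory-bound
    stress-ng --sock 4 --timeout 10m                     # socket churn (loopback)
    stress-ng --lockbus 2 --timeout 10m                  # lock/bus contention
    stress-ng --cpu $(( $(nproc) * 8 )) --timeout 10m    # scheduler thrashing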

23.4 Hardening Drill

  • Codify the USE-method dashboard for your environment. Wire it to Prometheus/Grafana. Alert on saturation > 80% for any resource for 5 min; a sketch of the rule file follows.
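
As a concrete starting point, the saturation alerts can be written against node_exporter's PSI metrics and validated with promtool. A minimal sketch, assuming node_exporter's pressure collector is enabled; the file name and summary text are assumptions to adapt:

    cat > use-alerts.yml <<'EOF'
    groups:
      - name: use-saturation
        rules:
          - alert: MemoryPressureSustained
            # PSI: fraction of wall-clock time some task stalled on memory.
            expr: rate(node_pressure_memory_waiting_seconds_total[5m]) > 0.8
            for: 5m
            annotations:
              summary: "Memory saturation > 80% for 5m on {{ $labels.instance }}"
          - alert: IOPressureSustained
            expr: rate(node_pressure_io_waiting_seconds_total[5m]) > 0.8
            for: 5m
            annotations:
              summary: "I/O saturation > 80% for 5m on {{ $labels.instance }}"
          - alert: CPUPressureSustained
            expr: rate(node_pressure_cpu_waiting_seconds_total[5m]) > 0.8
            for: 5m
            annotations:
              summary: "CPU saturation > 80% for 5m on {{ $labels.instance }}"
    EOF
    promtool check rules use-alerts.yml   # validate before deploying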

23.5 Performance Tuning Slice

  • Sysctl baseline for high-throughput servers (review and adapt; never copy blindly):
    net.core.somaxconn = 4096
    net.ipv4.tcp_max_syn_backlog = 4096
    net.core.netdev_max_backlog = 16384
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_fin_timeout = 15
    vm.dirty_ratio = 10
    vm.dirty_background_ratio = 3
    vm.swappiness = 10
    fs.file-max = 2097152
    
    Each line should be paired with a justification in your runbook.
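
    However the values are chosen, ship them as a drop-in file rather than ad-hoc sysctl -w calls, so they persist across reboots and are reviewable. A minimal sketch (the file name is arbitrary):

      # Install the reviewed settings as a sysctl drop-in, then load and verify.
      install -m 0644 99-highthroughput.conf /etc/sysctl.d/99-highthroughput.conf
      sysctl --system                          # reload every sysctl.d drop-in
      sysctl net.core.somaxconn vm.swappiness  # spot-check effective values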
