# Week 10 - Control Groups v2

## 10.1 Conceptual Core
cgroups (control groups) are the kernel mechanism for resource limits and accounting. In v2 every process belongs to exactly one cgroup, and the controllers attached to the hierarchy enforce CPU, memory, I/O, and other limits on the group as a whole rather than per-process.
v2 is the unified hierarchy: a single tree under `/sys/fs/cgroup/`, with every controller attached to it. v1 had a separate tree per controller and a long list of design lessons; v2 is the cleanup. All new code should target v2 - major distros have defaulted to it since 2019-2020 (`systemd.unified_cgroup_hierarchy=1`).
Controllers available in v2: `cpu`, `memory`, `io`, `pids`, `cpuset`, `hugetlb`, `rdma`, `misc`. Plus the implicit freezer mechanism (`cgroup.freeze`).
This is the foundation that Kubernetes resource limits, Docker `--memory`, and systemd `MemoryMax=` all sit on top of. Master it and Kubernetes' "OOMKilled" alerts become diagnosable rather than mysterious.
## 10.2 Mechanical Detail
- Filesystem layout: `/sys/fs/cgroup/` is a single mount in v2. Each subdirectory is a cgroup; create one with `mkdir`. Files inside (`cpu.max`, `memory.max`, ...) are the controller knobs.
- Memory controller files you'll touch most:
  - `memory.low` - protection threshold; the kernel reclaims from cgroups exceeding their `low` before touching protected ones.
  - `memory.high` - soft target; exceeding it throttles allocations (slows the process) but doesn't OOM.
  - `memory.max` - hard cap; exceeding it triggers cgroup-level OOM.
  - `memory.events` - counters: `low`, `high`, `max`, `oom`, `oom_kill`. Read these to alert on pressure.
  - `memory.pressure` - PSI (Pressure Stall Information): time spent waiting on memory. The signal for "I'm not OOM, but I'm not happy either."
- CPU controller:
  - `cpu.weight` - proportional share (default 100, range 1-10000). Under contention, slices are weight-proportional.
  - `cpu.max` - bandwidth as `"$quota $period"` in microseconds (e.g., `"50000 100000"` = 50% of one CPU). Kubernetes CPU limits compile to this.
- IO controller:
  - `io.weight` - proportional, like `cpu.weight`.
  - `io.max` - per-device bandwidth and IOPS caps: `"8:0 rbps=10485760 wbps=10485760 riops=100 wiops=100"`.
- PIDs controller:
  - `pids.max` - fork-bomb protection; set it to a sane upper bound per service.
- Moving a process: `echo $PID > /sys/fs/cgroup/foo/cgroup.procs`. Existing children stay where they are; processes forked after the move start in the new cgroup.
- systemd integration: every systemd unit gets its own cgroup automatically. `systemd-cgls` shows the tree; `systemd-cgtop` shows live resource usage; `systemctl set-property foo.service MemoryMax=2G` updates the limit live.
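The raw-filesystem workflow above can be sketched end to end. A minimal sketch, assuming root and a cgroup v2 mount; the `demo` name is illustrative, and the `CGROUP_ROOT` variable (not part of the kernel interface) lets you dry-run the writes against a scratch directory:

```shell
# Create a cgroup, cap it, and move the current shell into it.
# CGROUP_ROOT is /sys/fs/cgroup on a real host (needs root); point it
# at a scratch directory to dry-run.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"
mkdir -p "$CGROUP_ROOT/demo"

echo "50000 100000" > "$CGROUP_ROOT/demo/cpu.max"    # 50ms quota per 100ms period = 50% of one CPU
echo "2G"           > "$CGROUP_ROOT/demo/memory.max" # hard cap
# memory.high at ~80% of the 2G cap, in bytes
echo $(( 2 * 1024 * 1024 * 1024 * 80 / 100 )) > "$CGROUP_ROOT/demo/memory.high"
echo 512 > "$CGROUP_ROOT/demo/pids.max"              # fork-bomb ceiling

echo $$ > "$CGROUP_ROOT/demo/cgroup.procs"           # move this shell in
```

On a real host you also need the relevant controllers enabled in the parent's `cgroup.subtree_control` before the knob files appear.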
**The trap**

Setting `memory.max` without `memory.high`. When the workload spikes, you go straight from "fine" to OOM-killed with no warning signal. Set `memory.high` to ~80% of `memory.max` so you see throttling (and the `high` counter in `memory.events` ticks) before the kill arrives - your alerting can fire on `high` events rather than waiting for `oom_kill`.
## 10.3 Lab - "Multi-Tenant Cgroups"
- Create three sibling cgroups: `tenant-a`, `tenant-b`, `tenant-c` under `/sys/fs/cgroup/test/`.
- Set `cpu.weight` to 100/200/400; under contention (run `stress-ng --cpu N` in each), verify the 1:2:4 split with `top`.
- Set `memory.high=1G memory.max=2G` on each, run a memory hog (`stress-ng --vm 1 --vm-bytes 3G`), observe throttling first (`memory.events` `high` ticks, latency increases), then OOM (`memory.events` `oom_kill` ticks).
- Set `io.max` to limit disk bandwidth on a specific device for one cgroup; run `fio` inside, verify with `iostat -x 1`.
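The scaffolding for the first three steps can be scripted. A sketch, assuming root and cgroup v2 at `/sys/fs/cgroup`; the `ROOT` variable is illustrative and lets you dry-run against a scratch directory, and on a real host the controllers must first be enabled for children via `cgroup.subtree_control`:

```shell
# Three sibling tenants with 1:2:4 CPU weights and identical memory limits.
ROOT="${ROOT:-/sys/fs/cgroup}"
mkdir -p "$ROOT/test"
# On a real host, delegate controllers down the tree first:
#   echo "+cpu +memory +io" > "$ROOT/cgroup.subtree_control"
#   echo "+cpu +memory +io" > "$ROOT/test/cgroup.subtree_control"

w=100
for t in a b c; do
    mkdir -p "$ROOT/test/tenant-$t"
    echo "$w" > "$ROOT/test/tenant-$t/cpu.weight"
    echo 1G   > "$ROOT/test/tenant-$t/memory.high"
    echo 2G   > "$ROOT/test/tenant-$t/memory.max"
    w=$(( w * 2 ))
done
```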
## 10.4 Hardening Drill
For every long-running service on a managed host, set explicit values for `MemoryHigh=`, `MemoryMax=`, `CPUQuota=`, `TasksMax=`, and `IOWeight=`. Document the policy in RESOURCE_POLICY.md (one row per service). Sensible defaults: `MemoryHigh` at 80% of `MemoryMax`; `CPUQuota` per service tier; `TasksMax` at expected concurrency × 4 for headroom.
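As a systemd drop-in, the policy for one service might look like this (the service name, path, and numbers are illustrative; `MemoryHigh` is 80% of the 2G `MemoryMax`):

```ini
# /etc/systemd/system/foo.service.d/resources.conf (illustrative path)
[Service]
MemoryHigh=1638M
MemoryMax=2G
CPUQuota=200%
TasksMax=512
IOWeight=100
```

Apply with `systemctl daemon-reload` followed by a restart of the unit, or use `systemctl set-property` for a live change.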
## 10.5 Performance Tuning Slice
Wire two Prometheus metrics from every cgroup:

- `cgroup_memory_events_total{event="high|max|oom_kill"}` from `memory.events`.
- `cgroup_pressure_seconds_total` from `memory.pressure` / `cpu.pressure` / `io.pressure` (PSI).

Alert on a sustained nonzero `high` rate (early warning), and on any `oom_kill` (incident). PSI is the right signal for "approaching saturation" - it fires before any hard limit is hit.