
Week 10 - Control Groups v2

10.1 Conceptual Core

cgroups (control groups) are the kernel mechanism for resource limits and accounting. In v2, every process belongs to exactly one cgroup in a single unified tree; the controllers attached to that tree enforce CPU, memory, I/O, and other limits on the group collectively rather than per-process.

v2 is the unified hierarchy: a single tree under /sys/fs/cgroup/, with every controller attached to it. v1 had a separate tree per controller and a long list of design lessons; v2 is the cleanup. All new code should target v2 - major distros have defaulted to it since roughly 2019-2021, and the boot parameter systemd.unified_cgroup_hierarchy=1 forces it on older ones.
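A quick way to confirm which hierarchy a host is actually on (nothing assumed beyond the standard mount point):

```
# "cgroup2fs" means the unified v2 hierarchy; "tmpfs" means v1 or hybrid
stat -fc %T /sys/fs/cgroup/
```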

Controllers available in v2: cpu, memory, io, pids, cpuset, hugetlb, rdma, misc. Plus the implicit freezer mechanism (cgroup.freeze).
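Which of these a given host exposes depends on the kernel build; the root's cgroup.controllers file is the authoritative list:

```
cat /sys/fs/cgroup/cgroup.controllers
# typical output: cpuset cpu io memory hugetlb pids rdma misc
```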

This is the foundation that Kubernetes resource limits, Docker --memory, and systemd MemoryMax= all sit on top of. Master it and Kubernetes' "OOMKilled" alerts become diagnosable rather than mysterious.

10.2 Mechanical Detail

  • Filesystem layout: /sys/fs/cgroup/ is a single mount in v2. Each subdirectory is a cgroup; create one with mkdir. Files inside (cpu.max, memory.max, ...) are the controller knobs (worked example after this list).
  • Memory controller files you'll touch most:
    • memory.low - best-effort protection; usage below this threshold is skipped by reclaim as long as unprotected cgroups still have memory to give back.
    • memory.high - soft target; exceeding it throttles allocations (slows the process) but doesn't OOM.
    • memory.max - hard cap; exceeding it triggers cgroup-level OOM.
    • memory.events - counters: low, high, max, oom, oom_kill. Read these to alert on pressure.
    • memory.pressure - PSI (Pressure Stall Information): time spent waiting on memory. The signal for "I'm not OOM, but I'm not happy either."
  • CPU controller:
    • cpu.weight - proportional share (default 100, range 1-10000). Under contention, slices are weight-proportional.
    • cpu.max - bandwidth as "$quota $period" microseconds (e.g., "50000 100000" = 50% of one CPU). Kubernetes CPU limits compile to this.
  • IO controller:
    • io.weight - proportional, like cpu.weight.
    • io.max - per-device bandwidth and IOPS caps: "8:0 rbps=10485760 wbps=10485760 riops=100 wiops=100".
  • PIDs controller: pids.max - fork-bomb protection; set to a sane upper bound per service.
  • Moving a process: echo $PID > /sys/fs/cgroup/foo/cgroup.procs. Children forked after the move start in the new cgroup; children that already exist stay where they are and must be moved individually.
  • systemd integration: every systemd unit gets its own cgroup automatically. systemd-cgls shows the tree; systemd-cgtop shows live resource usage; systemctl set-property foo.service MemoryMax=2G updates limits live.
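A minimal end-to-end sketch tying those knobs together. The cgroup name demo is made up, the commands assume a root shell on a v2 host, and one detail the list above glosses over: a controller's knob files only appear in a child after the controller is enabled in the parent's cgroup.subtree_control (on a systemd machine, prefer doing this inside a delegated subtree).

```
cd /sys/fs/cgroup

# Enable controllers for children (may already be set on systemd hosts)
echo "+cpu +memory +io +pids" > cgroup.subtree_control

mkdir demo                          # knob files appear automatically

echo 1638M > demo/memory.high       # throttle point, ~80% of the cap
echo 2G    > demo/memory.max        # hard cap; exceeding it = cgroup OOM

echo "50000 100000" > demo/cpu.max  # 50ms quota per 100ms period = 50% CPU
echo 200            > demo/cpu.weight

echo 512 > demo/pids.max            # fork-bomb ceiling

echo $$ > demo/cgroup.procs         # move this shell; future children follow

cat demo/memory.events              # low/high/max/oom/oom_kill counters
cat demo/memory.pressure            # PSI: some/full avg10 avg60 avg300 total
```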
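The same limits through systemd, which owns the cgroup tree on most hosts (the unit name app.service is a placeholder):

```
systemd-cgls     # show the tree
systemd-cgtop    # live per-cgroup resource usage

# Update a running unit in place
systemctl set-property app.service MemoryHigh=1638M MemoryMax=2G CPUQuota=50% TasksMax=512

# Or run an ad-hoc command in its own transient cgroup
systemd-run --scope -p MemoryMax=2G -p CPUQuota=50% -- stress-ng --vm 1 --vm-bytes 3G
```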

The trap

Setting memory.max without memory.high. When the workload spikes, you go straight from "fine" to OOM-killed with no warning signal. Set memory.high to ~80% of memory.max so you see throttling (and memory.events.high counter ticks) before the kill arrives - your alerting can fire on high events rather than waiting for oom_kill.
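In raw cgroupfs terms (the path /sys/fs/cgroup/app is hypothetical, and 80% is the rule of thumb above, not a kernel requirement):

```
CG=/sys/fs/cgroup/app
MAX=$((2 * 1024 * 1024 * 1024))              # 2 GiB hard cap
echo $MAX                > $CG/memory.max
echo $((MAX * 80 / 100)) > $CG/memory.high   # throttling starts here

grep '^high' $CG/memory.events               # alert when this counter ticks
```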

10.3 Lab - "Multi-Tenant Cgroups"

  1. Create three sibling cgroups: tenant-a, tenant-b, tenant-c under /sys/fs/cgroup/test/.
  2. Set cpu.weight 100/200/400 - under contention (run stress-ng --cpu N in each), verify the 1:2:4 split with top (command sketch after this list).
  3. Set memory.high=1G memory.max=2G on each, run a memory hog (stress-ng --vm 1 --vm-bytes 3G), observe throttling first (memory.events.high ticks, latency increases) then OOM (memory.events.oom_kill ticks).
  4. Set io.max to limit disk bandwidth on a specific device for one cgroup; run fio inside, verify with iostat -x 1.
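One possible command sequence for steps 1-2 (CPU counts and the stress-ng invocations are assumptions for your own host):

```
echo "+cpu +memory +io" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/test
echo "+cpu +memory +io" > /sys/fs/cgroup/test/cgroup.subtree_control

for t in a b c; do mkdir /sys/fs/cgroup/test/tenant-$t; done
echo 100 > /sys/fs/cgroup/test/tenant-a/cpu.weight
echo 200 > /sys/fs/cgroup/test/tenant-b/cpu.weight
echo 400 > /sys/fs/cgroup/test/tenant-c/cpu.weight

# One CPU-saturating worker per tenant, placed via a subshell
for t in a b c; do
  ( echo $BASHPID > /sys/fs/cgroup/test/tenant-$t/cgroup.procs
    exec stress-ng --cpu "$(nproc)" ) &
done
# With everything contending, top should settle near a 1:2:4 split
```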

10.4 Hardening Drill

For every long-running service on a managed host, set explicit values for: MemoryHigh=, MemoryMax=, CPUQuota=, TasksMax=, IOWeight=. Document the policy in RESOURCE_POLICY.md (one row per service). The right defaults: MemoryHigh at 80% of MemoryMax; CPUQuota per service tier; TasksMax at expected concurrency × 4 for headroom.
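One way to encode the policy per service is a drop-in; the service name and numbers below are placeholders, but the directives are standard systemd resource-control settings:

```
# /etc/systemd/system/app.service.d/resources.conf
[Service]
MemoryHigh=1638M   # ~80% of MemoryMax
MemoryMax=2G
CPUQuota=50%
TasksMax=512       # expected concurrency x 4
IOWeight=100
```

Apply with systemctl daemon-reload && systemctl restart app.service, and record the same numbers in RESOURCE_POLICY.md.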

10.5 Performance Tuning Slice

Wire two Prometheus metrics from every cgroup:

  • cgroup_memory_events_total{event="high|max|oom_kill"} from memory.events.
  • cgroup_pressure_seconds_total from memory.pressure / cpu.pressure / io.pressure (PSI).

Alert on a sustained high rate > 0 (early warning), and on any oom_kill (incident). PSI is the right signal for "approaching saturation" - it fires before any hard limit is hit.
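A minimal sketch of the export side, assuming node_exporter's textfile collector (the output path is an assumption; the PSI files parse analogously and are left out for brevity):

```
#!/bin/bash
# Emit memory.events counters for every cgroup in Prometheus text format
OUT=/var/lib/node_exporter/textfile/cgroup_memory.prom
{
  echo '# TYPE cgroup_memory_events_total counter'
  find /sys/fs/cgroup -name memory.events | while read -r f; do
    cg=${f%/memory.events}; cg=${cg#/sys/fs/cgroup}
    while read -r event count; do
      echo "cgroup_memory_events_total{cgroup=\"${cg:-/}\",event=\"$event\"} $count"
    done < "$f"
  done
} > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"   # atomic swap so scrapes never see a partial file
```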
