Worked example - Week 10: cgroups v2 on your laptop, end to end¶
Companion to Linux Kernel → Month 03 → Week 10: Cgroups v2. The week explains the unified hierarchy and the controllers. This page is a hands-on tour: create a cgroup, put a process in it, limit it, watch the limit bite.
You need Linux 5.x+ with cgroups v2 enabled (default since ~2020 on most distros). Verify:
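One quick check is the filesystem type mounted at /sys/fs/cgroup/ (cgroup2fs means pure v2):
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs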
If you see tmpfs instead, you're on cgroups v1 / hybrid mode - boot with systemd.unified_cgroup_hierarchy=1 and reboot, or use a recent Fedora/Arch/Ubuntu 22.04+ where v2 is the default.
The cgroup filesystem¶
cgroups v2 is just a filesystem. Every directory under /sys/fs/cgroup/ is a cgroup; child directories are nested cgroups; the files inside control what the cgroup does.
$ ls /sys/fs/cgroup/
cgroup.controllers cgroup.procs cpu.stat memory.current ...
cgroup.max.depth cgroup.subtree_control io.stat memory.max ...
- cgroup.controllers - which controllers are available in this cgroup.
- cgroup.subtree_control - which of those are enabled for its children.
- cgroup.procs - PIDs currently in this cgroup (the root starts with all of them).
- memory.max, cpu.max, io.max - controller-specific knobs.
Create a cgroup¶
$ sudo mkdir /sys/fs/cgroup/demo
$ ls /sys/fs/cgroup/demo
cgroup.controllers cgroup.events cgroup.procs cgroup.stat cgroup.type ...
The kernel populated the new directory with the standard files. Some are read-only (cgroup.controllers), some are writable knobs.
But notice: there are no memory.max or cpu.max files yet. Controllers must be enabled by the parent before they appear in a child:
$ cat /sys/fs/cgroup/cgroup.subtree_control
(empty)
$ echo "+memory +cpu +io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
$ ls /sys/fs/cgroup/demo | grep -E '^(memory|cpu|io)'
cpu.idle
cpu.max
cpu.stat
io.max
io.stat
memory.current
memory.events
memory.max
memory.peak
memory.stat
...
Now the controllers are available in /sys/fs/cgroup/demo/.
Limit memory¶
Cap the cgroup at 100 MB:
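$ echo 104857600 | sudo tee /sys/fs/cgroup/demo/memory.max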
Or use the M / G suffix syntax (kernel parses it):
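$ echo 100M | sudo tee /sys/fs/cgroup/demo/memory.max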
That's the hard limit. When the cgroup pushes past it, the kernel first tries to reclaim memory; if it can't get usage back under the limit, it OOM-kills inside the cgroup.
Put a process in the cgroup¶
To move a process in, write its PID to cgroup.procs:
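$ sudo bash -c 'echo $$ > /sys/fs/cgroup/demo/cgroup.procs && exec bash'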
The trick: $$ is the PID of the subshell before exec replaces it; writing to cgroup.procs moves the PID; exec bash then replaces the subshell so the new bash inherits the cgroup. From this new shell, every child process is also in demo.
Verify:
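$ cat /proc/self/cgroup
0::/demo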
Yes - this process is in /demo.
Watch the limit bite¶
In the demo shell, allocate memory aggressively:
$ python3 -c '
chunks = []
for i in range(1000):
    chunks.append(b"x" * (1024 * 1024))
    if i % 10 == 0:
        print(i, "MB")
'
0 MB
10 MB
20 MB
...
90 MB
Killed
At ~100 MB, the OOM killer fires. Note that only the process in this cgroup was killed; the rest of your system is fine.
Check what happened:
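Read the cgroup's memory.events counters (illustrative numbers - yours will differ):
$ cat /sys/fs/cgroup/demo/memory.events
low 0
high 0
max 27
oom 1
oom_kill 1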
oom_kill 1 - one process was OOM-killed inside this cgroup. Next time a container dies mysteriously, this counter is what tells you why.
Limit CPU¶
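Cap the cgroup at half of one core:
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max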
The format is "quota period", both in microseconds. 50000 100000 means "50ms of CPU per 100ms wall-clock period." That's 50% of one core, regardless of how many cores you have.
Test:
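In the demo shell, start a busy loop (yes normally pins a full core) and watch it get clipped:
$ yes > /dev/null &
$ top -p $!   # %CPU settles around 50 instead of 100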
To remove the cap, write max 100000 back to cpu.max. Or set cpu.weight (proportional sharing) for a softer policy.
Limit IO¶
$ ls -la /dev/nvme0n1 # find the device number
brw-rw---- 1 root disk 259, 0 May 17 09:00 /dev/nvme0n1
$ echo "259:0 wbps=10485760" | sudo tee /sys/fs/cgroup/demo/io.max
That throttles writes to the named device to 10 MB/s. Test it with direct I/O, which bypasses the page cache so the throttle is visible immediately - and make sure the target file actually lives on that device, not on a tmpfs /tmp. Expect something like:
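$ dd if=/dev/zero of=/tmp/test bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 20.0 s, 10.5 MB/s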
Nested cgroups¶
Make /demo/web and /demo/worker. One catch first: the "no internal processes" rule says a non-root cgroup can't both contain processes and enable controllers for its children, so move your demo shell back out (or exit it) before the subtree_control write below.
$ sudo mkdir /sys/fs/cgroup/demo/web /sys/fs/cgroup/demo/worker
$ echo "+memory +cpu" | sudo tee /sys/fs/cgroup/demo/cgroup.subtree_control
$ echo "50M" | sudo tee /sys/fs/cgroup/demo/web/memory.max
$ echo "30M" | sudo tee /sys/fs/cgroup/demo/worker/memory.max
The limits compose: web is capped at 50 MB, worker at 30 MB, both inside demo's 100 MB ceiling. If web would exceed 50 MB, something in web is killed; if both together would exceed 100 MB, the OOM killer picks a victim anywhere inside demo (by default, the biggest memory consumer).
This is exactly the model container runtimes use. A pod is a cgroup, each container inside the pod is a sub-cgroup, the pod's memory.max is the pod-level limit, and each container's memory.max is its own.
Tear it down¶
To delete a cgroup, first move all its processes out. From the demo shell (already root, so no sudo needed), move yourself back to the root cgroup - or just exit the shell:
$ echo $$ > /sys/fs/cgroup/cgroup.procs
$ sudo rmdir /sys/fs/cgroup/demo/web /sys/fs/cgroup/demo/worker /sys/fs/cgroup/demo
rmdir only succeeds if the cgroup is empty (no procs, no descendants).
The trap¶
On cgroups v1, lowering the memory limit below current usage simply fails - the kernel refuses the write if it can't reclaim enough. On v2, the write succeeds and is enforced immediately: the kernel reclaims, and OOM-kills if reclaim can't bring usage back under the new limit. Always reserve headroom; never tighten limits on a live cgroup unless you understand what's currently in there.
The other trap: cpu.weight is relative and useless on an idle system. If your cgroup is the only one running, it gets 100% of the CPU regardless of its weight. Only meaningful under contention.
Exercise¶
- Recreate the demo above. Confirm the OOM kill triggers at exactly the limit.
- Add memory.high set to 80M (below memory.max = 100M). Re-run the Python allocator. Observe: under memory.high, the kernel throttles allocation but doesn't kill. What does memory.events show?
- Look at how Docker uses this: docker run --memory=100m -d busybox, then cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.max. The match should be exact.
- (Advanced) Read cpu.pressure - the PSI (pressure-stall information) interface. It reports how much time your cgroup spends waiting on CPU. More useful for capacity planning than an instantaneous CPU%.
Related reading¶
- The main Week 10 chapter covers the controller architecture and v1 vs v2 differences.
- Container Internals → capabilities walkthrough is the sibling - container runtimes layer cgroups on top of namespaces on top of capabilities.
- Glossary: Cgroup, Namespace, OOM killer, PSI in the main glossary.