
Worked example - Week 10: cgroups v2 on your laptop, end to end

Companion to Linux Kernel → Month 03 → Week 10: Cgroups v2. The week explains the unified hierarchy and the controllers. This page is a hands-on tour: create a cgroup, put a process in it, limit it, watch the limit bite.

You need Linux 5.x+ with cgroups v2 enabled (default since ~2020 on most distros). Verify:

$ stat -fc %T /sys/fs/cgroup
cgroup2fs

If you see tmpfs instead, you're on cgroups v1 or hybrid mode - boot with systemd.unified_cgroup_hierarchy=1 on the kernel command line and reboot, or use a recent Fedora/Arch/Ubuntu 22.04+ where v2 is the default.
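
Another quick check, if you prefer the mount table (mount options vary by distro):

$ mount -t cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)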

The cgroup filesystem

cgroups v2 is just a filesystem. Every directory under /sys/fs/cgroup/ is a cgroup; child directories are nested cgroups; the files inside control what the cgroup does.

$ ls /sys/fs/cgroup/
cgroup.controllers  cgroup.procs        cpu.stat           memory.current  ...
cgroup.max.depth    cgroup.subtree_control  io.stat        memory.max      ...
  • cgroup.controllers - what controllers are available to descendants.
  • cgroup.subtree_control - what controllers are enabled for descendants.
  • cgroup.procs - PIDs currently in this cgroup (on a bare hierarchy everything starts in the root; under systemd most processes already live in its slices).
  • memory.max, cpu.max, io.max - controller-specific knobs.
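
For instance, on the root cgroup (output varies by kernel and distro):

$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids
$ cat /sys/fs/cgroup/cgroup.max.depth
max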

Create a cgroup

$ sudo mkdir /sys/fs/cgroup/demo
$ ls /sys/fs/cgroup/demo
cgroup.controllers  cgroup.events  cgroup.procs  cgroup.stat  cgroup.type  ...

The kernel populated the new directory with the standard files. Some are read-only (cgroup.controllers), some are writable knobs.

But notice: there are no memory.max or cpu.max files yet. Controllers must be enabled by the parent before they appear in a child (on a systemd-managed system the root's cgroup.subtree_control is often already populated; if yours is empty, enable them yourself):

$ cat /sys/fs/cgroup/cgroup.subtree_control
(empty)
$ echo "+memory +cpu +io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
$ ls /sys/fs/cgroup/demo | grep -E '^(memory|cpu|io)'
cpu.idle
cpu.max
cpu.stat
io.max
io.stat
memory.current
memory.events
memory.max
memory.peak
memory.stat
...

Now the controllers are available in /sys/fs/cgroup/demo/.

Limit memory

Cap the cgroup at 100 MB:

$ echo $((100 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/demo/memory.max
104857600

Or use the M / G suffix syntax (kernel parses it):

$ echo "100M" | sudo tee /sys/fs/cgroup/demo/memory.max

That's the hard limit. If reclaim can't keep the cgroup's usage under it, the kernel OOM-kills a process inside the cgroup.

Put a process in the cgroup

To move a process in, write its PID to cgroup.procs:

$ sudo bash -c 'echo $$ > /sys/fs/cgroup/demo/cgroup.procs && exec bash'

The trick: $$ is the PID of the bash -c shell; writing it to cgroup.procs moves that shell into demo, and exec bash then replaces it in place - same PID, so it stays in the cgroup. From this new shell, every child process is also in demo.

Verify:

$ cat /proc/self/cgroup
0::/demo

Yes - this process is in /demo.
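
The same mechanism works for a process that is already running - just write its PID. A quick sketch (sleep is only a stand-in):

$ sleep 300 &
$ echo $! | sudo tee /sys/fs/cgroup/demo/cgroup.procs
$ cat /proc/$!/cgroup
0::/demo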

Watch the limit bite

In the demo shell, allocate memory aggressively:

$ python3 -c '
chunks = []
for i in range(1000):
    chunks.append(b"x" * (1024 * 1024))
    if i % 10 == 0:
        print(i, "MB")
'
0 MB
10 MB
20 MB
...
90 MB
Killed

At ~100 MB, the OOM killer fires. Note that only the process in this cgroup was killed; the rest of your system is fine.

Check what happened:

$ cat /sys/fs/cgroup/demo/memory.events
low 0
high 0
max 0
oom 1
oom_kill 1

oom_kill 1 - one process was OOM-killed inside this cgroup. Next time a container dies mysteriously, this is the file that tells you why.
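
The kernel also logs the kill to the ring buffer; a hedged check (exact wording varies by kernel version):

$ sudo dmesg | grep -i "memory cgroup out of memory"
# expect a line like: Memory cgroup out of memory: Killed process ... (python3)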

Limit CPU

$ echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max

The format is "quota period." 50000 100000 means "50ms of CPU per 100ms wall-clock period." That's 50% of one core, regardless of how many cores you have.

Test it from the demo shell, so yes runs inside the cgroup:

$ yes > /dev/null &
$ top -p $!
# CPU% should hover around 50.0
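
The same two numbers express any fraction, including more than one core. A couple of hedged examples (both values are microseconds):

$ echo "25000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max    # 25% of one core
$ echo "200000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max   # up to two full cores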

You can also write "max 100000" (or just "max") to cpu.max to remove the cap, or set cpu.weight (proportional sharing) for a softer policy - both sketched below.
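
A sketch of both knobs (cpu.weight defaults to 100 and only matters relative to sibling cgroups):

$ echo max | sudo tee /sys/fs/cgroup/demo/cpu.max     # drop the quota, keep the period
$ echo 50 | sudo tee /sys/fs/cgroup/demo/cpu.weight   # half the default share under contention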

Limit IO

$ ls -la /dev/nvme0n1   # find the device number
brw-rw---- 1 root disk 259, 0 May 17 09:00 /dev/nvme0n1

$ echo "259:0 wbps=10485760" | sudo tee /sys/fs/cgroup/demo/io.max

That throttles writes to the named device to 10 MB/s (wbps = write bytes per second). Test it with a large write from the demo shell and watch the throughput cap, as sketched below.
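
A concrete test, assuming /tmp actually lives on the throttled device (on many distros /tmp is a tmpfs - pick a path on the real disk if so). oflag=direct bypasses the page cache so the throttle shows up immediately:

$ dd if=/dev/zero of=/tmp/test bs=1M count=200 oflag=direct
# at 10 MB/s, 200 MB should take roughly 20 seconds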

Nested cgroups

Make /demo/web and /demo/worker. One catch: a non-root cgroup can only enable controllers for its children if it has no processes of its own (the "no internal processes" rule), so move your demo shell back to the root cgroup - or exit it - before running this:

$ sudo mkdir /sys/fs/cgroup/demo/web /sys/fs/cgroup/demo/worker
$ echo "+memory +cpu" | sudo tee /sys/fs/cgroup/demo/cgroup.subtree_control
$ echo "50M" | sudo tee /sys/fs/cgroup/demo/web/memory.max
$ echo "30M" | sudo tee /sys/fs/cgroup/demo/worker/memory.max

The limits compose: web is capped at 50 MB, worker at 30 MB, both inside demo's 100 MB ceiling. If web would exceed 50 MB, something inside web is killed; if both together would push demo past 100 MB, the OOM killer picks a victim inside demo using its usual heuristics (typically the largest consumer).

This is exactly the model container runtimes use. A pod is a cgroup, each container inside the pod is a sub-cgroup, the pod's memory.max is the pod's resource limit, the container's is each container's limit.
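
If you'd rather not manage the directories yourself, systemd drives the same files. A hedged sketch (systemd-run and the MemoryMax=/CPUQuota= properties are standard systemd, but check your version):

$ sudo systemd-run --scope -p MemoryMax=100M -p CPUQuota=50% bash
# the shell lands in a transient .scope cgroup with memory.max and cpu.max
# already set; cat /proc/self/cgroup from inside it shows the exact path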

Tear it down

To delete a cgroup, first move all its processes out. From the demo shell (the one you exec'ed into the cgroup earlier), move it back to the root cgroup - or simply exit it - then remove the directories:

$ echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs   # run from the demo shell
$ sudo rmdir /sys/fs/cgroup/demo/web /sys/fs/cgroup/demo/worker /sys/fs/cgroup/demo

rmdir only succeeds if the cgroup is empty (no procs, no descendants).
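
A quick way to confirm a cgroup is actually removable:

$ cat /sys/fs/cgroup/demo/cgroup.procs           # should print nothing
$ find /sys/fs/cgroup/demo -mindepth 1 -type d   # should print nothing (no child cgroups)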

The trap

Setting memory.max below the cgroup's current usage behaves very differently across versions. On cgroups v1 the write simply fails if the kernel can't reclaim enough memory; on v2 the write succeeds and the kernel immediately reclaims - and OOM-kills if it has to - to bring usage back under the new limit. Always reserve headroom; never tighten limits on a live cgroup unless you know what's running in it.
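
A defensive pattern before tightening a live limit (a sketch only - usage can change between the read and the write):

$ cat /sys/fs/cgroup/demo/memory.current   # current usage in bytes
# only lower memory.max if it will stay comfortably above this number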

The other trap: cpu.weight is relative and useless on an idle system. If your cgroup is the only one running, it gets 100% of the CPU regardless of its weight. Only meaningful under contention.
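
To watch weights actually do something, you need two cgroups fighting over the same CPU. A sketch, assuming the demo hierarchy still exists with +cpu enabled in its cgroup.subtree_control:

$ sudo mkdir /sys/fs/cgroup/demo/a /sys/fs/cgroup/demo/b
$ echo 200 | sudo tee /sys/fs/cgroup/demo/a/cpu.weight
$ echo 100 | sudo tee /sys/fs/cgroup/demo/b/cpu.weight
# run taskset -c 0 yes > /dev/null from a shell inside each cgroup;
# under contention, a should get roughly twice the CPU time of b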

Exercise

  1. Recreate the demo above. Confirm the OOM kill fires when memory.current hits the limit (the Python counter will stop a little short of 100 because of interpreter overhead).
  2. Add memory.high set to 80M (below memory.max=100M). Re-run the Python allocator. Observe: under memory.high, the kernel throttles allocation but doesn't kill. What does memory.events show?
  3. Look at how Docker uses this. docker run --memory=100m -d busybox. Then cat /sys/fs/cgroup/system.slice/docker-*.scope/memory.max. The match should be exact.
  4. (Advanced) Read cpu.pressure - the PSI (pressure-stall information) interface. It reports how much your cgroup is waiting on CPU. More useful for capacity planning than instantaneous CPU%.
