
Week 23 - Cgroups v2, Capabilities, Seccomp, OverlayFS

23.1 Conceptual Core

  • The remaining isolation layers: cgroups for resource limits, capabilities for privilege restriction, seccomp for syscall filtering, OverlayFS for the rootfs (if not already prepared by umoci).
  • Each layer applies at a specific lifecycle moment. Get the order wrong and the container starts but isolation is incomplete.
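One way to keep the lifecycle order honest is to write the constraints down as data and check a planned sequence against them. The sketch below encodes only the orderings this week's notes state. The step names (fork, join_cgroup, privileged_setup, and so on) are hypothetical labels, not names from any runtime.

```python
# Ordering constraints taken from the notes; the step names are hypothetical.
CONSTRAINTS = [
    ("fork", "join_cgroup"),                    # cgroup.procs is written after fork
    ("join_cgroup", "execve"),                  # ... and before exec
    ("privileged_setup", "drop_capabilities"),  # setup syscalls still need caps
    ("drop_capabilities", "execve"),
    ("set_no_new_privs", "load_seccomp"),       # filter load relies on no_new_privs
    ("load_seccomp", "execve"),
]


def order_violations(steps):
    """Return every (earlier, later) constraint that `steps` breaks."""
    position = {name: index for index, name in enumerate(steps)}
    broken = []
    for earlier, later in CONSTRAINTS:
        # A step that is missing entirely also counts as a violation.
        missing = earlier not in position or later not in position
        if missing or position[earlier] >= position[later]:
            broken.append((earlier, later))
    return broken


# One sequence that satisfies all of the constraints above.
GOOD_ORDER = [
    "fork", "join_cgroup", "privileged_setup", "set_no_new_privs",
    "drop_capabilities", "load_seccomp", "execve",
]
```

Calling order_violations(GOOD_ORDER) returns an empty list. Moving load_seccomp ahead of set_no_new_privs returns that one pair.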

23.2 Mechanical Detail

  • Cgroups v2:
      • Create /sys/fs/cgroup/<container-id>/.
      • Write +memory +cpu +pids to the parent's cgroup.subtree_control so the child cgroup exposes the memory.*, cpu.*, and pids.* files.
      • Write <pid> to the child cgroup's cgroup.procs after fork, before exec.
      • Set limits: memory.max, memory.high, cpu.max (<quota> <period>), pids.max.
  • Capabilities: drop from the bounding set with prctl(PR_CAPBSET_DROP), and set the effective, permitted, and inheritable sets with capset(2) or libcap's cap_set_proc wrapper. Apply after the setup syscalls that still need those capabilities, and before execve.
  • Seccomp: compile the OCI seccomp profile to a BPF program (libseccomp does this). Apply it with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...). Loading a filter requires no_new_privs (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) unless the caller holds CAP_SYS_ADMIN.
  • OverlayFS: if the bundle has separate lower/upper/work dirs, mount an overlay. Otherwise the rootfs is already a single directory, so just bind-mount it.
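The cgroup v2 steps above are plain file writes, so they can be sketched in a few lines. This is a minimal sketch, not a full implementation. The function name setup_cgroup and its parameters are made up for illustration, and the parent directory is a parameter so the code can be pointed at a scratch directory instead of /sys/fs/cgroup.

```python
from pathlib import Path


def setup_cgroup(parent, container_id, pid, *, memory_max=None,
                 memory_high=None, cpu_max=None, pids_max=None):
    """Create <parent>/<container_id>, set limits, then move `pid` into it.

    `parent` is a cgroup v2 directory such as /sys/fs/cgroup.
    `cpu_max` is a (quota, period) pair in microseconds.
    """
    parent = Path(parent)
    # Enable the controllers for children of `parent`; without this the
    # child cgroup has no memory.*, cpu.* or pids.* files.
    (parent / "cgroup.subtree_control").write_text("+memory +cpu +pids")

    child = parent / container_id
    child.mkdir(exist_ok=True)

    limits = {
        "memory.max": memory_max,
        "memory.high": memory_high,
        # cpu.max takes "<quota> <period>".
        "cpu.max": None if cpu_max is None else f"{cpu_max[0]} {cpu_max[1]}",
        "pids.max": pids_max,
    }
    for filename, value in limits.items():
        if value is not None:
            (child / filename).write_text(str(value))

    # Sketch-level choice: limits are written first so they are already in
    # force when the process joins the cgroup.
    (child / "cgroup.procs").write_text(str(pid))
    return child
```

On a real host this needs write access to the cgroup tree, and the kernel rejects enabling controllers on a non-root parent that still has processes of its own. A call such as setup_cgroup("/sys/fs/cgroup", "demo", child_pid, memory_max=64 * 1024 * 1024) corresponds to the 64M limit used in the lab.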

23.3 Lab - "All The Layers"

  1. Implement cgroup v2 setup. Verify memory.max=64M actually limits the container.
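  2. Implement capability dropping. Verify with capsh --print (or the Cap* lines in /proc/self/status) inside the container. getcap reports file capabilities, so it does not show what the process holds.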
  3. Implement seccomp filter loading. Verify a denied syscall fails.
  4. (Optional) Implement OverlayFS rootfs construction from a multi-layer image.
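For the capability check in step 2, the Cap* lines in /proc/<pid>/status can be decoded without capsh. The helper below is a sketch for lab verification only. The function names are made up, and the table lists capability numbers 0 through 40 as defined in linux/capability.h.

```python
# Capability names indexed by bit number (linux/capability.h, 0..40).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]

CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def parse_cap_sets(status_text):
    """Extract the Cap* hex masks from the text of /proc/<pid>/status."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CAP_FIELDS:
            sets[key] = int(value.strip(), 16)
    return sets


def cap_names(mask):
    """Translate a capability bitmask into a list of capability names."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            known = bit < len(CAP_NAMES)
            names.append(CAP_NAMES[bit] if known else f"cap_{bit}")
        bit += 1
    return names
```

Inside the container, cap_names(parse_cap_sets(open("/proc/self/status").read())["CapEff"]) lists the effective set. Checking CapBnd as well confirms that the bounding set was reduced, not just the effective set.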

23.4 Hardening Drill

  • Take the default seccomp profile from containers/common/pkg/seccomp and port it to your runtime as the default.
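Porting a profile ends with a BPF program handed to the kernel. libseccomp normally generates it, but assembling a tiny one by hand shows what the output looks like. The sketch below assumes a profile shaped as "these syscalls get one action, everything else gets a default", on x86_64 only. It ignores argument conditions, multiple architectures, and the x32 ABI, all of which a real profile needs. It only builds the instruction bytes and does not load them, and build_filter and decode are illustrative names.

```python
import errno
import struct

# Constants from <linux/filter.h>, <linux/seccomp.h> and <linux/audit.h>.
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06      # BPF_RET | BPF_K
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E
OFFSET_NR, OFFSET_ARCH = 0, 4  # field offsets in struct seccomp_data


def insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k)


def build_filter(syscall_nrs, match_action, default_action,
                 arch=AUDIT_ARCH_X86_64):
    """Listed syscall numbers get match_action, all others default_action."""
    nrs = list(syscall_nrs)
    if len(nrs) > 255:
        raise ValueError("jump offsets are 8-bit; this sketch stops at 255")
    prog = [
        insn(BPF_LD_W_ABS, 0, 0, OFFSET_ARCH),
        insn(BPF_JEQ_K, 1, 0, arch),  # expected arch: skip the kill
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        insn(BPF_LD_W_ABS, 0, 0, OFFSET_NR),
    ]
    for i, nr in enumerate(nrs):
        # On a match, jump over the remaining compares and the default return.
        prog.append(insn(BPF_JEQ_K, len(nrs) - i, 0, nr))
    prog.append(insn(BPF_RET_K, 0, 0, default_action))
    prog.append(insn(BPF_RET_K, 0, 0, match_action))
    return b"".join(prog)


def decode(prog):
    """Unpack filter bytes back into (code, jt, jf, k) tuples for inspection."""
    return list(struct.iter_unpack("=HBBI", prog))
```

For example, build_filter([0, 1, 60], SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO | errno.EPERM) allows read, write, and exit (their x86_64 numbers) and makes every other syscall fail with EPERM. The resulting bytes are what a struct sock_fprog would point at when the filter is installed.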

23.5 Production Readiness Slice

  • Add chaos tests: deliberately malformed configs, runaway memory in the container, seccomp-blocked syscalls. Verify the runtime's error paths are clean.
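For the malformed-config case, a clean error path usually means rejecting bad input before any namespace or cgroup is touched. The sketch below validates a few resource fields from config.json and returns readable messages instead of raising. The field paths follow the OCI runtime spec's linux.resources layout. The accepted ranges and the function name validate_config are this sketch's own assumptions, not the spec's rules.

```python
import json


def _is_int(value):
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# (group, field, range check). The ranges are this sketch's policy.
CHECKS = [
    ("memory", "limit", lambda v: v > 0 or v == -1),
    ("cpu", "period", lambda v: v > 0),
    ("pids", "limit", lambda v: v > 0 or v == -1),
]


def validate_config(text):
    """Return a list of problems found in config.json text; empty means OK."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top level must be an object"]
    linux = config.get("linux", {})
    if not isinstance(linux, dict):
        return ["linux must be an object"]
    resources = linux.get("resources", {})
    if not isinstance(resources, dict):
        return ["linux.resources must be an object"]

    errors = []
    for group, field, in_range in CHECKS:
        section = resources.get(group, {})
        path = f"linux.resources.{group}"
        if not isinstance(section, dict):
            errors.append(f"{path} must be an object")
        elif field in section:
            value = section[field]
            if not _is_int(value) or not in_range(value):
                errors.append(f"{path}.{field}: invalid value {value!r}")
    return errors
```

A chaos test can then feed truncated JSON, wrong types, and out-of-range numbers through validate_config and assert that each case produces a message and never an unhandled exception.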
