Week 23 - Cgroups v2, Capabilities, Seccomp, OverlayFS
23.1 Conceptual Core
- The remaining isolation layers: cgroups for resource limits, capabilities for privilege restriction, seccomp for syscall filtering, OverlayFS for the rootfs (if not already prepared by umoci).
- Each layer applies at a specific lifecycle moment. Get the order wrong and the container starts but isolation is incomplete.
23.2 Mechanical Detail
- Cgroups v2:
  - Create `/sys/fs/cgroup/<container-id>/`.
  - Write `+memory +cpu +pids` to `cgroup.subtree_control` of the parent.
  - Write `<pid>` to the child cgroup's `cgroup.procs` after fork, before exec.
  - Set limits: `memory.max`, `memory.high`, `cpu.max` (`<quota> <period>`), `pids.max`.
- Capabilities: drop via `cap_set_proc` (libcap), or `prctl(PR_CAPBSET_DROP)` for the bounding set plus `capset(2)` for the effective, permitted, and inheritable sets. Apply after the setup syscalls that need them, before `execve`.
- Seccomp: compile the OCI seccomp profile to a BPF program (the `libseccomp` library does this). Apply with `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)`. Requires `no_new_privs` (`prctl(PR_SET_NO_NEW_PRIVS)`) unless the caller holds `CAP_SYS_ADMIN`.
- OverlayFS: if the bundle has separate lower/upper/work dirs, mount overlay. Otherwise the rootfs is already a single dir; just bind mount it.
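
The cgroup v2 steps above can be sketched as plain file writes. This is a minimal sketch, not a definitive implementation: the function names `cpu_max_value` and `setup_cgroup` are hypothetical, the cgroup root is passed in as a parameter (so the code can be exercised against a scratch directory), and writing the limits before adding the PID is an ordering choice of this sketch so the process never runs unconstrained.

```python
from pathlib import Path
from typing import Dict, Optional


def cpu_max_value(quota_us: Optional[int], period_us: int = 100_000) -> str:
    """Format a cpu.max value as "<quota> <period>"; "max" means no quota."""
    quota = "max" if quota_us is None else str(quota_us)
    return f"{quota} {period_us}"


def setup_cgroup(parent: Path, container_id: str, pid: int,
                 limits: Dict[str, str]) -> Path:
    """Create a child cgroup under `parent`, set limits, then move `pid` in.

    `parent` would be /sys/fs/cgroup (or a delegated subtree) on a real host.
    """
    # Enable the controllers for children of the parent cgroup.
    (parent / "cgroup.subtree_control").write_text("+memory +cpu +pids")

    # On cgroupfs, mkdir is what creates the cgroup and its control files.
    child = parent / container_id
    child.mkdir()

    # Assumption of this sketch: apply limits first, attach the process last.
    for name, value in limits.items():
        (child / name).write_text(value)

    (child / "cgroup.procs").write_text(str(pid))
    return child
```

A call might look like `setup_cgroup(Path("/sys/fs/cgroup"), "demo", child_pid, {"memory.max": "64M", "cpu.max": cpu_max_value(50_000), "pids.max": "128"})`, where `child_pid` is the forked process that has not yet called exec. On a real host this needs write access to the cgroup tree, and a non-root parent cgroup that still contains processes will reject the `cgroup.subtree_control` write.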
23.3 Lab - "All The Layers"
- Implement cgroup v2 setup. Verify `memory.max=64M` actually limits the container.
- Implement capability dropping. Verify with `getcap`/`capsh --print` inside the container.
- Implement seccomp filter loading. Verify a denied syscall fails.
- (Optional) Implement OverlayFS rootfs construction from a multi-layer image.
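
For the capability and seccomp checks, one option that needs no extra tools inside the container is to read `/proc/<pid>/status`, which exposes the capability masks (`CapEff`, `CapBnd`), `NoNewPrivs`, and the `Seccomp` mode. The helper below is a sketch under assumptions: the function names are hypothetical, and `CAP_NAMES` lists only a handful of capability numbers for illustration, so unknown bits are printed as `cap_<n>`.

```python
from typing import Dict, List

# Partial table of capability bit numbers (illustrative subset only).
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    5: "CAP_KILL",
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
}

# Values of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_status(text: str) -> Dict[str, str]:
    """Split "Key:<tab>value" lines of a status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def decode_caps(mask_hex: str) -> List[str]:
    """Turn a hex capability mask into a list of capability names."""
    mask = int(mask_hex, 16)
    return [CAP_NAMES.get(bit, f"cap_{bit}")
            for bit in range(64) if (mask >> bit) & 1]


def isolation_report(status_text: str) -> dict:
    """Summarize the capability and seccomp state of one process."""
    f = parse_status(status_text)
    return {
        "effective": decode_caps(f["CapEff"]),
        "bounding": decode_caps(f["CapBnd"]),
        "no_new_privs": f.get("NoNewPrivs") == "1",
        "seccomp": SECCOMP_MODES.get(int(f.get("Seccomp", "0")), "unknown"),
    }
```

Inside the container, `isolation_report(open("/proc/self/status").read())` should show only the capabilities the config kept, `no_new_privs` set, and `seccomp` reported as `filter` once the filter is loaded.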
23.4 Hardening Drill
- Default seccomp profile from `containers/common/pkg/seccomp`: port it to your runtime as the default.
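
When porting a profile, it can help to ask "what would this profile do with syscall X?" before compiling anything to BPF. The sketch below assumes the profile has already been reduced to the OCI runtime-spec shape (`defaultAction` plus a `syscalls` list of `names`/`action`/optional `args`). The function names are hypothetical, and any extra per-rule conditions an upstream profile may carry (for example on capabilities or architectures) are ignored here.

```python
from typing import Dict, Set


def summarize_profile(profile: dict) -> dict:
    """Index an OCI-style seccomp profile by syscall name."""
    unconditional: Dict[str, str] = {}
    conditional: Set[str] = set()
    for rule in profile.get("syscalls", []):
        for name in rule["names"]:
            if rule.get("args"):
                # The rule only fires for matching arguments; track it apart.
                conditional.add(name)
            else:
                unconditional[name] = rule["action"]
    return {
        "default": profile["defaultAction"],
        "unconditional": unconditional,
        "conditional": conditional,
    }


def action_for(summary: dict, syscall: str) -> str:
    """Action applied when no argument-conditional rule matches."""
    return summary["unconditional"].get(syscall, summary["default"])
```

For a profile whose `defaultAction` is `SCMP_ACT_ERRNO` and that allows only `read` and `write`, `action_for(summary, "mount")` returns `SCMP_ACT_ERRNO`, which gives a concrete syscall to use when checking that a denied call really fails.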
23.5 Production Readiness Slice
- Add chaos tests: deliberately malformed configs, runaway memory in the container, seccomp-blocked syscalls. Verify the runtime's error paths are clean.
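
For the runaway-memory case, a chaos test can confirm that the limit was enforced by comparing the container cgroup's `memory.events` file before and after the run: its `oom_kill` counter increments each time a process in the cgroup is killed by the OOM killer. A minimal sketch, with hypothetical helper names:

```python
from typing import Dict


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse the flat "key value" lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def oom_killed(before: str, after: str) -> bool:
    """True if the oom_kill counter grew between the two snapshots."""
    old = parse_memory_events(before).get("oom_kill", 0)
    new = parse_memory_events(after).get("oom_kill", 0)
    return new > old
```

A test would snapshot `memory.events`, start a workload that allocates past `memory.max`, wait for it to exit, snapshot again, and assert `oom_killed(before, after)` alongside a clean exit status from the runtime itself.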