Week 22 - Namespaces and Process Isolation
22.1 Conceptual Core
- The runtime needs to `clone` (or `fork` + `unshare`) into the configured namespaces, set up UID/GID maps for user namespaces, configure the UTS hostname, and `pivot_root` into the rootfs.
- The classic two-process pattern: the parent forks a child with `CLONE_NEWPID | CLONE_NEWNS | ...`; the parent writes UID/GID maps for the child; the child waits on a pipe for the parent to finish setup; the child performs final setup (mount `/proc`, `pivot_root`); the child execs.
22.2 Mechanical Detail
- In Go, `golang.org/x/sys/unix.Clone` does not exist directly; use `syscall.SysProcAttr{Cloneflags: ...}` on `exec.Cmd`, or use the lower-level `syscall.Syscall(SYS_CLONE, ...)`. The raw syscall is hard to use safely because the Go runtime is multithreaded, so prefer the `exec.Cmd` route.
- The `runc` "init" pattern: a self-re-exec into the binary with a sentinel argument signaling "I'm the container init." The first invocation does the setup; the re-exec performs `pivot_root` and the final `execve`. Read `runc/libcontainer/standard_init_linux.go`.
- UID/GID mapping: write to `/proc/<child-pid>/uid_map` and `/proc/<child-pid>/gid_map`. For non-root parents, first write `deny` to `/proc/<child-pid>/setgroups`; otherwise the kernel rejects the `gid_map` write.
- `pivot_root` fails if the parent mount of the new root has shared propagation, so mark it private with `mount(MS_PRIVATE)` before the call. This also avoids leaking mounts back to the host.
22.3 Lab - "Namespaces Working"
- Implement the parent/child fork-with-clone-flags. Verify `lsns -p <pid>` shows new namespaces.
- Implement `pivot_root` into the rootfs. Verify `/` inside the container is the bundle's `rootfs/`.
- Implement the `/proc` mount inside the new PID namespace. Verify `ps` shows only the container's processes.
- Implement UID/GID mapping for user-namespaced runs.
22.4 Hardening Drill
- Mask `/proc/kcore`, `/proc/keys`, `/proc/timer_list`, `/proc/sched_debug`, `/proc/scsi`, `/sys/firmware`. Make `/proc/asound`, `/proc/bus`, `/proc/fs`, `/proc/irq`, `/proc/sys`, `/proc/sysrq-trigger` read-only. (Same as runtime-spec's `maskedPaths` and `readonlyPaths`.)
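One common way to implement the masking above is to bind-mount `/dev/null` over files, mount a read-only `tmpfs` over directories, and bind-remount the read-only paths with `MS_RDONLY`. The sketch below follows that approach; the helper names `maskPath` and `readonlyPath` and the `apply` argument are hypothetical, and the code is an illustration rather than a copy of any runtime's implementation. It must run inside the container's mount namespace after `/proc` is mounted.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// Path lists taken from the drill above.
var maskedPaths = []string{
	"/proc/kcore", "/proc/keys", "/proc/timer_list",
	"/proc/sched_debug", "/proc/scsi", "/sys/firmware",
}

var readonlyPaths = []string{
	"/proc/asound", "/proc/bus", "/proc/fs",
	"/proc/irq", "/proc/sys", "/proc/sysrq-trigger",
}

// maskPath hides path: directories get an empty read-only tmpfs, files get
// /dev/null bound over them. Paths that do not exist are skipped.
func maskPath(path string) error {
	fi, err := os.Lstat(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	if err != nil {
		return err
	}
	if fi.IsDir() {
		return syscall.Mount("tmpfs", path, "tmpfs", syscall.MS_RDONLY, "")
	}
	return syscall.Mount("/dev/null", path, "", syscall.MS_BIND, "")
}

// readonlyPath bind-mounts path onto itself and then remounts that bind
// mount read-only. Paths that do not exist are skipped.
func readonlyPath(path string) error {
	_, err := os.Lstat(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	if err != nil {
		return err
	}
	if err := syscall.Mount(path, path, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return err
	}
	flags := uintptr(syscall.MS_BIND | syscall.MS_REMOUNT | syscall.MS_RDONLY | syscall.MS_REC)
	return syscall.Mount(path, path, "", flags, "")
}

func main() {
	// Without the explicit "apply" argument, only print the plan so the
	// program is safe to run on a host.
	apply := len(os.Args) == 2 && os.Args[1] == "apply"
	for _, p := range maskedPaths {
		fmt.Println("mask     ", p)
		if apply {
			if err := maskPath(p); err != nil {
				fmt.Fprintln(os.Stderr, "mask", p+":", err)
				os.Exit(1)
			}
		}
	}
	for _, p := range readonlyPaths {
		fmt.Println("read-only", p)
		if apply {
			if err := readonlyPath(p); err != nil {
				fmt.Fprintln(os.Stderr, "readonly", p+":", err)
				os.Exit(1)
			}
		}
	}
}
```

Skipping missing paths keeps the hardening step portable across kernels that do not expose every entry, such as `/proc/sched_debug` on newer kernels.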
22.5 Production Readiness Slice
- Run `runc`'s integration tests against your runtime if feasible (they're spec-compliance tests). At minimum, run a representative subset of the OCI runtime test suite.