Week 14 - Capabilities for Containers

14.1 Conceptual Core

Linux capabilities split root's privilege into about 40 named units (CAP_NET_ADMIN, CAP_SYS_PTRACE, etc.). Container runtimes apply a bounding set before exec - the container's processes can never gain a capability outside this set, even when they run as UID 0.

The default Docker bounding set is 14 caps: CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE. Each gates a specific class of privileged operations.

The discipline: most workloads need zero capabilities. Drop everything (--cap-drop=ALL); add back only what testing proves you need.
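The drop-then-add discipline can be pictured as plain set arithmetic. The sketch below is a simplified model, not Docker's actual code: the function name resolve_caps is hypothetical, and it assumes drops are applied to the default set before adds, with "ALL" in the drop list emptying the set first.

```python
# The 14 default capabilities listed above.
DOCKER_DEFAULT = frozenset({
    "CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "MKNOD", "NET_RAW",
    "SETGID", "SETUID", "SETFCAP", "SETPCAP", "NET_BIND_SERVICE",
    "SYS_CHROOT", "KILL", "AUDIT_WRITE",
})


def resolve_caps(drop=(), add=(), default=DOCKER_DEFAULT):
    """Simplified model: apply drops to the default set, then apply adds.

    Assumption: "ALL" in `drop` empties the set before adds are applied.
    "ALL" in `add` is deliberately not modelled in this sketch.
    """
    drop = {cap.upper() for cap in drop}
    add = {cap.upper() for cap in add}
    if "ALL" in add:
        raise ValueError('"ALL" in add is not modelled in this sketch')
    base = set() if "ALL" in drop else set(default) - drop
    return base | add


if __name__ == "__main__":
    # Zero-capability baseline, then the baseline plus one justified cap.
    print(sorted(resolve_caps(drop=["ALL"])))
    print(sorted(resolve_caps(drop=["ALL"], add=["NET_BIND_SERVICE"])))
```

With no arguments the function returns the 14 defaults; with drop=["ALL"] it returns only what was explicitly added back, which is the state the rest of this week aims for.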

14.2 Mechanical Detail

The capabilities you'll meet most often, with what they unlock:

CAP_NET_BIND_SERVICE - bind to ports < 1024. Common for legacy services; modern apps run on 8080+ and skip this.
CAP_NET_ADMIN - configure interfaces, iptables, routing. Network plugins (CNI) and service meshes need it. App containers should not have it.
CAP_NET_RAW - open raw / packet sockets (ICMP, ping). Often dropped: most services don't need to ping anything from inside.
CAP_SYS_ADMIN - the kitchen sink. ~40 different operations gated by it. Avoid at all costs; equivalent to root in many threat models.
CAP_SYS_PTRACE - attach to other processes (debuggers, strace). Confine to debug containers only.
CAP_DAC_OVERRIDE - bypass file permission checks. Almost always a sign of bad file ownership; fix the ownership instead.

Read what a running container actually has, where <pid> is the host PID of a process inside the container: capsh --decode=$(grep CapEff /proc/<pid>/status | awk '{print $2}')
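When capsh is not installed, the same decoding can be sketched in a few lines of Python. The bit numbers follow the kernel's capability numbering in linux/capability.h; the function names decode_caps and read_cap_eff are hypothetical helpers, not part of any library.

```python
# Capability names indexed by bit number (0 = CHOWN ... 40 = CHECKPOINT_RESTORE).
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID",
    "KILL", "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE",
    "NET_BIND_SERVICE", "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK",
    "IPC_OWNER", "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE",
    "SYS_PACCT", "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE",
    "SYS_TIME", "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE",
    "AUDIT_CONTROL", "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG",
    "WAKE_ALARM", "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF",
    "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask):
    """Return the names of the capabilities set in a CapEff-style hex mask."""
    mask = int(hex_mask, 16)
    names = [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]
    # Bits beyond the table are reported by number rather than guessed at.
    bit = len(CAP_NAMES)
    rest = mask >> bit
    while rest:
        if rest & 1:
            names.append(f"UNKNOWN_{bit}")
        rest >>= 1
        bit += 1
    return names


def read_cap_eff(pid="self"):
    """Return the CapEff hex string from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise LookupError("CapEff line not found")


if __name__ == "__main__":
    # 00000000a80425fb is the mask that corresponds to the 14 Docker defaults.
    print(decode_caps("00000000a80425fb"))
```

On a Linux host, decode_caps(read_cap_eff(pid)) gives the same information as the capsh one-liner for a given host PID.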

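Apply in Kubernetes via each container's securityContext in the pod spec (capabilities cannot be set in the pod-level securityContext):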

securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]    # only if you actually need it

The trap

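Granting CAP_SYS_ADMIN because "the container needs to mount something." mount(2) really does require CAP_SYS_ADMIN, and no narrower capability substitutes for it (CAP_SYS_CHROOT only permits chroot(2)). In most cases the actual need is met by having the runtime do the mounting: declare a volume, bind mount, or tmpfs in the container spec so the process never calls mount itself. SYS_ADMIN opens ~40 unrelated operations and is the single most-abused capability in misconfigured containers.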

14.3 Lab - "Capability Diet"

For three services (e.g., a Go HTTP server, an Nginx reverse proxy, a Node.js app), run each with --cap-drop=ALL. Identify what fails: the error is usually EPERM ("Operation not permitted") on a specific operation such as bind, chown, or setuid - map that operation back to the capability via capabilities(7).
Add back capabilities one at a time. Document the minimum set per service.
Configure your container runtime (podman, containerd) or your cluster's admission policy (for example Pod Security Admission, which replaced the removed PodSecurityPolicy API) to apply this minimum by default.
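The first two lab steps can be scripted. The sketch below only builds the command lines and leaves running them to you; the helper names run_command and diet_plan and the image name nginx:alpine are placeholders, and it assumes an engine that accepts --cap-drop and --cap-add on its run subcommand (docker and podman both do).

```python
import shlex


def run_command(image, add=(), engine="docker"):
    """Build a run command that drops ALL caps and adds back only `add`."""
    cmd = [engine, "run", "--rm", "--cap-drop=ALL"]
    cmd += [f"--cap-add={cap}" for cap in sorted(add)]
    cmd.append(image)
    return cmd


def diet_plan(image, candidates, engine="docker"):
    """Return a zero-cap baseline run, then one run per candidate capability."""
    plan = [run_command(image, engine=engine)]
    plan += [run_command(image, add=[cap], engine=engine) for cap in candidates]
    return plan


if __name__ == "__main__":
    # Hypothetical image and candidate list; substitute your own services.
    for cmd in diet_plan("nginx:alpine", ["CHOWN", "SETGID", "SETUID"]):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell or passed to subprocess.run. Record which runs start cleanly, then combine the capabilities that proved necessary into the per-service minimum.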

14.4 Hardening Drill

For any service requiring more than 3 caps, write a one-paragraph justification for each. If you can't justify a cap, drop it. Common offenders worth re-auditing: anything inheriting from an old base image, anything that runs as root inside the container.

14.5 Production Readiness Slice

Add a CI step that fails any new workload manifest whose requested capability set exceeds the team's allowlist (images do not carry a capability set of their own; the run configuration does). Trivy, Kyverno, OPA, and Pod Security Standards (Kubernetes restricted profile) can all express this check. The right gate is PR-time, not runtime - runtime is too late.
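To make the shape of such a gate concrete, here is a minimal sketch of the check itself. It is not how Trivy, Kyverno, or OPA implement it. It assumes the pod spec has already been parsed into a Python dict, and the ALLOWLIST contents and the function name capability_violations are illustrative only.

```python
ALLOWLIST = frozenset({"NET_BIND_SERVICE"})  # hypothetical team allowlist


def normalize(cap):
    """Upper-case a capability name and strip an optional CAP_ prefix."""
    cap = cap.upper()
    return cap[4:] if cap.startswith("CAP_") else cap


def capability_violations(pod_spec, allowlist=ALLOWLIST):
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    containers = list(pod_spec.get("containers") or [])
    containers += list(pod_spec.get("initContainers") or [])
    for container in containers:
        name = container.get("name", "?")
        caps = (container.get("securityContext") or {}).get("capabilities") or {}
        drop = {normalize(c) for c in caps.get("drop") or []}
        add = {normalize(c) for c in caps.get("add") or []}
        if "ALL" not in drop:
            problems.append(f"{name}: does not drop ALL")
        extra = sorted(add - set(allowlist))
        if extra:
            problems.append(f"{name}: adds caps outside allowlist: {', '.join(extra)}")
    return problems


if __name__ == "__main__":
    spec = {"containers": [{
        "name": "web",
        "securityContext": {"capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]}},
    }]}
    print(capability_violations(spec))
```

A CI wrapper would load each changed manifest, call capability_violations on its pod spec, and exit non-zero when the returned list is non-empty, so the pull request fails before anything reaches a cluster.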