Worked example - Week 14: Linux capabilities in a real Dockerfile

Companion to Container Internals → Month 04 → Week 14: Capabilities. The week explains the Linux capability model and the difference between root inside a container and fully privileged root on the host. This page walks one Dockerfile through dropping capabilities one at a time so you can see which knobs do what.

Start from a normal container

# v0 - the default
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl iproute2 libcap2-bin && rm -rf /var/lib/apt/lists/*
CMD ["sleep", "infinity"]

Build and run:

$ docker build -t cap-demo .
$ docker run --rm -d --name c cap-demo
$ docker exec c capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
  cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,
  cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep

That is 14 capabilities, the Docker default capability set. This is not "full root." Root on the host holds every capability the kernel defines: about 40, and exactly 41 on Linux 5.9 and later (run sudo capsh --print outside the container to compare). Docker withholds the rest by default, including the dangerous ones (CAP_SYS_ADMIN, CAP_SYS_MODULE, CAP_SYS_PTRACE).

But 14 is still too many. Let's see what each does and drop the ones we don't need.
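To check these counts against what the kernel reports, the `CapEff` bitmask in `/proc/<pid>/status` can be decoded by hand. The sketch below assumes the capability numbering in `linux/capability.h` as of Linux 5.9 (41 entries). The mask `00000000a80425fb` is what the 14 capabilities listed above encode to, and `decode_caps` is an illustrative helper name, not part of any library.

```python
# Capability names indexed by bit number, per linux/capability.h (Linux 5.9+).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]

# Hex mask for the 14 capabilities in the capsh output above.
DOCKER_DEFAULT_MASK = "00000000a80425fb"


def decode_caps(mask_hex: str) -> list[str]:
    """Return the names of the capabilities whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


if __name__ == "__main__":
    default = decode_caps(DOCKER_DEFAULT_MASK)
    print(len(default), "of", len(CAP_NAMES), "capabilities")  # 14 of 41 capabilities
    print(",".join(default))
```

`capsh --decode=<mask>` from libcap2-bin does the same job on the command line; the Python version is only here to make the bit layout visible.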

What's actually in use

The CMD is sleep infinity. It needs no capabilities at all. Prove it:

$ docker run --rm --cap-drop=ALL cap-demo sleep 2
# (exits cleanly)

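But most real containers do something. Suppose we add a tiny HTTP server (the COPY line expects a server.py in the build context; the CMD here uses Python's built-in http.server instead):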

# v1 - a real workload
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 libcap2-bin && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY server.py .
CMD ["python3", "-m", "http.server", "8080"]

$ docker build -t cap-demo .
$ docker run --rm -p 8080:8080 --cap-drop=ALL cap-demo &
$ curl localhost:8080

Works. Python's HTTP server doesn't need any capabilities because port 8080 is unprivileged (≥1024).

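Try port 80. One catch first: since Docker 20.10, a container with its own network namespace starts with the sysctl net.ipv4.ip_unprivileged_port_start set to 0, so any port can be bound without a capability. To see the classic behaviour, reset the threshold to 1024:

$ docker run --rm -p 80:80 --cap-drop=ALL \
    --sysctl net.ipv4.ip_unprivileged_port_start=1024 \
    cap-demo python3 -m http.server 80
PermissionError: [Errno 13] Permission denied

Binding to ports below the threshold requires CAP_NET_BIND_SERVICE. Add it back, drop everything else:

$ docker run --rm -p 80:80 \
    --cap-drop=ALL --cap-add=NET_BIND_SERVICE \
    --sysctl net.ipv4.ip_unprivileged_port_start=1024 \
    cap-demo python3 -m http.server 80

Works. We're now running with exactly one capability, the minimum to do the job.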
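The port rule can be probed from inside a container without starting the HTTP server. This is a minimal sketch: `can_bind` and `unprivileged_port_start` are hypothetical helper names, and the threshold is read from the per-network-namespace sysctl file `/proc/sys/net/ipv4/ip_unprivileged_port_start`, which assumes Linux 4.11 or later. Where the file is missing, the sketch falls back to the traditional value of 1024.

```python
import socket
from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4/ip_unprivileged_port_start")


def unprivileged_port_start() -> int:
    """First port that can be bound without CAP_NET_BIND_SERVICE in this namespace."""
    try:
        return int(SYSCTL.read_text())
    except (OSError, ValueError):
        return 1024  # assumption: traditional default when the sysctl is unavailable


def can_bind(port: int) -> bool:
    """Try to bind a TCP socket to the port on all IPv4 interfaces."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind(("0.0.0.0", port))
            return True
        except PermissionError:
            # EACCES: the port is below the threshold and the capability is missing.
            return False
        # Other errors (for example "address already in use") propagate to the caller.


if __name__ == "__main__":
    print("unprivileged ports start at", unprivileged_port_start())
    for port in (80, 8080):
        print(port, "bindable" if can_bind(port) else "permission denied")
```

Copied into the image and run under different `--cap-drop` and `--cap-add` combinations, it shows which layer is actually letting the bind through.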

Map the capabilities to attacks

This is the part most tutorials skip. Here's why each Docker-default capability is actually dangerous if you don't need it:

Capability	What it allows	If an attacker gets RCE in the container
`CAP_CHOWN`	Change file ownership	Take over files in mounted volumes the container shouldn't own.
`CAP_DAC_OVERRIDE`	Bypass file read/write/execute permission checks	Read any file in mounted volumes, regardless of permissions.
`CAP_FOWNER`	Bypass permission checks on operations that normally require the file owner	Modify metadata on mounted files.
`CAP_FSETID`	Set setuid/setgid bits	Create privilege-escalation backdoors in writable mounted paths.
`CAP_KILL`	Send signals to any process	Kill other workloads sharing the same PID namespace.
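`CAP_SETUID` / `CAP_SETGID`	Change process UID/GID	Pivot to other user contexts within the container.
`CAP_SETPCAP`	Modify the process's own capability sets within its bounding set	Raise inheritable capabilities or loosen securebits to prepare a later escalation. It cannot restore capabilities already dropped from the bounding set.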
`CAP_NET_BIND_SERVICE`	Bind to ports <1024	Squat on a privileged port to MITM.
`CAP_NET_RAW`	Open raw sockets, send crafted packets	ARP spoofing, packet sniffing if not isolated by network namespace.
`CAP_SYS_CHROOT`	Use `chroot()`	Limited escape vectors.
`CAP_MKNOD`	Create device files	Create a node such as `/dev/sda`. Docker's device cgroup normally blocks opening it, but it becomes a raw-disk read if that allowlist is loosened.
`CAP_AUDIT_WRITE`	Write to kernel audit log	Spam audit logs to hide other activity.
`CAP_SETFCAP`	Set file capabilities	Persist privileges on dropped binaries.

CAP_NET_RAW and CAP_MKNOD are the ones most production guides specifically call out to drop.
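One way to act on the table is to check whether a running process can still reach those two capabilities. The sketch below parses the text of `/proc/<pid>/status` and tests a mask for CAP_NET_RAW (bit 13) and CAP_MKNOD (bit 27). It looks at `CapBnd` by default, because the bounding set is the ceiling on what the process can ever acquire, even when `CapEff` is empty. The function name and the sample text are illustrative.

```python
import re

# Bit numbers from linux/capability.h.
FLAGGED = {"CAP_NET_RAW": 13, "CAP_MKNOD": 27}

# Sample status excerpt: a root process under Docker's default capability set.
SAMPLE = (
    "Name:\tsleep\n"
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t00000000a80425fb\n"
    "CapEff:\t00000000a80425fb\n"
    "CapBnd:\t00000000a80425fb\n"
    "CapAmb:\t0000000000000000\n"
)


def flagged_caps(status_text: str, field: str = "CapBnd") -> list[str]:
    """Return the flagged capabilities set in one Cap* field of a status file."""
    match = re.search(rf"^{field}:\s*([0-9a-fA-F]+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError(f"{field} not found in status text")
    mask = int(match.group(1), 16)
    return [name for name, bit in FLAGGED.items() if (mask >> bit) & 1]


if __name__ == "__main__":
    print(flagged_caps(SAMPLE))  # ['CAP_NET_RAW', 'CAP_MKNOD']
    # A container started with --cap-drop=ALL --cap-add=NET_BIND_SERVICE:
    print(flagged_caps("CapBnd:\t0000000000000400\n"))  # []
```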

The drop-and-add pattern

Production-grade Dockerfile + run command:

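FROM debian:bookworm-slim
# Install the interpreter as root, before switching users.
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 && rm -rf /var/lib/apt/lists/*
# Unprivileged system account with a fixed UID.
RUN useradd -r -u 10001 app
COPY --chown=10001:10001 server.py /app/server.py
USER 10001
CMD ["python3", "/app/server.py"]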

$ docker run --rm \
    --cap-drop=ALL \
    --cap-add=NET_BIND_SERVICE \
    --read-only \
    --tmpfs /tmp \
    --security-opt=no-new-privileges \
    -p 80:80 \
    your-image

That's the minimum-privilege starting point. Walk the flags:

--cap-drop=ALL --cap-add=NET_BIND_SERVICE - exactly one capability.
--read-only - root filesystem is read-only; defeats most persistence.
--tmpfs /tmp - give the app a writable scratch space (it needs somewhere to write).
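--security-opt=no-new-privileges - sets the no_new_privs bit, so execve can no longer grant privileges through setuid binaries or file capabilities.
USER 10001 - non-root user inside the container. A non-root process starts with empty effective and permitted sets, so the added capability remains only in its bounding set. Binding port 80 as UID 10001 then relies on Docker's default net.ipv4.ip_unprivileged_port_start=0 sysctl (20.10 and later) rather than on CAP_NET_BIND_SERVICE. The layers overlap on purpose; this defense-in-depth matters.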
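The drop-and-add pattern is mechanical enough to generate. As a sketch, the hypothetical helper below turns the list of capabilities a workload needs into the argument list used above. `build_run_args` and its parameters are illustrative names; the flags themselves are the ones from the run command in this section.

```python
import shlex


def build_run_args(image: str, keep: list[str], ports: list[str]) -> list[str]:
    """Build `docker run` arguments that drop every capability except `keep`."""
    args = ["docker", "run", "--rm", "--cap-drop=ALL"]
    # Normalise names to the short upper-case form used in this page.
    for cap in sorted({c.upper().removeprefix("CAP_") for c in keep}):
        args.append(f"--cap-add={cap}")
    args += ["--read-only", "--tmpfs", "/tmp", "--security-opt=no-new-privileges"]
    for mapping in ports:
        args += ["-p", mapping]
    args.append(image)
    return args


if __name__ == "__main__":
    argv = build_run_args("your-image", ["cap_net_bind_service"], ["80:80"])
    print(shlex.join(argv))
```

Keeping the result as a list and passing it to `subprocess.run` unchanged avoids shell quoting problems; `shlex.join` is only used here for display.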

The trap

A common pattern: teams set --cap-drop=ALL for "their" containers and then leave third-party sidecar containers (logging agents, service mesh proxies) with default capabilities. The third-party containers are just as exploitable and often expose more attack surface (network exposure, mounted secrets). Audit them too.

The other trap: capability dropping only limits which privileged operations a process may ask the kernel to perform. It does not defend against container-escape vulnerabilities in the runtime or the kernel itself (e.g. runc CVEs). You still want additional barriers: a seccomp profile (next week), AppArmor/SELinux, and gVisor or Kata Containers for high-isolation needs.

Exercise

Take any container image you currently run. Inspect what capabilities it requests (docker inspect <container> | jq '.[].HostConfig | {CapAdd, CapDrop}') and what it actually holds (grep Cap /proc/<pid>/status from inside; decode a mask with capsh --decode=<mask>).
Write the smallest possible --cap-drop/--cap-add set that lets it still work. Document what each kept capability is for.
Repeat for a Kubernetes Pod. The same fields live in securityContext.capabilities per-container.
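For the first and third steps, a small script can turn `docker inspect` output into the resulting capability list and into the matching Kubernetes fragment. This is a sketch under stated assumptions: it reads only `HostConfig.CapAdd` and `HostConfig.CapDrop`, starts from the 14 defaults listed earlier, and does not model `--privileged` or `--cap-add=ALL`. The function names are hypothetical.

```python
import json

# The Docker default set from the capsh output earlier, in short form.
DOCKER_DEFAULTS = {
    "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL", "SETGID", "SETUID",
    "SETPCAP", "NET_BIND_SERVICE", "NET_RAW", "SYS_CHROOT", "MKNOD",
    "AUDIT_WRITE", "SETFCAP",
}


def _names(values) -> set[str]:
    """Normalise a CapAdd/CapDrop list (possibly null) to short upper-case names."""
    return {v.upper().removeprefix("CAP_") for v in (values or [])}


def granted_caps(inspect_json: str) -> dict[str, list[str]]:
    """Map container name -> capabilities left after applying CapDrop and CapAdd."""
    result = {}
    for entry in json.loads(inspect_json):
        host = entry.get("HostConfig", {})
        add, drop = _names(host.get("CapAdd")), _names(host.get("CapDrop"))
        if "ALL" in add:
            raise ValueError("--cap-add=ALL is not modelled in this sketch")
        base = set() if "ALL" in drop else DOCKER_DEFAULTS - drop
        result[entry.get("Name", "?")] = sorted(base | add)
    return result


def to_security_context(keep: list[str]) -> dict:
    """Express a kept-capability list as a per-container securityContext."""
    return {"capabilities": {"drop": ["ALL"], "add": sorted(keep)}}


if __name__ == "__main__":
    sample = (
        '[{"Name": "/c", "HostConfig": '
        '{"CapAdd": ["NET_BIND_SERVICE"], "CapDrop": ["ALL"]}}]'
    )
    for name, caps in granted_caps(sample).items():
        print(name, caps)
        print(json.dumps({"securityContext": to_security_context(caps)}, indent=2))
```

In practice the input would come from `docker inspect <container>` piped to the script; the inline sample only keeps the example self-contained.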

The main Week 14 chapter covers the underlying kernel model.
The Week 15 chapter on seccomp is the syscall-level companion to capability dropping.
Kubernetes Mastery - security context covers applying these in Pod specs.
Glossary: Capability, Seccomp, NoNewPrivs in the main glossary.