Worked example - Week 14: Linux capabilities in a real Dockerfile

Companion to Container Internals → Month 04 → Week 14: Capabilities. The week explains the Linux capability model and the difference between root inside a container and root with the full capability set. This page walks one Dockerfile through dropping capabilities one at a time so you can see which knobs do what.

Start from a normal container

# v0 - the default
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl iproute2 libcap2-bin && rm -rf /var/lib/apt/lists/*
CMD ["sleep", "infinity"]

Build and run:

$ docker build -t cap-demo .
$ docker run --rm -d --name c cap-demo
$ docker exec c capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
  cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,
  cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep

14 capabilities granted by default - the Docker default capability set. This is not "full root." A real root user on the host has ~40 capabilities (run capsh --print as root outside the container to compare). Docker already drops the rest by default, including the dangerous ones (CAP_SYS_ADMIN, CAP_SYS_MODULE, CAP_SYS_PTRACE).

But 14 is still too many. Let's see what each does and drop the ones we don't need.
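The capsh output above is a decoded view of the bitmasks the kernel exposes on the CapEff, CapPrm, and CapBnd lines of /proc/<pid>/status. As a minimal sketch, assuming the bit numbering from linux/capability.h up to CAP_CHECKPOINT_RESTORE (bit 40, kernel 5.9 and later), the following Python decodes such a mask. The helper names decode_caps and parse_cap_sets are illustrative, not part of any library.

```python
# Capability names indexed by bit number, following linux/capability.h.
# Assumption: bit 40 (checkpoint_restore) is the highest defined bit.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]

# Bitmask that corresponds to the 14 default capabilities listed above.
DOCKER_DEFAULT_MASK = 0xA80425FB


def decode_caps(mask: int) -> list[str]:
    """Return the capability names whose bits are set in mask, in bit order."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def parse_cap_sets(status_text: str) -> dict[str, int]:
    """Extract the Cap* bitmasks from the text of /proc/<pid>/status."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb"):
            sets[key] = int(value.strip(), 16)
    return sets
```

Inside the container, feeding the contents of /proc/self/status to parse_cap_sets and then decode_caps should reproduce the list that capsh --print shows.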

What's actually in use

The CMD is sleep infinity. It needs no capabilities at all. Prove it:

$ docker run --rm --cap-drop=ALL cap-demo sleep 2
# (exits cleanly)

But most real containers do something. Suppose we add a tiny HTTP server:

# v1 - a real workload
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 libcap2-bin && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY server.py .
CMD ["python3", "-m", "http.server", "8080"]

Rebuild and run with every capability dropped:

$ docker build -t cap-demo .
$ docker run --rm -p 8080:8080 --cap-drop=ALL cap-demo &
$ curl localhost:8080

Works. Python's HTTP server doesn't need any capabilities because port 8080 is unprivileged (≥1024).

Try port 80:

$ docker run --rm -p 80:80 --cap-drop=ALL cap-demo python3 -m http.server 80
PermissionError: [Errno 13] Permission denied

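Binding to ports below 1024 requires CAP_NET_BIND_SERVICE. (Docker Engine 20.10 and later set net.ipv4.ip_unprivileged_port_start=0 inside the container's network namespace, so there the bind succeeds even without the capability; pass --sysctl net.ipv4.ip_unprivileged_port_start=1024 to reproduce the error.) Add it back, drop everything else: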

$ docker run --rm -p 80:80 \
    --cap-drop=ALL --cap-add=NET_BIND_SERVICE \
    cap-demo python3 -m http.server 80

Works. We're now running with exactly one capability, the minimum to do the job.
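To check the same thing from inside a container without starting a full server, a small probe can attempt the bind and report whether the kernel refused it. This is a minimal sketch; can_bind is a hypothetical helper name, and it only distinguishes "permission denied" from success, letting other errors such as an already-used port propagate.

```python
import socket


def can_bind(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if this process may bind a TCP socket to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except PermissionError:
            # EACCES: the port is below net.ipv4.ip_unprivileged_port_start
            # and the process lacks CAP_NET_BIND_SERVICE.
            return False
        return True


if __name__ == "__main__":
    for port in (80, 8080):
        print(port, "bindable" if can_bind(port) else "permission denied")
```

Running it with and without --cap-add=NET_BIND_SERVICE shows whether the capability is what actually gates the bind on your Docker version.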

Map the capabilities to attacks

This is the part most tutorials skip. Here's why each Docker-default capability is actually dangerous if you don't need it:

| Capability | What it allows | If an attacker gets RCE in the container |
| --- | --- | --- |
| CAP_CHOWN | Change file ownership | Take over files in mounted volumes the container shouldn't own. |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks | Read or modify any file in mounted volumes, regardless of permissions. |
| CAP_FOWNER | Bypass permission checks on operations that normally require the file owner | Change mode bits and other metadata on mounted files it does not own. |
| CAP_FSETID | Keep setuid/setgid bits when a file is modified | Create privilege-escalation backdoors in writable mounted paths. |
| CAP_KILL | Bypass permission checks for sending signals | Kill other workloads sharing the same PID namespace. |
| CAP_SETUID / CAP_SETGID | Change process UID/GID | Pivot to other user contexts within the container. |
| CAP_SETPCAP | Modify the inheritable set and securebits flags; drop capabilities from the bounding set | Manipulate inheritable capabilities and securebits. It cannot by itself restore capabilities already removed from the bounding set. |
| CAP_NET_BIND_SERVICE | Bind to ports <1024 | Squat on a privileged port to MITM. |
| CAP_NET_RAW | Open raw sockets, send crafted packets | ARP spoofing, packet sniffing if not isolated by network namespace. |
| CAP_SYS_CHROOT | Use chroot() | Classic chroot tricks; a limited escape vector on its own. |
| CAP_MKNOD | Create device files | Create a node for /dev/sda and read the raw disk if the filesystem is writable and the device cgroup permits access. |
| CAP_AUDIT_WRITE | Write to kernel audit log | Spam audit logs to hide other activity. |
| CAP_SETFCAP | Set file capabilities | Attach file capabilities to binaries the attacker plants, so privileges persist. |

CAP_NET_RAW and CAP_MKNOD are the ones most production guides specifically call out to drop.

The drop-and-add pattern

Production-grade Dockerfile + run command:

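FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 && rm -rf /var/lib/apt/lists/*
RUN useradd -r -u 10001 app
USER 10001
COPY --chown=10001:10001 server.py /app/server.py
CMD ["python3", "/app/server.py"]

$ docker run --rm \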
    --cap-drop=ALL \
    --cap-add=NET_BIND_SERVICE \
    --read-only \
    --tmpfs /tmp \
    --security-opt=no-new-privileges \
    -p 80:80 \
    your-image

That's the minimum-privilege starting point. Walk the flags:

  • --cap-drop=ALL --cap-add=NET_BIND_SERVICE - exactly one capability.
  • --read-only - root filesystem is read-only; defeats most persistence.
  • --tmpfs /tmp - give the app a writable scratch space (it needs somewhere to write).
  • --security-opt=no-new-privileges - set the NoNewPrivs bit; even setuid binaries can't gain capabilities now.
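  • USER 10001 - non-root user inside the container. After exec, a non-root process normally holds no effective or permitted capabilities, so the --cap-add list acts as an upper bound, not a grant; binding port 80 as this user relies on Docker's net.ipv4.ip_unprivileged_port_start=0 default (Docker Engine 20.10+). This defense-in-depth matters.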
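If several services are launched from scripts, the flag set above can be generated rather than retyped. The following is a minimal sketch, assuming the same flags as the command above; docker_run_args is a hypothetical helper that only builds an argument list and does not invoke Docker.

```python
import shlex


def docker_run_args(image, keep_caps=(), ports=(), tmpfs=("/tmp",)):
    """Build a least-privilege `docker run` argument list.

    keep_caps: capability names to add back after --cap-drop=ALL.
    ports:     "host:container" port mappings.
    tmpfs:     writable scratch paths on the read-only root filesystem.
    """
    args = ["docker", "run", "--rm", "--cap-drop=ALL"]
    for cap in keep_caps:
        args.append(f"--cap-add={cap}")
    args.append("--read-only")
    for path in tmpfs:
        args += ["--tmpfs", path]
    args.append("--security-opt=no-new-privileges")
    for port in ports:
        args += ["-p", port]
    args.append(image)
    return args


if __name__ == "__main__":
    argv = docker_run_args("your-image", keep_caps=["NET_BIND_SERVICE"], ports=["80:80"])
    print(shlex.join(argv))
```

Keeping --cap-drop=ALL unconditional in the helper makes every kept capability an explicit, reviewable argument.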

The trap

Many teams set --cap-drop=ALL for "their" containers and then leave third-party sidecar containers (logging agents, service mesh proxies) with default capabilities. The third-party containers are just as exploitable and often have a larger attack surface (network exposure, mounted secrets). Audit them too.
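One way to run that audit is to feed the JSON from docker inspect into a short script that estimates each container's capability set from HostConfig.CapAdd, HostConfig.CapDrop, and HostConfig.Privileged. This is a sketch under stated assumptions: granted_caps and audit are hypothetical names, the logic approximates Docker's behavior instead of reproducing it exactly, and any container that is privileged or adds ALL is reported simply as "ALL".

```python
import json

# The 14 Docker default capabilities listed earlier on this page.
DOCKER_DEFAULT_CAPS = frozenset({
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL",
    "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE",
    "CAP_NET_RAW", "CAP_SYS_CHROOT", "CAP_MKNOD", "CAP_AUDIT_WRITE",
    "CAP_SETFCAP",
})


def _normalize(name: str) -> str:
    """Accept names with or without the CAP_ prefix, in any case."""
    name = name.upper()
    return name if name == "ALL" or name.startswith("CAP_") else "CAP_" + name


def granted_caps(host_config: dict) -> set:
    """Approximate the capabilities a container starts with."""
    add = {_normalize(c) for c in host_config.get("CapAdd") or []}
    drop = {_normalize(c) for c in host_config.get("CapDrop") or []}
    if host_config.get("Privileged") or "ALL" in add:
        return {"ALL"}  # simplification: treat as the full set
    base = set() if "ALL" in drop else set(DOCKER_DEFAULT_CAPS) - drop
    return base | add


def audit(inspect_output: str) -> dict:
    """Map container name to its sorted, estimated capability list."""
    return {
        c["Name"]: sorted(granted_caps(c["HostConfig"]))
        for c in json.loads(inspect_output)
    }
```

Piping the output of docker inspect for all running containers into audit gives a quick list of which sidecars still carry the full default set.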

The other trap: capability dropping only limits what a process may ask the kernel to do. It does not defend against container-escape vulnerabilities (e.g. runc CVEs). You still want additional barriers - a seccomp profile (next week), AppArmor/SELinux, and gVisor or Kata Containers for high-isolation needs.

Exercise

  1. Take any container image you currently run. Inspect what capabilities it requests (docker inspect <container> | jq '.[].HostConfig.CapAdd, .[].HostConfig.CapDrop') and what it actually holds (grep Cap /proc/<pid>/status from inside; decode a mask with capsh --decode=<hex>).
  2. Write the smallest possible --cap-drop/--cap-add set that lets it still work. Document what each kept capability is for.
  3. Repeat for a Kubernetes Pod. The same fields live in securityContext.capabilities per-container.
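For step 3, the Kubernetes equivalent of the docker run flags can be sketched as a Pod manifest. The example below builds it as a Python dict and prints JSON, which kubectl accepts alongside YAML. The Pod name, container name, and image are placeholders; the securityContext fields mirror the drop-and-add pattern from this page and are a starting point, not a complete hardening profile.

```python
import json


def least_privilege_pod(name: str, image: str, keep_caps=("NET_BIND_SERVICE",)) -> dict:
    """Build a Pod manifest that drops all capabilities except keep_caps."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "app",
                "image": image,
                "securityContext": {
                    # Kubernetes capability names omit the CAP_ prefix.
                    "capabilities": {"drop": ["ALL"], "add": list(keep_caps)},
                    "runAsNonRoot": True,
                    "runAsUser": 10001,
                    "readOnlyRootFilesystem": True,     # like --read-only
                    "allowPrivilegeEscalation": False,  # like no-new-privileges
                },
                # Writable scratch space, like --tmpfs /tmp.
                "volumeMounts": [{"name": "tmp", "mountPath": "/tmp"}],
            }],
            "volumes": [{"name": "tmp", "emptyDir": {"medium": "Memory"}}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(least_privilege_pod("cap-demo", "your-image"), indent=2))
```

Saving the output to a file and running kubectl apply -f on it should create the Pod; document each entry you keep in the add list, as in step 2.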
