Worked example - Week 14: Linux capabilities in a real Dockerfile
Companion to Container Internals → Month 04 → Week 14: Capabilities. The week explains the Linux capability model and the difference between root inside a container and root holding the full capability set. This page walks one Dockerfile through dropping capabilities one at a time so you can see which knobs do what.
Start from a normal container
# v0 - the default
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
curl iproute2 libcap2-bin && rm -rf /var/lib/apt/lists/*
CMD ["sleep", "infinity"]
Build and run:
$ docker build -t cap-demo .
$ docker run --rm -d --name c cap-demo
$ docker exec c capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,
cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
14 capabilities granted by default - the Docker default capability set. This is not "full root." A real root user on the host has ~40 capabilities (run capsh --print outside the container). Docker already drops ~26 of them by default, including the dangerous ones (CAP_SYS_ADMIN, CAP_SYS_MODULE, CAP_SYS_PTRACE).
But 14 is still too many. Let's see what each does and drop the ones we don't need.
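To make the "14 kept, the rest dropped" arithmetic concrete, here is a minimal Python sketch that turns the capsh list above into the hex bitmask the kernel reports and back again. The name table follows the bit order in the kernel's capability.h and assumes a kernel with 41 capabilities (Linux 5.9 or later); the helper names encode_caps and decode_caps are made up for this page.

```python
from __future__ import annotations

# Capability names in kernel bit order (bit 0 = cap_chown).
# Assumption: a 41-capability kernel (Linux 5.9+); older kernels have fewer.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]

# The 14 names printed by `capsh --print` in the v0 container above.
DOCKER_DEFAULT = [
    "cap_chown", "cap_dac_override", "cap_fowner", "cap_fsetid", "cap_kill",
    "cap_setgid", "cap_setuid", "cap_setpcap", "cap_net_bind_service",
    "cap_net_raw", "cap_sys_chroot", "cap_mknod", "cap_audit_write",
    "cap_setfcap",
]


def encode_caps(names: list[str]) -> int:
    """Set one bit per capability name."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_NAMES.index(name)
    return mask


def decode_caps(mask: int) -> list[str]:
    """Return the names whose bit is set in the mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


default_mask = encode_caps(DOCKER_DEFAULT)
dropped = [name for name in CAP_NAMES if name not in DOCKER_DEFAULT]

print(f"{default_mask:016x}")        # 00000000a80425fb
print(len(decode_caps(default_mask)), "kept,", len(dropped), "dropped")
print("cap_sys_admin dropped:", "cap_sys_admin" in dropped)
```

The mask format matches the Cap* lines in /proc/<pid>/status, so the same decode_caps helper can be pointed at a value copied from a running container.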
What's actually in use
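The CMD is sleep infinity, which needs no capabilities at all. To prove it, run the same image with --cap-drop=ALL added to the docker run line: the container starts and sleeps exactly as before.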
But most real containers do something. Suppose we add a tiny HTTP server:
# v1 - a real workload
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 libcap2-bin && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY server.py .
CMD ["python3", "-m", "http.server", "8080"]
Rebuilt and run with --cap-drop=ALL and -p 8080:8080, this works. Python's HTTP server doesn't need any capabilities because port 8080 is unprivileged (≥1024).
Try port 80:
$ docker run --rm -p 80:80 --cap-drop=ALL cap-demo python3 -m http.server 80
PermissionError: [Errno 13] Permission denied
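Binding to ports below 1024 requires CAP_NET_BIND_SERVICE by default. (Docker Engine 20.10 and later sets the net.ipv4.ip_unprivileged_port_start sysctl to 0 inside the container's network namespace, so on those versions the bind succeeds even with no capabilities; pass --sysctl net.ipv4.ip_unprivileged_port_start=1024 to reproduce the error.) Add the capability back, drop everything else: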
$ docker run --rm -p 80:80 \
--cap-drop=ALL --cap-add=NET_BIND_SERVICE \
cap-demo python3 -m http.server 80
Works. We're now running with exactly one capability, the minimum to do the job.
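If you would rather probe this from inside the container than read a traceback, the check can be sketched in a few lines of Python. This is only an illustration: can_bind is a hypothetical helper, not part of the course's server.py, and it treats PermissionError (the EACCES shown above) as "denied" while letting other errors, such as the port already being in use, propagate.

```python
import socket


def can_bind(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if this process is allowed to bind a TCP socket to port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Avoid spurious failures from sockets lingering in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except PermissionError:
            # EACCES: the kernel refused the bind, e.g. a privileged port
            # without CAP_NET_BIND_SERVICE.
            return False
        return True
```

Calling can_bind(8080) and can_bind(80) under different --cap-drop / --cap-add combinations shows directly which ports the current capability set permits.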
Map the capabilities to attacks
This is the part most tutorials skip. Here's why each Docker-default capability is actually dangerous if you don't need it:
| Capability | What it allows | If an attacker gets RCE in the container |
|---|---|---|
| CAP_CHOWN | Change file ownership | Take over files in mounted volumes the container shouldn't own. |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks | Read any file in mounted volumes, regardless of permissions. |
| CAP_FOWNER | Bypass permission checks on operations that normally require the file owner | Modify metadata on mounted files. |
| CAP_FSETID | Keep setuid/setgid bits when a file is modified | Create privilege-escalation backdoors in writable mounted paths. |
| CAP_KILL | Send signals to any process, bypassing the usual permission checks | Kill other workloads sharing the same PID namespace. |
| CAP_SETUID / CAP_SETGID | Change process UID/GID | Pivot to other user contexts within the container. |
| CAP_SETPCAP | Modify the process's inheritable and bounding capability sets and its securebits | Reshape capability sets so privileges survive exec (it cannot restore capabilities already removed from the bounding set). |
| CAP_NET_BIND_SERVICE | Bind to ports <1024 | Squat on a privileged port to MITM. |
| CAP_NET_RAW | Open raw sockets, send crafted packets | ARP spoofing, packet sniffing if not isolated by network namespace. |
| CAP_SYS_CHROOT | Use chroot() | Limited escape vectors. |
| CAP_MKNOD | Create device files | Create /dev/sda and read raw disk if filesystem isn't sealed. |
| CAP_AUDIT_WRITE | Write to kernel audit log | Spam audit logs to hide other activity. |
| CAP_SETFCAP | Set file capabilities | Attach file capabilities to attacker-planted binaries to persist privileges. |
CAP_NET_RAW and CAP_MKNOD are the ones most production guides specifically call out to drop.
The drop-and-add pattern
Production-grade Dockerfile + run command:
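FROM debian:bookworm-slim
# python3 is not in the slim base image; install it while still root
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 && rm -rf /var/lib/apt/lists/*
# Dedicated unprivileged system user with a fixed UID
RUN useradd -r -u 10001 app
USER 10001
COPY --chown=10001:10001 server.py /app/server.py
CMD ["python3", "/app/server.py"]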
$ docker run --rm \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
--read-only \
--tmpfs /tmp \
--security-opt=no-new-privileges \
-p 80:80 \
your-image
That's the minimum-privilege starting point. Walk the flags:
- --cap-drop=ALL --cap-add=NET_BIND_SERVICE: exactly one capability.
- --read-only: root filesystem is read-only; defeats most persistence.
- --tmpfs /tmp: give the app a writable scratch space (it needs somewhere to write).
- --security-opt=no-new-privileges: set the NoNewPrivs bit; execve can no longer grant privileges, so even setuid binaries can't gain capabilities now.
- USER 10001: non-root user inside the container. What a process can do is limited both by its capability sets and by ordinary UID-based permission checks, so dropping root adds a second, independent layer; this defense-in-depth matters.
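When the same flags have to be applied to many services, it can help to generate the command line rather than retype it. The following is a minimal sketch, not a recommended tool: hardened_run_args is a hypothetical helper, and it only emits the flags walked through above.

```python
import shlex


def hardened_run_args(image, keep_caps=(), tmpfs=("/tmp",), publish=()):
    """Build a `docker run` argv using the drop-ALL-then-add-back pattern."""
    args = ["docker", "run", "--rm", "--cap-drop=ALL"]
    # Add back only the capabilities the workload was shown to need.
    args += [f"--cap-add={cap}" for cap in keep_caps]
    args.append("--read-only")
    for path in tmpfs:
        args += ["--tmpfs", path]
    args.append("--security-opt=no-new-privileges")
    for mapping in publish:
        args += ["-p", mapping]
    args.append(image)
    return args


cmd = hardened_run_args(
    "your-image",
    keep_caps=["NET_BIND_SERVICE"],
    publish=["80:80"],
)
print(shlex.join(cmd))
```

Keeping keep_caps as an explicit argument means every service has to state which capabilities it keeps, which makes a natural place to record why each one is needed.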
The trap
Most production teams set --cap-drop=ALL for "their" containers and then leave third-party sidecar containers (logging agents, service mesh proxies) with default capabilities. The third-party containers are just as exploitable and often have more of an attack surface (network exposure, mounted secrets). Audit them too.
The other trap: capability dropping only limits which privileged operations a process may request from the kernel. It does not defend against container-escape vulnerabilities in the runtime or the kernel itself (e.g. runc CVEs). You still want additional barriers - a seccomp profile (next week), AppArmor/SELinux, and gVisor or Kata Containers for high-isolation needs.
Exercise
- Take any container image you currently run. Inspect what capabilities it requests (docker inspect <container> | jq '.[].HostConfig.CapAdd') and what its process actually holds (grep Cap /proc/<pid>/status from inside).
- Write the smallest possible --cap-drop / --cap-add set that lets it still work. Document what each kept capability is for.
- Repeat for a Kubernetes Pod. The same fields live in securityContext.capabilities per-container.
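For the first step, the Cap* lines in /proc/<pid>/status are hex bitmasks, one per capability set. A small Python sketch for reading them follows. The SAMPLE text is illustrative, not captured output: it is roughly what you might expect from a non-root process started with --cap-drop=ALL --cap-add=NET_BIND_SERVICE, and parse_cap_sets is a hypothetical helper name.

```python
from __future__ import annotations


def parse_cap_sets(status_text: str) -> dict[str, int]:
    """Extract the Cap* bitmasks from the text of /proc/<pid>/status."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Cap"):
            sets[key] = int(value.strip(), 16)
    return sets


def popcount(mask: int) -> int:
    """Number of capabilities set in a mask."""
    return bin(mask).count("1")


# Illustrative sample only (an assumption, not real captured output).
SAMPLE = """\
Name:\tpython3
CapInh:\t0000000000000000
CapPrm:\t0000000000000000
CapEff:\t0000000000000000
CapBnd:\t0000000000000400
CapAmb:\t0000000000000000
"""

for name, mask in parse_cap_sets(SAMPLE).items():
    print(f"{name}: {popcount(mask)} capabilities ({mask:#x})")
```

Inside a container you would pass the contents of /proc/self/status (or another PID's file) instead of SAMPLE. CapEff is the set checked on each privileged operation, while CapBnd is the upper limit on what the process could ever acquire, so comparing the two shows how much headroom is left.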
Related reading
- The main Week 14 chapter covers the underlying kernel model.
- The Week 15 chapter on seccomp is the syscall-level companion to capability dropping.
- Kubernetes Mastery - security context covers applying these in Pod specs.
- Glossary: Capability, Seccomp, NoNewPrivs in the main glossary.