Week 18 - Capabilities, Seccomp, no_new_privs
18.1 Conceptual Core
- Linux capabilities subdivide the historical "root" privilege into ~40 discrete capabilities (`CAP_NET_ADMIN`, `CAP_SYS_PTRACE`, `CAP_DAC_OVERRIDE`, etc.). A process holds bounding, effective, permitted, inheritable, and ambient sets.
- The principle: a service should hold only the capabilities it needs. A web server binding a port below 1024 needs `CAP_NET_BIND_SERVICE`, not full root.
- seccomp-bpf is a syscall-level allowlist/denylist enforced by a classic BPF (cBPF, not eBPF) filter program installed with `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)` or `seccomp(2)`. A killer feature for sandboxing.
- `no_new_privs` (`PR_SET_NO_NEW_PRIVS`): once set, neither the calling task nor its descendants can gain privileges across `execve()` via setuid binaries, file capabilities, or LSM transitions. Required before an unprivileged process (one without `CAP_SYS_ADMIN` in its user namespace) can install a seccomp filter, and a generally good default.
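To make the five capability sets and the `no_new_privs` and seccomp state concrete, the following minimal sketch decodes the relevant fields of `/proc/<pid>/status`. The capability table is a deliberately small subset of `linux/capability.h`, and the names `decode_caps`, `parse_status`, and `SAMPLE` are illustrative, not part of any tool.

```python
# Subset of capability bit numbers from linux/capability.h (illustrative, not complete).
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}
# Values of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}
CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def decode_caps(mask):
    """Turn a capability bitmask into a list of names, lowest bit first."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Bits missing from the subset table get a placeholder name.
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def parse_status(text):
    """Extract capability sets, NoNewPrivs, and Seccomp mode from status text."""
    fields = dict(
        line.split(":", 1) for line in text.splitlines() if ":" in line
    )
    report = {}
    for key in CAP_FIELDS:
        if key in fields:
            report[key] = decode_caps(int(fields[key].strip(), 16))
    if "NoNewPrivs" in fields:
        report["NoNewPrivs"] = fields["NoNewPrivs"].strip() == "1"
    if "Seccomp" in fields:
        mode = int(fields["Seccomp"].strip())
        report["Seccomp"] = SECCOMP_MODES.get(mode, "unknown")
    return report


# Hypothetical status excerpt for a service holding only CAP_NET_BIND_SERVICE.
SAMPLE = (
    "Name:\tdemo\n"
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t0000000000000400\n"
    "CapEff:\t0000000000000400\n"
    "CapBnd:\t0000000000001402\n"
    "CapAmb:\t0000000000000400\n"
    "NoNewPrivs:\t1\n"
    "Seccomp:\t2\n"
)

if __name__ == "__main__":
    for key, value in parse_status(SAMPLE).items():
        print(key, value)
```

On a live Linux host the same function can be fed the contents of `/proc/self/status` or `/proc/<pid>/status` in place of `SAMPLE`.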
18.2 Mechanical Detail
- `getcap`, `setcap` to manage file capabilities. `getpcaps <pid>` for process caps.
- systemd directives: `CapabilityBoundingSet=`, `AmbientCapabilities=`, `NoNewPrivileges=yes`, `SystemCallFilter=`, `SystemCallArchitectures=native`.
- `SystemCallFilter=@system-service` is a curated allowlist that covers most service workloads. Combine with explicit denylists (a further `SystemCallFilter=~...` line) for risky calls.
- For container runtimes, the Docker default seccomp profile (`/etc/docker/seccomp.json` or equivalent) is a reasonable baseline; understand why each blocked syscall is blocked.
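As an illustration of how the systemd directives above fit together, here is a small sketch that renders a `[Service]` drop-in from a list of capabilities. The function name `render_dropin` and its parameters are hypothetical; the directive names and the `~` deny prefix are systemd's own. It assumes the unit already sets a non-root `User=` elsewhere, since ambient capabilities only matter for a non-root service.

```python
def render_dropin(caps, syscall_allow="@system-service", syscall_deny=()):
    """Render a systemd [Service] drop-in that grants only `caps`.

    caps          -- capability names, e.g. ["CAP_NET_BIND_SERVICE"]
    syscall_allow -- allowlist passed to SystemCallFilter=
    syscall_deny  -- syscalls or @groups removed again via a '~' line
    """
    cap_list = " ".join(caps)
    lines = [
        "[Service]",
        "NoNewPrivileges=yes",
        # The bounding set caps what the service could ever acquire.
        f"CapabilityBoundingSet={cap_list}",
        # Ambient capabilities are what a non-root process actually starts with.
        f"AmbientCapabilities={cap_list}",
        "SystemCallArchitectures=native",
        f"SystemCallFilter={syscall_allow}",
    ]
    if syscall_deny:
        # A '~'-prefixed list after an allowlist removes entries from it.
        lines.append("SystemCallFilter=~" + " ".join(syscall_deny))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dropin(["CAP_NET_BIND_SERVICE"], syscall_deny=["@privileged", "@mount"]))
```

The output is meant to be reviewed and saved by hand as a drop-in for the unit being hardened; the sketch does not write any files itself.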
18.3 Lab - "Capabilities and Seccomp"
- Convert a service that runs as root to one that runs as a non-root user with only the minimum capabilities.
- Author a seccomp policy using libseccomp that allows only the syscalls the service uses. Verify by attempting denied syscalls.
- Apply via systemd `SystemCallFilter=` and confirm.
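Before writing the libseccomp policy, it can help to see the shape of the filter such a policy compiles down to. The sketch below is not libseccomp: it hand-builds a classic BPF allowlist as `struct sock_filter` tuples and dry-runs it in a tiny simulator, so denied syscalls can be checked without installing anything. The constants come from `linux/bpf_common.h`, `linux/seccomp.h`, and `linux/audit.h`; the syscall numbers are for x86_64; `build_allowlist`, `encode`, and `simulate` are illustrative names, and the simulator handles only the three opcodes used here.

```python
import struct

BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS: load a 32-bit word at offset k
BPF_JMP_JEQ_K = 0x15  # BPF_JMP | BPF_JEQ | BPF_K: jump if accumulator == k
BPF_RET_K = 0x06      # BPF_RET | BPF_K: return constant k

SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E
OFF_NR, OFF_ARCH = 0, 4  # field offsets inside struct seccomp_data
EPERM = 1


def build_allowlist(allowed_nrs, arch=AUDIT_ARCH_X86_64):
    """Return (code, jt, jf, k) tuples allowing only the given syscall numbers."""
    nrs = sorted(set(allowed_nrs))
    if len(nrs) > 255:
        raise ValueError("jump offsets are 8-bit; this linear sketch caps at 255")
    prog = [
        (BPF_LD_W_ABS, 0, 0, OFF_ARCH),
        (BPF_JMP_JEQ_K, 1, 0, arch),                  # right arch: skip the kill
        (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),  # foreign ABI: kill
        (BPF_LD_W_ABS, 0, 0, OFF_NR),
    ]
    n = len(nrs)
    for i, nr in enumerate(nrs):
        # On a match, jump over the remaining checks and the deny instruction.
        prog.append((BPF_JMP_JEQ_K, n - i, 0, nr))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM))  # default: EPERM
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    return prog


def encode(prog):
    """Pack instructions as struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def simulate(prog, nr, arch=AUDIT_ARCH_X86_64):
    """Dry-run the filter for one syscall and return the seccomp action value."""
    data = {OFF_NR: nr, OFF_ARCH: arch}
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = data[k]
        elif code == BPF_JMP_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


if __name__ == "__main__":
    prog = build_allowlist([0, 1, 60, 231])  # read, write, exit, exit_group
    print(hex(simulate(prog, 1)), hex(simulate(prog, 59)))  # write vs execve
```

The architecture check comes first because syscall numbers differ between ABIs. In the lab itself, libseccomp generates and loads an equivalent program, and the real verification is still to attempt a denied syscall in the confined process.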
18.4 Hardening Drill
- Review every long-running service on a host. For each: what capabilities does it actually need? Document. Tighten where possible.
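One way to start the review is to check which hardening directives a unit sets at all. The sketch below scans unit-file text (for example, text saved from `systemctl cat <unit>`) and reports the values found in its `[Service]` section. `audit_unit` and `HARDENING_KEYS` are illustrative names, and the parser is deliberately naive: it ignores line continuations and does not apply systemd's reset-on-empty-assignment semantics.

```python
HARDENING_KEYS = (
    "CapabilityBoundingSet",
    "AmbientCapabilities",
    "NoNewPrivileges",
    "SystemCallFilter",
    "SystemCallArchitectures",
)


def audit_unit(text):
    """Map each hardening directive to its list of values, or None if absent."""
    in_service = False
    seen = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("["):
            in_service = line == "[Service]"
            continue
        if in_service and "=" in line:
            key, value = line.split("=", 1)
            # A directive may appear several times (e.g. SystemCallFilter=).
            seen.setdefault(key.strip(), []).append(value.strip())
    return {key: seen.get(key) for key in HARDENING_KEYS}


def missing_directives(text):
    """List the hardening directives a unit does not set at all."""
    return [key for key, values in audit_unit(text).items() if values is None]
```

A directive being present does not mean it is tight enough; the output is only a worklist for the per-service question of which capabilities are actually needed.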
18.5 Performance Tuning Slice
- seccomp adds a small per-syscall cost. Measure with `perf stat -e 'syscalls:sys_enter_*'` before and after, and compare elapsed time against the syscall counts.
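As a rough complement to `perf stat`, a microbenchmark can estimate the per-call cost directly by timing a cheap syscall in a loop, once in an unfiltered process and once under the seccomp policy. This is a sketch under stated assumptions: `ns_per_call` is an illustrative name, `os.getppid` is chosen only because it is a trivial syscall, and the figure includes Python interpreter overhead, so only the difference between the two runs is meaningful.

```python
import os
import time


def ns_per_call(fn=os.getppid, iterations=200_000):
    """Average wall-clock nanoseconds per call of `fn` over `iterations` calls."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


if __name__ == "__main__":
    # Run once without the filter and once with it, then compare the two numbers.
    print(f"{ns_per_call():.1f} ns per getppid() call")
```

Repeating the measurement several times and comparing the minimum values reduces noise from scheduling and frequency scaling.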