Week 15 - Seccomp Profiles for Containers

15.1 Conceptual Core

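A seccomp profile is a JSON document that assigns an action to each syscall: allow, log, errno (fail the call with an error code), or kill (terminate the process), among others. The container runtime compiles the JSON into a BPF filter and installs it in the container process, via prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filter) or the equivalent seccomp(2) call. Once installed, a filter cannot be removed or loosened - a process can only stack further filters on top, and the kernel applies the most restrictive result.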
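The "only tightened" property can be illustrated with a small model. This is a sketch, not kernel code: it assumes each stacked filter has already produced a result for one syscall, and it uses the precedence order listed in the seccomp(2) man page. The function name effective_action is made up for this example.

```python
# Precedence of seccomp filter results, most restrictive first,
# in the order given by the seccomp(2) man page.
PRECEDENCE = [
    "KILL_PROCESS", "KILL_THREAD", "TRAP", "ERRNO",
    "USER_NOTIF", "TRACE", "LOG", "ALLOW",
]


def effective_action(filter_results):
    """Return the result the kernel acts on when several filters are stacked.

    filter_results holds one result per installed filter for the same syscall.
    """
    return min(filter_results, key=PRECEDENCE.index)


# The runtime's filter allows the call; a filter added later denies it.
print(effective_action(["ALLOW", "ERRNO"]))         # ERRNO
# A later filter cannot re-allow what an earlier one kills.
print(effective_action(["KILL_PROCESS", "ALLOW"]))  # KILL_PROCESS
```

Because the most restrictive result always wins, adding a filter can never turn a denied call into an allowed one.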

The default Docker profile allows most of the 300+ syscalls and blocks around 44 (the figure Docker's documentation gives): ones that app containers rarely need but attackers find useful, such as keyctl, kexec_load, and umount. Several kernel CVEs have been unexploitable from containers running under the default profile, even on unpatched hosts, because the vulnerable syscall was blocked. A tighter custom profile per service reduces the attack surface further.

15.2 Mechanical Detail

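Profile structure: a defaultAction that applies to any syscall not listed, plus per-syscall rules with optional argument matching. In the example below, the openat rule uses SCMP_CMP_MASKED_EQ, which matches when (arg & value) == valueTwo: the flags argument (index 2) is masked with 3 (O_ACCMODE) and compared with 0 (O_RDONLY), so only read-only opens are allowed.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    { "names": ["read", "write", "exit_group", "futex", "mmap"],
      "action": "SCMP_ACT_ALLOW" },
    { "names": ["openat"],
      "action": "SCMP_ACT_ALLOW",
      "args": [{"index": 2, "value": 3, "op": "SCMP_CMP_MASKED_EQ", "valueTwo": 0}] }
  ]
}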
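To make the rule-matching logic concrete, here is a simplified evaluator for a profile of this shape. It is a sketch under assumptions: it handles only two comparison operators, treats all conditions in a rule as ANDed, and returns the first matching rule's action, whereas a real runtime compiles the profile to BPF through libseccomp. The helper names (arg_matches, decide) and the inline profile are illustrative.

```python
import json

# A small profile defined inline so the example is self-contained.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
    {"names": ["openat"], "action": "SCMP_ACT_ALLOW",
     "args": [{"index": 2, "value": 3, "valueTwo": 0,
               "op": "SCMP_CMP_MASKED_EQ"}]}
  ]
}
""")

# Linux open(2) flag values (O_CLOEXEC as on x86-64).
O_RDONLY, O_WRONLY, O_CLOEXEC = 0o0, 0o1, 0o2000000
AT_FDCWD = -100


def arg_matches(cond, args):
    """Evaluate one argument condition against the syscall's arguments."""
    arg = args[cond["index"]]
    op = cond["op"]
    if op == "SCMP_CMP_EQ":
        return arg == cond["value"]
    if op == "SCMP_CMP_MASKED_EQ":
        # value is the mask, valueTwo is the expected masked result.
        return (arg & cond["value"]) == cond.get("valueTwo", 0)
    raise ValueError(f"operator not modelled in this sketch: {op}")


def decide(profile, name, args=()):
    """Return the action for a syscall: first matching rule, else the default."""
    for rule in profile["syscalls"]:
        if name in rule["names"] and all(
            arg_matches(cond, args) for cond in rule.get("args", [])
        ):
            return rule["action"]
    return profile["defaultAction"]


print(decide(PROFILE, "openat", (AT_FDCWD, 0, O_RDONLY | O_CLOEXEC)))  # SCMP_ACT_ALLOW
print(decide(PROFILE, "openat", (AT_FDCWD, 0, O_WRONLY)))              # SCMP_ACT_ERRNO
print(decide(PROFILE, "mount"))                                        # SCMP_ACT_ERRNO
```

The mask of 3 isolates the access-mode bits, so extra flags such as O_CLOEXEC do not affect the match.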

Generating profiles for a specific service:

- oci-seccomp-bpf-hook (Red Hat) - runs as an OCI hook when the container starts, records every syscall the container makes during a representative workload, and emits a JSON profile. The right tool for "what does this app actually need?"
- falcoctl / Falco artifacts - newer; supports community-shared profiles.
- Manual: strace -f -c -o trace.out <cmd> - summarize the syscalls made under load (including child processes), then deny everything else. Slower but no extra dependency.
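For the manual route, the step from a trace to a profile can be sketched as follows. This assumes full (non-summary) strace output written with -f and -o, without timestamp options, where each line starts with an optional PID followed by the syscall name. The sample lines are made-up input in that shape, and the function names are illustrative.

```python
import json
import re

# Syscall name at the start of a strace line, with an optional PID prefix
# (present when -f writes all processes to one file).
SYSCALL_RE = re.compile(r"^(?:\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(lines):
    """Collect the distinct syscall names seen in strace output lines."""
    names = set()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # skips signal lines, exit lines, and "<... resumed>" lines
            names.add(match.group(1))
    return names


def build_profile(names):
    """Build a default-deny profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": sorted(names), "action": "SCMP_ACT_ALLOW"}],
    }


# Illustrative input only, not a real trace.
sample = [
    '1234 openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3',
    '1234 read(3, "127.0.0.1 localhost\\n", 4096) = 20',
    '1235 futex(0x7f2a3c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>',
    '1235 <... futex resumed>) = 0',
    '1234 --- SIGCHLD {si_signo=SIGCHLD} ---',
    '1234 +++ exited with 0 +++',
]

print(json.dumps(build_profile(syscalls_from_strace(sample)), indent=2))
```

A profile built this way may still miss calls the container runtime itself makes after the filter is loaded, so it should be verified inside the actual container before use.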

Apply:

- Docker / podman: --security-opt seccomp=profile.json.
- Kubernetes pod spec: securityContext.seccompProfile.type: Localhost plus localhostProfile: profiles/myapp.json. The path is relative to the kubelet's seccomp directory (/var/lib/kubelet/seccomp/ by default), so the file must exist on every node that can run the pod.
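As an illustration of where the Kubernetes fields sit, the following sketch builds a minimal pod manifest as JSON, which kubectl accepts in place of YAML. The pod name and image are placeholders, and pod_manifest is a hypothetical helper, not part of any Kubernetes client library.

```python
import json


def pod_manifest(name, image, profile_path):
    """Build a minimal Pod manifest that requests a Localhost seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level securityContext applies the profile to every container.
            "securityContext": {
                "seccompProfile": {
                    "type": "Localhost",
                    # Relative to the kubelet's seccomp directory on the node.
                    "localhostProfile": profile_path,
                }
            },
            "containers": [{"name": name, "image": image}],
        },
    }


# Placeholder name and image; only the profile path comes from the text above.
manifest = pod_manifest("myapp", "registry.example.com/myapp:1.0",
                        "profiles/myapp.json")
print(json.dumps(manifest, indent=2))
```

The output can be piped to kubectl apply -f - once the profile file is present on the node.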

The trap

Recording a seccomp profile from a happy-path workload only. Edge-case paths (error handling, log rotation, graceful shutdown) use syscalls the happy path never touches; the profile denies them in production, and the service fails hours later with errors that do not obviously point to seccomp. Always run the recorder through your full integration-test suite, not just the smoke test.
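One way to guard against this trap is to record several runs (smoke test, integration suite, shutdown) and merge them, while also printing what the smoke test alone would have missed. The sketch below assumes each recording is a profile dict with SCMP_ACT_ALLOW rules. The syscall lists are invented for illustration, and merging this way drops any argument conditions.

```python
def allowed_names(profile):
    """Collect every syscall name a profile explicitly allows."""
    return {
        name
        for rule in profile.get("syscalls", [])
        if rule.get("action") == "SCMP_ACT_ALLOW"
        for name in rule.get("names", [])
    }


def merge_recordings(profiles):
    """Union the allowed syscalls of several recordings into one profile."""
    merged = set()
    for profile in profiles:
        merged |= allowed_names(profile)
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": sorted(merged), "action": "SCMP_ACT_ALLOW"}],
    }


def recording(names):
    """Helper for the example: wrap a name list as a recorded profile."""
    return {"defaultAction": "SCMP_ACT_ERRNO",
            "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}]}


# Hypothetical recordings: the full suite exercises shutdown and log rotation.
smoke = recording(["read", "write", "futex"])
full = recording(["read", "write", "futex", "rt_sigaction", "fsync", "renameat2"])

print(sorted(allowed_names(full) - allowed_names(smoke)))
# ['fsync', 'renameat2', 'rt_sigaction']
print(merge_recordings([smoke, full])["syscalls"][0]["names"])
```

A non-empty difference is a signal that the smoke-test recording alone would have produced a profile that breaks in production.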

15.3 Lab - "Custom Seccomp"

1. Run a service under oci-seccomp-bpf-hook (or strace -ff) and exercise it with your integration tests.
2. Generate a tight profile (default-deny + only the recorded syscalls).
3. Run with the profile; verify the service works under load.
4. Inject a "test" syscall (e.g., setns, unshare, or mount) the service doesn't legitimately use; verify it's blocked at runtime.
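The final lab step can be scripted with a small probe run inside the container. This is a sketch assuming a Linux container with Python available and a profile that denies unshare with an errno action; probe and try_unshare are names made up for this example. Under a kill action the process would receive SIGSYS instead of an error return.

```python
import ctypes
import errno
import os

CLONE_NEWUTS = 0x04000000  # flag value from <sched.h>


def probe(call):
    """Run call() and report whether the kernel refused it."""
    try:
        call()
    except OSError as exc:
        code = errno.errorcode.get(exc.errno, str(exc.errno))
        return f"blocked ({code})"
    return "allowed"


def try_unshare():
    """Attempt unshare(CLONE_NEWUTS) through libc; raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWUTS) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Inside the container, print(probe(try_unshare)) should report "blocked". EPERM can also come from a missing capability rather than from seccomp, so it helps to compare against a run without the profile or to check the kernel audit log.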

15.4 Hardening Drill

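For long-running services, ship the custom seccomp profile alongside the image: keep it in the same repository and release it under the same version. The runtime reads the profile from the host before the container starts, so a copy baked into the image (e.g., /seccomp/profile.json) serves only as a versioned source; for Kubernetes, the file has to be delivered to /var/lib/kubelet/seccomp/ on each node (for example, by a DaemonSet that copies it from a ConfigMap onto the host). Reference it in deployment configs. Version it with the code - a profile that goes stale relative to its app is worse than no profile, because it breaks new code paths while still looking like protection.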

15.5 Production Readiness Slice

Document a process: every new service must ship with a seccomp profile that is generated from a representative load test, reviewed by a peer, and committed to the repo. In pre-prod CI, run with the profile in audit-only mode (defaultAction set to SCMP_ACT_LOG, so unlisted syscalls are logged rather than denied), collect any unexpected syscalls, and fail the build if there are deltas from the committed profile.
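The CI gate described above could look roughly like this. It is a sketch that assumes an earlier step has already turned the logged audit records into a text file with one syscall name per line (audit records carry syscall numbers, so resolving them to names is a separate step). The file arguments and function names are hypothetical.

```python
import json
import sys


def unexpected_syscalls(profile, observed):
    """Return observed syscall names that the committed profile does not allow."""
    allowed = {
        name
        for rule in profile.get("syscalls", [])
        if rule.get("action") == "SCMP_ACT_ALLOW"
        for name in rule.get("names", [])
    }
    return sorted(set(observed) - allowed)


def main(profile_path, observed_path):
    """Compare a committed profile with observed syscalls; return an exit code."""
    with open(profile_path) as f:
        profile = json.load(f)
    with open(observed_path) as f:
        observed = [line.strip() for line in f if line.strip()]
    extra = unexpected_syscalls(profile, observed)
    for name in extra:
        print(f"unexpected syscall: {name}")
    return 1 if extra else 0  # non-zero exit fails the build


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical file names): check.py profile.json observed.txt
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Failing on any delta keeps the committed profile and the service's real behaviour from drifting apart without review.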