Container Internals¶
OCI, filesystems, runtimes, security, supply chain.
Container Internals & Runtimes-A 24-Week Mastery Roadmap¶
Authoring lens: Senior Container Architect.
Target outcome: A graduate of this curriculum can (a) build, run, and inspect containers without a Docker daemon-using runc, skopeo, buildah, and crun directly, (b) reason from OCI specs to wire-level container behavior, (c) ship hardened images with reproducible builds, SBOMs, and signed provenance, and (d) implement a "mini-Docker" demonstrating manual orchestration of namespaces, cgroups, and rootfs.
This is not "Docker in a week." It assumes the reader has used containers and is ready to read the OCI specs and runc source as primary literature.
Repository Layout¶
| File | Purpose |
|---|---|
| `00_PRELUDE_AND_PHILOSOPHY.md` | What containers actually are (and aren't); the shape of the OCI ecosystem. |
| `01_MONTH_OCI_FOUNDATIONS.md` | Weeks 1–4. OCI image + runtime specs, runc, crun, skopeo. |
| `02_MONTH_FILESYSTEMS_AND_BUILDS.md` | Weeks 5–8. OverlayFS, image layers, buildah, multi-stage, distroless. |
| `03_MONTH_RUNTIMES_AND_DAEMONS.md` | Weeks 9–12. containerd, CRI-O, podman, the no-daemon model, rootless. |
| `04_MONTH_SECURITY.md` | Weeks 13–16. Capabilities, seccomp, AppArmor/SELinux for containers, user namespaces. |
| `05_MONTH_SUPPLY_CHAIN.md` | Weeks 17–20. SBOM (Syft), vuln scanning (Grype/Trivy), signing (cosign), SLSA. |
| `06_MONTH_BUILD_YOUR_OWN.md` | Weeks 21–24. Mini-Docker capstone: Go or Rust implementation. |
| `APPENDIX_A_HARDENING.md` | Image hardening, runtime hardening, gVisor/Kata, rootless patterns. |
| `APPENDIX_B_REFERENCE_PATTERNS.md` | Common image patterns, multi-arch builds, debugging, CI/CD recipes. |
| `APPENDIX_C_CONTRIBUTING.md` | Contribution paths to runc, containerd, podman, buildah. |
| `CAPSTONE_PROJECTS.md` | Three tracks: mini-Docker, image scanning service, runtime fork. |
How Each Week Is Structured¶
- Conceptual Core-the why, with a mental model.
- Mechanical Detail-the how, down to spec section and source location.
- Lab-a hands-on exercise.
- Hardening Drill-a security-relevant micro-task that compounds.
- Production Readiness Slice-a CI/CD, registry, signing, or scanning task that builds a publishable template.
Each week is sized for ~12–16 focused hours.
Progression Strategy¶
```
OCI Foundations ──► Filesystems & Builds ──► Runtimes & Daemons
       │                     │                        │
       └──────────┬──────────┴────────────────────────┘
                  ▼
               Security
                  │
                  ▼
             Supply Chain
                  │
                  ▼
            Build Your Own
```
Prerequisites¶
- Comfortable on a Linux command line.
- Familiar with namespaces and cgroups at a basic level (see the Linux curriculum for the deep version).
- Reading-comfortable in C, Go, or Rust-the capstone language choice depends on this.
Capstone Tracks (pick one in Month 6)¶
- Mini-Docker-a from-scratch container runner in Go or Rust implementing namespaces, cgroups, OverlayFS, and a small subset of OCI spec.
- Image Scanning & Signing Service-an HTTP service that ingests images, runs Syft + Grype + Trivy, attaches signed SBOMs, gates promotion via cosign-based policy.
- Custom Runtime-fork `runc` (or write a `crun`-equivalent) adding one feature: a gVisor-style sandbox, a custom seccomp generator, or eBPF-based observability.
Details in CAPSTONE_PROJECTS.md.
Prelude-What Containers Actually Are¶
Sit with this document for an evening before week 1.
1. There Is No Such Thing As a "Container"¶
The kernel has no concept of "container." There is no `struct container` in `/proc`. What people call a container is a bundle of kernel features applied together to a process:
- One or more namespaces (PID, NET, MNT, UTS, IPC, USER, CGROUP) for isolation.
- One or more cgroups v2 for resource limits.
- A rootfs mounted as the process's `/`, usually via `pivot_root` and an OverlayFS stack.
- A seccomp filter restricting syscalls.
- An LSM label (SELinux/AppArmor) restricting object access.
- A capabilities mask restricting privilege.
A "container runtime" is a program that arranges these things from a specification (OCI runtime config), then `execve`s the user's command. That's it. Docker is not the OS. Containers are not VMs. There is no hypervisor-equivalent.
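A minimal illustration of that claim, using nothing but `unshare(1)` from util-linux-no runtime, no image (a sketch; the rootless `--map-root-user` variant assumes user namespaces are enabled on your kernel):

```shell
# A "container" with no container runtime: kernel features applied to a process.
# A user namespace maps your UID to 0, a UTS namespace isolates the hostname.
unshare --user --map-root-user --uts sh -c '
  hostname minicontainer   # visible only inside this namespace
  hostname                 # prints: minicontainer
  id -u                    # prints: 0 (mapped, not real root)
'
hostname                   # the host hostname is untouched
```

A real runtime does the same thing with more namespaces, cgroups, a pivoted rootfs, seccomp, and capabilities-but the mechanism is identical.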
If you internalize this, the rest of the curriculum is bookkeeping.
2. The OCI Layer Cake¶
The Open Container Initiative defines three specifications that everything in the ecosystem implements:
- Image Spec-what a container image is: a manifest, a config, a stack of layer tarballs, all addressable by content (SHA-256).
- Runtime Spec-what a runtime configuration is: a `config.json` + a rootfs directory.
- Distribution Spec-what a registry is: an HTTP API for pushing/pulling content-addressed blobs and manifests.
docker pull (image + distribution), docker run (runtime), docker build (image), docker push (distribution)-all four operations are OCI-spec'd. Once you can do them with runc, skopeo, and buildah, you understand the ecosystem.
3. The Tooling Map¶
| Concern | OCI-spec tool | Daemon-based equivalent |
|---|---|---|
| Build images | `buildah` | `docker build` |
| Pull / push / inspect images | `skopeo` | `docker pull/push/inspect` |
| Run containers (low-level) | `runc` / `crun` / `youki` | hidden under `dockerd` |
| Run containers (high-level) | `podman`, `nerdctl` | `docker` CLI |
| Container daemon | `containerd`, CRI-O | `dockerd` |
By month 4 you should never type docker again for any task in this curriculum, except to demonstrate equivalence with the daemon-based world.
4. Cost Model¶
A working container engineer reasons along five axes:
| Axis | Question |
|---|---|
| Image | What's in this image? What's its provenance, vulnerability surface, layer count, total size? |
| Runtime | What namespaces, cgroups, seccomp, capabilities, LSM are applied? |
| Filesystem | What's the storage driver (overlay2, fuse-overlayfs, btrfs)? Is the layered FS the bottleneck? |
| Network | What CNI? Bridge, macvlan, host-mode? What's the per-packet overhead? |
| Supply chain | Is it signed? Is the SBOM accurate? What's the SLSA provenance level? |
Beginner courses teach axis 1 only.
5. The Reading List¶
Primary
- The OCI specs themselves (opencontainers/image-spec, opencontainers/runtime-spec, opencontainers/distribution-spec). Each is short-read all three before week 1 ends.
- runc source (opencontainers/runc), particularly libcontainer/.
- containerd architecture docs (containerd/containerd/docs/).
- buildah and podman documentation.
- Container Security (Liz Rice). Best concise text on the security model.
Secondary
- Linux in Action (David Clinton)-chapters 8–10 if you want a softer on-ramp.
- The CNCF Cloud Native Glossary-terminology calibration.
- Aleksa Sarai's blog (cyphar.com)-runc maintainer; deep posts on rootless, user namespaces.
Adjacent (you must know)
- The Linux curriculum's namespaces and cgroups chapters. If you skip them, this curriculum will not stick.
6. Curriculum Philosophy¶
- Spec first, tool second. Whenever a behavior surprises you, the OCI spec is the source of truth. `docker run` with no flags hides ~50 default values; `runc` exposes them.
- Daemonless by default. All weekly labs target the daemonless toolchain (`buildah`, `skopeo`, `podman`, `runc`). Learn the canonical primitives; the daemon-based version is an ergonomic skin on top.
- Rootless by default once feasible. Modern Linux supports rootless containers via user namespaces. Practice it from week 9 onward.
7. What Containers Are Not For¶
- Hard isolation. A container is a process with namespaces. Kernel exploits cross containers. For untrusted multi-tenant code, use a VM-class isolation layer (gVisor, Kata, Firecracker)-covered in `APPENDIX_A`.
- Stateful systems with strict durability. Volume management adds complexity; production databases benefit from running outside containers (or with mature operators in K8s; see the Kubernetes curriculum).
- GUI applications. Possible (X11/Wayland forwarding) but rarely the right tool.
You are now ready for Week 1. Open 01_MONTH_OCI_FOUNDATIONS.md.
Month 1-OCI Foundations: Specs, runc, skopeo, crun¶
Goal: by the end of week 4 you can (a) hand-author an OCI runtime config.json and run a container with runc directly, (b) push, pull, and copy images between registries with skopeo without ever touching a daemon, (c) read an image manifest, layer hashes, and config blob, and (d) explain the difference between runc and crun and pick one for a workload.
Weeks¶
- Week 1 - The OCI Image Spec
- Week 2 - The OCI Runtime Spec, `runc`, and `crun`
- Week 3 - `skopeo` Deep Dive: Multi-Arch, Signing, Sync
- Week 4 - Image Internals: Manifest Lists, Index, Annotations, Sparse Pulls
Week 1 - The OCI Image Spec¶
1.1 Conceptual Core¶
- An OCI image is content-addressed: every blob (layer, config, manifest) is named by `sha256:<digest>`. Identity = content. Immutable.
- Top-level: a manifest lists the config and layers for one platform. An index lists multiple manifests for multi-platform images (`linux/amd64`, `linux/arm64`, etc.).
- Layers are tar archives representing filesystem changesets, often gzip-compressed (`application/vnd.oci.image.layer.v1.tar+gzip`).
- Configs are JSON documents describing entrypoint, env, working dir, exposed ports, volumes, labels.
1.2 Mechanical Detail¶
- The manifest schema (`image-spec/specs-go/v1/manifest.go` in the spec repo). Keys: `mediaType`, `schemaVersion`, `config` (descriptor), `layers` ([]descriptor), `annotations`.
- Descriptor structure: `mediaType`, `digest`, `size`, optional `urls` and `annotations`.
- The OCI layout on local disk: a directory with `oci-layout`, `index.json`, and `blobs/sha256/<digest>` files. Use `skopeo copy docker://nginx:latest oci:./nginx-layout:latest` and inspect.
- Distinguish the OCI mediaType from the older Docker v2.2 mediaType-they're nearly isomorphic but not identical. `skopeo` and modern registries handle both.
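The content-addressing rule is easy to verify by hand. A sketch (directory names follow the OCI layout convention; the fabricated config blob is illustrative, and a real layout also needs `oci-layout` and `index.json`):

```shell
# Every blob in an OCI layout is stored under its own SHA-256 digest.
mkdir -p layout/blobs/sha256
printf '{"architecture":"amd64","os":"linux"}' > config.json
digest=$(sha256sum config.json | cut -d' ' -f1)
mv config.json "layout/blobs/sha256/${digest}"

# Identity = content: re-hashing the blob reproduces its filename.
test "$(sha256sum "layout/blobs/sha256/${digest}" | cut -d' ' -f1)" = "$digest" \
  && echo "content-addressed: OK"
```

This is the whole integrity model: any tool that pulls a blob can (and must) verify it the same way.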
1.3 Lab-"An Image Without Docker"¶
- `skopeo copy docker://alpine:3.19 oci:./alpine-layout:3.19`. Inspect the layout. Read `index.json`, the manifest blob, the config blob.
- Find a layer blob, decompress, list its contents (`tar tzf <blob>`).
- Compute one of the layer digests yourself (`sha256sum`) and verify.
- Modify the config (e.g., change the entrypoint) by writing a new config blob, generating a new manifest, and updating `index.json`. Verify with `skopeo inspect oci:./alpine-layout:3.19`.
1.4 Hardening Drill¶
- Read CVE history of registry-side spec misinterpretations (e.g., the 2018 layer-extraction symlink attacks). Internalize that any tool processing untrusted images must validate paths during extraction.
1.5 Production Readiness Slice¶
- Spin up a local registry: `docker run -d --rm -p 5000:5000 registry:2` (or, true to the spirit of the curriculum, run it under `podman`). `skopeo copy oci:./alpine-layout:3.19 docker://localhost:5000/alpine:3.19`. You now have a registry you control.
Week 2 - The OCI Runtime Spec, runc, and crun¶
2.1 Conceptual Core¶
- A runtime bundle = a directory containing `config.json` (the runtime spec) + `rootfs/` (the filesystem to chroot/pivot_root into).
- `runc create <id>` reads `config.json`, sets up namespaces, cgroups, mounts, seccomp, capabilities, then waits. `runc start <id>` runs the configured command. `runc state <id>` shows status. `runc kill <id> SIGTERM` signals. `runc delete <id>` cleans up.
- Three production runtimes implement the OCI runtime spec: `runc` (Go, the reference; what Docker / containerd use by default), `crun` (C, faster startup, lower memory, default in Podman on RHEL/Fedora), and `youki` (Rust, gaining ground; the primary Rust implementation).
2.2 Mechanical Detail¶
- `config.json` schema (`runtime-spec/config.md`). Major sections:
  - `process` - args, env, user, capabilities, rlimits.
  - `root` - path to rootfs, `readonly` flag.
  - `mounts` - the list of mounts.
  - `linux.namespaces`, `linux.uidMappings`, `linux.gidMappings` - isolation.
  - `linux.resources` - cgroups settings.
  - `linux.seccomp` - the full seccomp filter.
  - `linux.maskedPaths`, `linux.readonlyPaths` - host-leakage hardening.
  - `hooks` - pre/post container lifecycle.
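A pared-down sketch of those sections (values are illustrative only-real `runc spec` output carries many more defaults; `jq` is assumed for the sanity check):

```shell
# A minimal, hand-readable config.json showing the major spec sections.
cat > config.json <<'EOF'
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["sh"],
    "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
    "user": { "uid": 0, "gid": 0 }
  },
  "root": { "path": "rootfs", "readonly": true },
  "linux": {
    "namespaces": [ { "type": "pid" }, { "type": "mount" }, { "type": "uts" } ],
    "maskedPaths": [ "/proc/kcore" ],
    "readonlyPaths": [ "/proc/sys" ]
  }
}
EOF
# Quick structural sanity check before handing it to a runtime.
jq -e '.root.readonly and (.linux.namespaces | length) == 3' config.json
```

Diffing a file like this against full `runc spec` output is the fastest way to learn which defaults the daemon-based world hides.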
2.3 Lab-"Run a Container Without Docker"¶
- Generate a default config: `runc spec` produces `config.json`.
- Build a rootfs: `mkdir rootfs && skopeo copy docker://alpine:3.19 oci:./alpine && umoci unpack --image ./alpine:3.19 ./bundle` (`umoci` gives you both rootfs + config in one step). Or do it manually.
- Run: `sudo runc run mycontainer`. You're inside the container.
- Modify the config to: drop all capabilities except `CAP_NET_BIND_SERVICE`, set a memory limit of 64M, mask `/proc/sys`. Re-run; verify with `cat /proc/self/status | grep Cap` and pressure tests.
- Repeat with `crun`. Time the startup difference (`time runc run` vs `time crun run`)-`crun` is typically 2–5× faster.
2.4 Hardening Drill¶
- Read the default seccomp profile in `runc`'s `libcontainer/seccomp/seccomp_default.go` (the equivalent profile is shipped with Docker as `default.json`). Note which syscalls it blocks. Review the spec's `linux.seccomp` schema and write a tighter custom profile for a specific service.
2.5 Production Readiness Slice¶
- Add an automated CI step that lints any custom `config.json` against the OCI spec schema. Use `runc spec --rootless` and study the differences vs the privileged config-this is the foundation for Month 3's rootless work.
Week 3 - skopeo Deep Dive: Multi-Arch, Signing, Sync¶
3.1 Conceptual Core¶
- `skopeo` is the image-manipulation tool that doesn't require a daemon or storage backend. It can copy between any of: `docker://` (registry), `oci:` (local OCI layout), `dir:` (raw blob dir), `containers-storage:` (local CRI-style store), `oci-archive:`, `docker-archive:`.
- `skopeo` is also how you do registry maintenance: mirror, sync, prune, and inspect manifests without pulling layers.
3.2 Mechanical Detail¶
- `skopeo inspect` - show a manifest without downloading layers. `--raw` gives the manifest as bytes, `--config` the config blob, `--format` takes Go templates for scripting.
- `skopeo copy --all` - for multi-platform images, copy the entire index (all platforms). Without `--all`, `skopeo` selects the running platform's manifest.
- `skopeo sync` - mirror a registry/repo to another registry or to a local OCI dir. The reference tool for air-gapped ops.
- `skopeo login`, `skopeo logout` - credentials in `${XDG_RUNTIME_DIR}/containers/auth.json`.
3.3 Lab-"A Daemonless Image Pipeline"¶
- Pull a multi-arch image as an OCI index. Inspect each per-platform manifest.
- Write a script that, given an image reference, prints a table of platforms, layer counts, total compressed/uncompressed sizes, and labels.
- Use `skopeo sync` to mirror three images into your local registry. Verify by pulling the mirrored versions.
- Compare `skopeo copy` of a 1-GB image with and without `--multi-arch index-only` on the destination side.
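The table-printing exercise can start from the raw index that `skopeo inspect --raw docker://<image>` emits. A sketch of the parsing step, run here against a fabricated two-platform index so it is self-contained:

```shell
# Stand-in for `skopeo inspect --raw <ref> > index.json` (fabricated data).
cat > index.json <<'EOF'
{
  "schemaVersion": 2,
  "manifests": [
    { "digest": "sha256:1111", "size": 529,
      "platform": { "os": "linux", "architecture": "amd64" } },
    { "digest": "sha256:2222", "size": 529,
      "platform": { "os": "linux", "architecture": "arm64", "variant": "v8" } }
  ]
}
EOF
# One row per platform: os, arch, digest.
jq -r '.manifests[] | [.platform.os, .platform.architecture, .digest] | @tsv' index.json
```

Layer counts and sizes come from a second `skopeo inspect` per-platform pass; the jq pattern is the same.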
3.4 Hardening Drill¶
- Configure `skopeo` to verify signatures on copy (via a `policy.json`, passed with `--policy`). The default policy is `insecureAcceptAnything`-change this in production.
3.5 Production Readiness Slice¶
- Build a CI job: on every release tag, copy the image from a "staging" registry path to a "production" path only after a Cosign signature is verified. Implementation in week 19 (`cosign verify`).
Week 4 - Image Internals: Manifest Lists, Index, Annotations, Sparse Pulls¶
4.1 Conceptual Core¶
- A manifest list / index points to per-platform manifests. The runtime selects the matching one. This is how `docker pull nginx` works on both ARM and x86.
- Annotations are a key/value sidecar on manifests, configs, and layers. Standardized keys: `org.opencontainers.image.source`, `.revision`, `.created`, `.licenses`, `.description`. Use them; downstream tools read them.
- Sparse / lazy pulls-eStargz and zstd:chunked formats let containers start before all layers are fully transferred. `containerd` snapshotters (`stargz-snapshotter`) implement this.
4.2 Mechanical Detail¶
- The index spec is in `image-spec/image-index.md`. Key field: `manifests[]` with `platform` descriptors (`os`, `architecture`, optional `variant`, `os.version`).
- Annotations propagate through: build → manifest → registry → consumer. `buildah` and `podman` set them automatically when given the right flags.
buildahandpodmanset them automatically when given the right flags. - eStargz: a TAR-compatible format with a footer containing per-file offsets. The snapshotter pulls only the metadata initially and fetches files on access.
4.3 Lab-"Build a Multi-Arch Image By Hand"¶
- Build an image for `linux/amd64` and `linux/arm64` separately (use `buildah --arch=` or `docker buildx`).
- Use `skopeo` to assemble a manifest list pointing to both.
- Push to your local registry.
- Pull from each architecture; verify the right manifest is selected.
- Add OCI annotations (`source`, `revision`, `created`); verify they survive the pipeline.
4.4 Hardening Drill¶
- Annotate every built image with provenance: source repo URL + commit SHA. This is the precursor to SLSA (week 19).
4.5 Production Readiness Slice¶
- Configure `containerd` (week 9) to use the `stargz-snapshotter`; measure container startup time for a large image (1+ GB) with vs without lazy pulling.
Month 1 Capstone Deliverable¶
An `oci-foundations/` workspace:
1. `runc-bundle/` - week 2's hand-rolled runtime bundle with hardening.
2. `daemonless-pipeline/` - `skopeo`-based image-handling scripts.
3. `multiarch-build/` - week 4's hand-assembled multi-arch image with annotations.
4. A `RUNBOOK.md` covering: registry setup, image inspection, signature verification flow.
Month 2-Filesystems and Image Builds¶
Goal: by the end of week 8 you can (a) construct an OverlayFS filesystem by hand and explain copy-up, (b) build images with buildah directly (no Dockerfile required), (c) author multi-stage Dockerfiles that produce minimal distroless images, and (d) reason about layer caching, build context, and reproducible builds.
Weeks¶
- Week 5 - OverlayFS and Storage Drivers
- Week 6 - `buildah`: Building Images Without Dockerfiles
- Week 7 - Multi-Stage Builds, Distroless, Minimal Images
- Week 8 - Layer Caching, Build Context, Reproducibility
Week 5 - OverlayFS and Storage Drivers¶
5.1 Conceptual Core¶
- A container's rootfs is built by stacking image layers via a union filesystem. The dominant driver on Linux is OverlayFS (in tree since 3.18). Layers are read-only lower dirs; the container's writable space is the upper dir; the visible merged view is the mount target.
- On any write to a file in a lower layer, the file is copied up to the upper layer first (copy-on-write). This is what makes layered images fast to start but slow to write large files modified from a lower layer.
- Other drivers: `aufs` (legacy), `btrfs` (snapshots), `zfs` (heavy), `devicemapper` (deprecated), `vfs` (no CoW; ultra-portable, ultra-slow).
5.2 Mechanical Detail¶
- `mount -t overlay overlay -o lowerdir=A:B:C,upperdir=U,workdir=W /merged`. `workdir` is required for OverlayFS bookkeeping; it must be on the same filesystem as `upperdir`.
- Whiteouts: a file deleted in the upper relative to the lower is represented by a char 0,0 device file. Listing/diff operations interpret these.
- Opaque directories: the `trusted.overlay.opaque="y"` xattr marks a dir whose lower contents should be hidden.
- The `containerd` snapshotter abstraction: each driver implements a `Snapshotter` interface; the snapshotter manages active and committed snapshots, writable layers, etc.
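The mount invocation above can be run end-to-end as follows (a sketch: needs root, or a user namespace on kernels that allow unprivileged overlay, roughly 5.11+; paths are illustrative):

```shell
mkdir -p lower upper work merged
echo "from lower" > lower/a.txt

sudo mount -t overlay overlay \
  -o lowerdir=lower,upperdir=upper,workdir=work merged

cat merged/a.txt        # served from the read-only lower dir
echo "edited" > merged/a.txt
ls upper/               # a.txt appeared here: copy-up on first write
rm merged/a.txt
ls -l upper/a.txt       # now a character 0,0 device: the whiteout
sudo umount merged
```

Every image layer your containers use is exactly a `lowerdir` entry in a stack like this.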
5.3 Lab-"OverlayFS By Hand"¶
- Create three lower dirs with different files. Mount as overlay. Verify merged view.
- Modify a file from the lower; observe copy-up in the upper.
- Delete a lower file from the merged view; observe the whiteout in the upper.
- Reproduce a "container layer": treat your container's tarball-extracted contents as a lower; create a fresh upper; mount; modify; tar up the upper to produce a new layer.
5.4 Hardening Drill¶
- Audit OverlayFS CVEs. The class of "container escape via crafted file in lower layer" has been exploited. Mitigations: rootless mode + user namespaces, or a sandbox layer (gVisor, Kata).
5.5 Production Readiness Slice¶
- Compare OverlayFS, fuse-overlayfs (rootless default), and the kernel's native rootless overlay (since 5.13) for a representative workload. Measure layer-creation, file-write, and read-many performance.
Week 6 - buildah: Building Images Without Dockerfiles¶
6.1 Conceptual Core¶
- A Dockerfile is one DSL for building images. It is not the only one. `buildah` exposes the underlying primitives: create a working container, run commands in it, copy files, set config, commit to an image.
- CI systems can build images without a privileged daemon.
- You can build images programmatically (e.g., from a Go program).
- You can construct images with stricter properties (provenance, reproducibility) than Dockerfiles natively allow.
6.2 Mechanical Detail¶
- The `buildah` API has Dockerfile-equivalent commands plus richer ones:
  - `buildah from` - start a working container from a base.
  - `buildah run` - run a command inside.
  - `buildah copy` - copy files in.
  - `buildah config --entrypoint='["..."]'` - set config.
  - `buildah commit` - produce an image.
  - `buildah unshare` - enter a user namespace; lets you operate on storage as "root" without being host-root. Foundation for rootless builds.
- `buildah build` (alias `buildah bud`) reads a Dockerfile and uses the same primitives.
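The primitives compose into a script like the following (a sketch, assuming `buildah` is installed; the base image, label, and paths are illustrative):

```shell
#!/bin/sh -e
# "Image as a shell script": one buildah primitive per Dockerfile verb.
ctr=$(buildah from docker.io/library/alpine:3.19)         # ~ FROM
buildah run "$ctr" -- apk add --no-cache ca-certificates  # ~ RUN
buildah copy "$ctr" ./app /usr/local/bin/app              # ~ COPY
buildah config --entrypoint '["/usr/local/bin/app"]' \
  --label org.opencontainers.image.source=https://example.com/repo "$ctr"
buildah commit "$ctr" localhost/app:dev                   # produce the image
buildah rm "$ctr"                                         # discard the working container
```

Because this is ordinary shell, you can loop, branch, and template image construction in ways a Dockerfile cannot.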
6.3 Lab-"Image as a Shell Script"¶
- Write a shell script that uses `buildah from`, `run`, `copy`, `config`, `commit` to produce a small Go-binary-on-`alpine` image. No Dockerfile.
- Add reproducibility flags: `--source-date-epoch`, `--timestamp`, the `SOURCE_DATE_EPOCH` env var. Build twice; verify hashes match.
- Build the same image with `buildah bud -f Dockerfile`. Compare hashes-they should be identical when both are reproducible.
6.4 Hardening Drill¶
- Build everything as a non-root user (`buildah unshare`, rootless mode). Confirm the storage location is in `~/.local/share/containers/`, not `/var/lib/containers`.
6.5 Production Readiness Slice¶
- Wire `buildah` into a CI job that targets `linux/amd64` and `linux/arm64` from the same x86 runner using `qemu-user-static`. Document the multi-arch build contract.
Week 7 - Multi-Stage Builds, Distroless, Minimal Images¶
7.1 Conceptual Core¶
- The point of a build image is to not be the runtime image. A modern image pipeline:
- Stage 1 (build): full build environment (compiler, headers, dev tools).
- Stage 2 (test): the build artifacts plus test runners.
- Stage 3 (runtime): a minimal image with just the artifact.
- Distroless images (Google's `gcr.io/distroless/*`) contain only the runtime dependencies-no shell, no package manager, no `cat`. Smaller attack surface, smaller image, faster startup.
- Static binaries (Go with `CGO_ENABLED=0`, Rust with musl, Java GraalVM native-image) can run on `scratch` (the empty base image): typically <20 MB total.
7.2 Mechanical Detail¶
- Multi-stage Dockerfile: each `FROM` opens a new stage; `COPY --from=<stage>` pulls artifacts forward; only the final stage ends up in the shipped image.
- Distroless variants: `static`, `base`, `cc`, `python3`, `java`, etc. Pick the smallest that works.
- The `nonroot` tag ensures the default user is UID 65532-never root.
- The `:debug` tag adds busybox for emergency debugging-use only for one-off triage in dev.
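The multi-stage pattern above can be sketched as an illustrative Dockerfile (base image tags, stage names, and paths are examples, not prescribed by this curriculum):

```dockerfile
# Stage 1: full toolchain; none of it ships.
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app .

# Final stage: only the static binary, running as UID 65532.
FROM gcr.io/distroless/static:nonroot
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
```

Everything in the `build` stage-compiler, module cache, source-is absent from the final image's layers.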
7.3 Lab-"Three Image Diet"¶
Take a Go (or Rust, or Python) service and produce three images:
1. Naive: FROM ubuntu, build inline. Measure size.
2. Distroless: multi-stage with gcr.io/distroless/static. Measure size.
3. Scratch: static build, FROM scratch. Measure size.
Document the size delta and any operational tradeoffs (e.g., scratch has no ca-certificates - `tls.Config` failures unless you `COPY --from=alpine /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/`).
7.4 Hardening Drill¶
- Run `docker scout cves` (or `trivy image`) on each variant; observe that scratch and distroless have ~zero CVEs from the base, while ubuntu/alpine have many. The CVEs aren't gone-the attack surface is reduced. Internalize the difference.
7.5 Production Readiness Slice¶
- Configure your CI to fail builds whose image grows by >5% vs the baseline. This forces conscious deltas; surprise growth is often a leaked dev tool.
Week 8 - Layer Caching, Build Context, Reproducibility¶
8.1 Conceptual Core¶
- Image-build performance is dominated by layer cache hit rate. A miss invalidates every subsequent layer; a hit reuses upstream work.
- The cache key is determined by: the parent layer's digest + the operation (the exact command, copy contents, build args). Order operations from least-frequently-changing to most-frequently-changing.
- Reproducible builds = byte-identical outputs from identical inputs. Requires: pinned base images (by digest, not tag), `SOURCE_DATE_EPOCH`, deterministic file ordering (`tar --sort=name`), no embedded build-host info.
8.2 Mechanical Detail¶
- `COPY` ordering: copy `go.mod`/`package.json`/`Cargo.lock` first, run dep install (cached on subsequent unrelated changes), then copy source. Saves dep-install time on every code-only change.
- BuildKit cache mounts (`RUN --mount=type=cache,target=/root/.cache/go-build`): persist a build directory across image builds, even when the surrounding layer is invalidated. Massive speedup for compiled-language workflows.
- `.dockerignore`: every byte sent to the daemon contributes to context size and may invalidate caches. Pattern after `.gitignore`.
- Pin base images by digest: `FROM golang:1.22@sha256:abc...`. Tag-based pins drift silently.
8.3 Lab-"Cache and Reproducibility"¶
- Take a non-trivial image; measure clean-build time and incremental-build time (single source change). Reorder Dockerfile to maximize cache hits; re-measure.
- Enable BuildKit cache mounts; measure again.
- Build the same image on two machines with `SOURCE_DATE_EPOCH` set; verify the digests match.
8.4 Hardening Drill¶
- Pin every base image by digest. Document a refresh policy (e.g., monthly digest-bump PRs reviewed for security advisories).
8.5 Production Readiness Slice¶
- Add a CI job that builds the image twice in fresh runners and asserts `digest_run1 == digest_run2`. Reproducibility regressions become P1 issues.
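Sketched as a CI step (assumes `buildah`; `--timestamp` pins file and image timestamps-verify the flag against your buildah version):

```shell
#!/bin/sh -e
# Build twice from scratch; identical inputs must yield identical image IDs.
export SOURCE_DATE_EPOCH=1700000000
id1=$(buildah bud -q --timestamp "$SOURCE_DATE_EPOCH" --no-cache -f Dockerfile .)
id2=$(buildah bud -q --timestamp "$SOURCE_DATE_EPOCH" --no-cache -f Dockerfile .)
if [ "$id1" = "$id2" ]; then
  echo "reproducible: $id1"
else
  echo "digest drift: $id1 != $id2" >&2
  exit 1
fi
```

`--no-cache` matters: a cache hit would trivially reproduce the digest without proving the build itself is deterministic.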
Month 2 Capstone Deliverable¶
A `filesystems-and-builds/` workspace:
1. `overlayfs-by-hand/` - week 5 lab.
2. `buildah-pipeline/` - week 6 daemonless build pipeline.
3. `three-image-diet/` - week 7 size comparison + tradeoff analysis.
4. `reproducible-build/` - week 8 with hash-equivalence CI gate.
Month 3-Runtimes and Daemons: containerd, CRI-O, podman, Rootless¶
Goal: by the end of week 12 you can (a) deploy and operate containerd directly, (b) explain the CRI (Container Runtime Interface) and how Kubernetes drives it, (c) run rootless containers fluently with podman, and (d) reason about runtime choices (runc vs crun vs gVisor vs Kata) for a workload.
Weeks¶
- Week 9 - `containerd` Architecture
- Week 10 - CRI-O and the Kubernetes CRI
- Week 11 - `podman` and the Rootless Model
- Week 12 - Sandboxed Runtimes: gVisor and Kata Containers
Week 9 - containerd Architecture¶
9.1 Conceptual Core¶
- `containerd` is a container daemon that manages: image pull/push, content storage, layered snapshotters, runtime invocation (via OCI-spec runtimes like `runc`/`crun`), and task/process management.
- It is not a monolithic daemon. Plugins (snapshotter, runtime, content store) are pluggable.
- `containerd` is what `dockerd` actually calls underneath; it is also the default runtime daemon in Kubernetes since 1.24.
9.2 Mechanical Detail¶
- Architecture (read
`containerd/containerd/docs/architecture.md`):
- Content store-content-addressed blob storage.
- Image store-refs and image metadata.
- Snapshotters-
`overlayfs`, `btrfs`, `zfs`, `stargz`, `devmapper`. Pluggable.
shimmodel (one shim process per container, decouples container lifecycle from daemon restart). - Tasks API-manage processes inside containers.
- Events API-pub/sub for lifecycle events.
- The
ctrCLI is a debugging tool, not a user-facing CLI. For users:nerdctl(Docker-compatible),crictl(CRI-level), or higher-level (Kubernetes, Buildah). - The shim (
containerd-shim-runc-v2) keeps the container alive acrosscontainerddaemon restarts. Each container has its own shim.
9.3 Lab-"containerd Without Kubernetes"¶
- Install
containerdandnerdctl. Configure/etc/containerd/config.toml. - Pull, run, exec, kill containers entirely via
nerdctl. Confirmdockerdis not running. - Enable the
stargz-snapshotter. Pull a large image with eStargz layers. Measure first-run startup time vs cold pull. - Use
ctrto inspect tasks, snapshots, and content blobs at the daemon level.
9.4 Hardening Drill¶
- Configure
containerdto use a custom seccomp profile and AppArmor (or SELinux) profile by default. Updateconfig.toml's[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]section.
9.5 Production Readiness Slice¶
- Wire
`containerd` metrics to Prometheus (`metrics.address` in config). Plot container start latency, image pull bytes, snapshotter ops/sec.
Week 10 - CRI-O and the Kubernetes CRI¶
10.1 Conceptual Core¶
- CRI (Container Runtime Interface) is Kubernetes's gRPC API for talking to a container runtime. Two implementations dominate:
`containerd` (with its CRI plugin) and CRI-O (built specifically for Kubernetes, Red Hat's choice).
- CRI is narrower than full container management: it covers what the kubelet needs and nothing else. No image-build, no high-level operations.
- CRI-O philosophy: minimum daemon surface, OCI-spec-only, no Docker compatibility.
10.2 Mechanical Detail¶
- CRI services:
`RuntimeService` (containers, sandboxes, exec/attach/portforward) + `ImageService` (pull, list, remove). Defined in `cri-api/pkg/apis/runtime/v1alpha2/api.proto`.
- The pause container (every k8s pod has one): holds the network namespace alive while application containers are restarted. CRI-O and containerd both use this pattern.
- `crictl` - a direct CRI client for debugging. `crictl ps`, `crictl images`, `crictl inspect`. Different from `kubectl`; it talks to the kubelet's runtime, not to the API server.
10.3 Lab-"CRI Direct"¶
- Install
CRI-O on a clean machine (or use `containerd` with its CRI plugin). - Use
`crictl runp` (run pod), `crictl create`, `crictl start` to manually launch a pod-equivalent without Kubernetes. Inspect with `crictl inspect`. - Add an OCI hook (e.g., a pre-start hook that logs every container) by configuring CRI-O's
`hooks_dir`.
10.4 Hardening Drill¶
- Set CRI-O's default seccomp to
`runtime/default`, AppArmor to `runtime/default` (Ubuntu) or SELinux to `container_t` (RHEL). These are the same defaults Kubernetes applies; understanding them at the runtime level demystifies pod-level errors.
10.5 Production Readiness Slice¶
- Set up
`cri-tools`/`critest` against your runtime; the suite verifies CRI compliance. A non-compliant runtime is a recipe for kubelet bugs in production.
Week 11 - podman and the Rootless Model¶
11.1 Conceptual Core¶
- `podman` is a daemonless, drop-in `docker` replacement. It runs containers as the calling user, with no long-running daemon. Each `podman run` is a fork-exec into a `conmon` supervisor + `runc`/`crun`.
- Rootless containers run as a non-root user on the host, with a user namespace mapping a host UID to UID 0 inside the container. The biggest single security win in the modern container ecosystem.
- Rootless is now mature: works with overlay (since kernel 5.11), works with networking via
`slirp4netns` (slow) or `pasta` (fast, kernel 6.0+).
11.2 Mechanical Detail¶
- Rootless storage:
`~/.local/share/containers/storage/`. Image and container state per-user. - Rootless networking:
`slirp4netns` - a userspace TCP/IP stack; works everywhere, slow (~1 Gbps). `pasta` - newer, kernel-bypass via vsock-like tricks; faster (~10 Gbps).
- `subuid`/`subgid` files (`/etc/subuid`, `/etc/subgid`) define the host UID range mapped into the user namespace. Default: 65536 IDs per user.
- `podman generate systemd` - produce systemd unit files for rootless containers. The recommended path for "always-on" rootless services.
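The mapping is easy to inspect directly (a sketch; `podman unshare cat /proc/self/uid_map` shows the same view for podman's own namespace):

```shell
# Show the subordinate-ID grant for the current user (format: user:start:count).
grep "^$USER:" /etc/subuid

# Enter a user namespace mapping this UID to 0, print the kernel's view of it.
unshare --user --map-root-user cat /proc/self/uid_map
# Typical single-line output: "0 <your-uid> 1" -
# UID 0 inside the namespace maps to your unprivileged UID outside.
```

With the subuid range configured, podman extends this single-entry map to the full 65536-ID range, which is what lets container images with many UIDs unpack rootlessly.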
11.3 Lab-"Rootless Production"¶
- As a non-root user, install
`podman`. Configure `/etc/subuid`, `/etc/subgid`.
podman play kube(Kubernetes-YAML-as-podman-input). - Generate systemd units; install with - -user
. The service starts at user login and persists across reboots (withloginctl enable-linger`). - Compare
slirp4netnsvspastanetworking throughput withiperf3.
11.4 Hardening Drill¶
- Confirm rootless containers cannot escape: try mounting host paths, accessing host devices, inspecting host processes. Each should fail (or be remapped innocuously via the user namespace).
11.5 Production Readiness Slice¶
- Convert one production-ish workload from rootful Docker/`runc` to rootless `podman`. Document the operational deltas (e.g., binding ports < 1024 needs `CAP_NET_BIND_SERVICE` or `sysctl net.ipv4.ip_unprivileged_port_start=80`).
Week 12 - Sandboxed Runtimes: gVisor and Kata Containers¶
12.1 Conceptual Core¶
- For untrusted workloads (multi-tenant SaaS, untrusted code execution), namespaces+cgroups+seccomp are not enough. The kernel attack surface is too large.
- Two production-grade alternatives:
  - gVisor (`runsc`) - a userspace kernel that intercepts syscalls. Lower overhead than VMs; more compatible than seccomp-based sandboxes. Used in App Engine and Cloud Run.
  - Kata Containers - runs each container (or pod) in a lightweight VM. Hardware-accelerated isolation; higher overhead but stronger guarantees. Used by Confidential Containers and Alibaba Cloud.
- Both are OCI-spec runtimes - drop-in replacements for `runc` in containerd/CRI-O. The OCI spec abstraction is what makes this possible.
12.2 Mechanical Detail¶
- gVisor (`runsc`):
  - The Sentry component implements a Linux-compatible kernel in user space.
  - The Gofer component proxies file I/O.
  - Configure via `runtimeClassName` in Kubernetes; configure containerd to register `runsc` as an additional runtime.
  - Performance: I/O-bound workloads suffer most (the gofer hop); CPU-bound workloads run near-native.
- Kata:
- Each container/pod gets its own micro-VM (Firecracker, Cloud Hypervisor, or QEMU).
- The kata-agent runs inside the VM; kata-runtime on the host orchestrates.
- Performance: ~10–20% overhead vs runc; sub-second VM boot via Firecracker.
12.3 Lab-"Two Sandboxes"¶
- Install gVisor. Register it as a containerd runtime. Run `nerdctl --runtime runsc` against a test workload.
- Install Kata. Register it as a containerd runtime. Run the same workload.
- Benchmark both vs `runc` for: startup time, a syscall-heavy workload (e.g., `find /usr -type f`), and a CPU-bound workload (e.g., `sysbench cpu`).
- Document the tradeoffs in a markdown matrix.
12.4 Hardening Drill¶
- Read the gVisor security model; identify the syscalls it does not implement (and would refuse). Compare to a default seccomp profile-gVisor is strictly stronger.
12.5 Production Readiness Slice¶
- Choose the right runtime for the right workload. Document a per-workload decision matrix in your team's runbook: trusted internal services → `runc`/`crun`, customer-supplied code → `runsc`, regulated/PCI workloads → Kata.
Month 3 Capstone Deliverable¶
A runtimes-and-daemons/ workspace:
1. containerd-direct/ - week 9 setup + Prometheus wiring.
2. crio-no-k8s/ - week 10 manual pod operations.
3. rootless-systemd/ - week 11 podman + systemd setup.
4. sandbox-bench/ - week 12 runc/runsc/kata comparison report.
Month 4-Container Security¶
Goal: by the end of week 16 you can (a) author seccomp profiles tailored to a service, (b) decompose Linux capabilities and assign minimum sets, (c) configure SELinux and AppArmor for containerized workloads, and (d) run rootless, user-namespaced containers as the default.
Weeks¶
- Week 13 - The Default Threat Model
- Week 14 - Capabilities for Containers
- Week 15 - Seccomp Profiles for Containers
- Week 16 - LSM for Containers: SELinux and AppArmor
Week 13 - The Default Threat Model¶
13.1 Conceptual Core¶
- The container default threat model is not "isolation comparable to a VM." It is "isolation comparable to a chroot with namespaces, seccomp, and capability dropping." That is good for most use cases but breaks down for:
- Untrusted code execution.
- Multi-tenant SaaS where tenants can submit code.
- Workloads that hold secrets cross-cutting tenants.
- For those, use a sandboxed runtime (Month 3 week 12).
- The default configuration in legacy Docker installs - rootful, capability-rich, weak seccomp - is the worst-case starting point. Modern tooling (rootless podman, distroless images, Kata) flips the defaults.
13.2 Mechanical Detail-Threat Surfaces¶
- Container escape via kernel exploit. Mitigation: keep kernel patched; gVisor/Kata for high-value targets.
- Container escape via misconfiguration. Most common: `--privileged`, a mounted Docker socket, host PID/NET namespaces, writable host paths.
- Image supply-chain attack. Mitigation: SBOM, signing, allowlisted base images. Month 5.
- Lateral movement via shared resources. Mitigation: PodSecurity policies, network policies, secret scoping.
- Resource exhaustion (DoS). Mitigation: cgroups v2 limits.
13.3 Lab-"Audit a Real Image"¶
- Pick a popular base image (`nginx`, `redis`). Scan with `docker scout cves` and `trivy image`. Record findings.
- Run with defaults; identify how many capabilities it has via `capsh --print`.
- Re-run with `--cap-drop=ALL`, `--security-opt=no-new-privileges`, and a read-only rootfs. Identify what breaks. Fix what's needed.
- Document the minimum config to run safely.
13.4 Hardening Drill¶
- Establish a baseline `docker run` (or `podman run`) policy: read-only rootfs, `--cap-drop=ALL`, `no-new-privileges`, default seccomp, non-root user, tmpfs for `/tmp`, memory and CPU limits.
13.5 Production Readiness Slice¶
- Author a `policy.json` for `skopeo`/`podman` that requires signed images for any registry path containing `prod`. Verification will be wired in week 19.
Week 14 - Capabilities for Containers¶
14.1 Conceptual Core¶
Linux capabilities subdivide root privilege into ~40 named caps (CAP_NET_ADMIN, CAP_SYS_PTRACE, etc.). Container runtimes apply a bounding set before exec - the container's processes can never gain a capability outside this set, regardless of UID 0.
The default Docker bounding set is ~14 caps: CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE. Each enables a specific class of syscalls.
The discipline: most workloads need zero capabilities. Drop everything (--cap-drop=ALL); add back only what testing proves you need.
14.2 Mechanical Detail¶
The capabilities you'll meet most often, with what they unlock:
- `CAP_NET_BIND_SERVICE` - bind to ports < 1024. Common for legacy services; modern apps run on 8080+ and skip this.
- `CAP_NET_ADMIN` - configure interfaces, iptables, routing. Network plugins (CNI) and service meshes need it. App containers should not.
- `CAP_NET_RAW` - open raw/packet sockets (ICMP, `ping`). Often dropped: most services don't need to ping anything from inside.
- `CAP_SYS_ADMIN` - the kitchen sink. ~40 different operations gated by it. Avoid at all costs; equivalent to root in many threat models.
- `CAP_SYS_PTRACE` - attach to other processes (debuggers, `strace`). Confine to debug containers only.
- `CAP_DAC_OVERRIDE` - bypass file permission checks. Almost always a sign of bad file ownership; fix the ownership instead.
Read what a running container actually has: `capsh --decode=$(grep CapEff /proc/<pid>/status | awk '{print $2}')`.
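The `capsh --decode` step can also be done by hand, which makes the bitmask nature of `CapEff` concrete. A Go sketch (the `DecodeCapMask` helper and its partial name table are illustrative, not libcap's API):

```go
package main

import (
	"fmt"
	"strconv"
)

// capNames maps capability bit positions to names (a subset; the full list
// lives in linux/capability.h and capabilities(7)).
var capNames = map[uint]string{
	0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 2: "CAP_DAC_READ_SEARCH",
	3: "CAP_FOWNER", 4: "CAP_FSETID", 5: "CAP_KILL",
	6: "CAP_SETGID", 7: "CAP_SETUID", 8: "CAP_SETPCAP",
	10: "CAP_NET_BIND_SERVICE", 12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
	16: "CAP_SYS_MODULE", 18: "CAP_SYS_CHROOT", 19: "CAP_SYS_PTRACE",
	21: "CAP_SYS_ADMIN", 27: "CAP_MKNOD", 29: "CAP_AUDIT_WRITE",
	31: "CAP_SETFCAP",
}

// DecodeCapMask turns a CapEff hex string (as found in /proc/<pid>/status)
// into the list of set capability names.
func DecodeCapMask(hexMask string) ([]string, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return nil, err
	}
	var caps []string
	for bit := uint(0); bit < 64; bit++ {
		if mask&(1<<bit) != 0 {
			name, ok := capNames[bit]
			if !ok {
				name = fmt.Sprintf("CAP_%d", bit)
			}
			caps = append(caps, name)
		}
	}
	return caps, nil
}

func main() {
	// 00000000a80425fb is the classic 14-cap Docker default bounding set.
	caps, _ := DecodeCapMask("00000000a80425fb")
	for _, c := range caps {
		fmt.Println(c)
	}
}
```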
Apply in Kubernetes via pod spec:
securityContext:
capabilities:
drop: ["ALL"]
add: ["NET_BIND_SERVICE"] # only if you actually need it
The trap
Granting CAP_SYS_ADMIN because "the container needs to mount something." 90% of the time the actual need is for CAP_SYS_CHROOT or a specific filesystem-related cap. SYS_ADMIN opens ~40 unrelated operations and is the single most-abused capability in misconfigured containers.
14.3 Lab - "Capability Diet"¶
- For three services (e.g., a Go HTTP server, an Nginx reverse proxy, a Node.js app), run each with `--cap-drop=ALL`. Identify what fails (the error usually names the blocked syscall - map it back to a capability via `capabilities(7)`).
- Add back capabilities one at a time. Document the minimum set per service.
- Configure your container runtime (podman, containerd) or pod-security policy to apply this minimum by default.
14.4 Hardening Drill¶
For any service requiring more than 3 caps, write a one-paragraph justification. If you can't justify, you don't need it. Common offenders worth re-auditing: anything inheriting from an old base image, anything that runs as root inside the container.
14.5 Production Readiness Slice¶
Add a CI step that fails any new image whose declared capability set exceeds the team's allowlist. Trivy, Kyverno, OPA, and Pod Security Standards (Kubernetes restricted profile) all support this check. The right gate: PR-time, not runtime - runtime is too late.
Week 15 - Seccomp Profiles for Containers¶
15.1 Conceptual Core¶
A seccomp profile is a JSON document describing per-syscall actions: allow, log, errno (return an error), or kill (terminate the process). The container runtime compiles the JSON to a BPF filter and applies it via `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filter)`. Once installed, the filter cannot be loosened - only tightened.
The default Docker profile allows ~310 syscalls and blocks ~50 (the ones rarely needed by app containers but useful to attackers - keyctl, kexec_load, umount, etc.). Multiple recent kernel CVEs have been blocked entirely by the default profile, even on unpatched hosts. Tighter custom profiles per service reduce attack surface further.
15.2 Mechanical Detail¶
Profile structure: a defaultAction plus per-syscall rules, with optional argument-value matching.
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{ "names": ["read", "write", "exit_group", "futex", "mmap"],
"action": "SCMP_ACT_ALLOW" },
{ "names": ["openat"],
"action": "SCMP_ACT_ALLOW",
"args": [{"index": 2, "value": 0, "op": "SCMP_CMP_MASKED_EQ", "valueTwo": 2}] }
]
}
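A runtime consumes a profile like the one above by unmarshalling it into typed structs before compiling the BPF filter. A Go sketch using a minimal subset of the OCI seccomp schema (struct names are illustrative; the runtime-spec `specs-go` package has the canonical types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal structs mirroring the OCI runtime-spec seccomp fields used above.
type SeccompArg struct {
	Index    uint   `json:"index"`
	Value    uint64 `json:"value"`
	ValueTwo uint64 `json:"valueTwo,omitempty"`
	Op       string `json:"op"`
}

type SeccompSyscall struct {
	Names  []string     `json:"names"`
	Action string       `json:"action"`
	Args   []SeccompArg `json:"args,omitempty"`
}

type SeccompProfile struct {
	DefaultAction string           `json:"defaultAction"`
	Syscalls      []SeccompSyscall `json:"syscalls"`
}

// ParseProfile unmarshals a profile; a real runtime would next hand each
// rule to libseccomp to build the BPF program.
func ParseProfile(data []byte) (*SeccompProfile, error) {
	var p SeccompProfile
	if err := json.Unmarshal(data, &p); err != nil {
		return nil, err
	}
	return &p, nil
}

func main() {
	profile := []byte(`{
	  "defaultAction": "SCMP_ACT_ERRNO",
	  "syscalls": [
	    {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
	  ]
	}`)
	p, err := ParseProfile(profile)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.DefaultAction, len(p.Syscalls))
}
```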
Generating profiles for a specific service:
- `oci-seccomp-bpf-hook` (Red Hat) - attach to a container, record every syscall it makes during a representative workload, emit a JSON profile. The right tool for "what does this app actually need?"
- `falcoctl` / Falco artifacts - newer; supports community-shared profiles.
- Manual: `strace -c -ff -o trace.out <cmd>` - enumerate syscalls under load, then deny everything else. Slower but no extra dependency.
Apply:
- Docker / podman: `--security-opt seccomp=profile.json`.
- Kubernetes pod spec: `securityContext.seccompProfile.type: Localhost` + `localhostProfile: profiles/myapp.json` (the kubelet looks under `/var/lib/kubelet/seccomp/`).
The trap
Recording a seccomp profile from a happy-path workload only. Edge cases (error handling, log rotation, graceful shutdown) need different syscalls; the profile blocks them in production and the service crashes mysteriously hours later. Always run the recorder through your full integration-test suite, not just the smoke test.
15.3 Lab - "Custom Seccomp"¶
- Run a service under `oci-seccomp-bpf-hook` (or `strace -ff`) and exercise it with your integration tests.
- Generate a tight profile (default-deny + only the recorded syscalls).
- Run with the profile; verify the service works under load.
- Invoke a "test" syscall (e.g., `setns`, `unshare`, or `mount`) the service doesn't legitimately use; verify it's blocked at runtime.
15.4 Hardening Drill¶
For long-running services, ship the custom seccomp profile alongside the image (e.g., as /seccomp/profile.json baked in, or as a ConfigMap mounted into /var/lib/kubelet/seccomp/). Reference it in deployment configs. Version it with the code - a profile that goes stale relative to its app is worse than no profile.
15.5 Production Readiness Slice¶
Document a process: every new service must ship with a seccomp profile generated from a representative load test, reviewed by a peer, committed to the repo. Pre-prod CI: run with the profile in audit-only mode (SCMP_ACT_LOG), collect any unexpected syscalls, fail the build if there are deltas from the committed profile.
Week 16 - LSM for Containers: SELinux and AppArmor¶
16.1 Conceptual Core¶
- SELinux and AppArmor add a third confinement layer (after capabilities and seccomp): what objects (files, sockets) the process may access.
- Most container runtimes apply a generic profile by default:
  - SELinux: the `container_t` type, with category-based isolation (multi-category security, MCS) so two containers on the same host can't access each other's files even if they share a UID.
  - AppArmor: `runtime/default` (a path-based profile generated by Docker/podman).
- Custom profiles tighten further; this is what hardened multi-tenant container hosts (OpenShift, Bottlerocket) ship.
16.2 Mechanical Detail¶
- SELinux for containers:
  - Each container gets a unique MCS category, applied via the `:s0:cN,cM` suffix on the type.
  - Policy: `container.te` from the `container-selinux` package.
  - Volume mounts must be relabeled (`:Z` mount option) or marked `:z` (shared label) - otherwise SELinux denies access.
  - Debugging: `ausearch -m AVC -ts recent`.
- AppArmor for containers:
  - Profile in `/etc/apparmor.d/`. Load with `apparmor_parser -r`.
  - Apply with `--security-opt apparmor=my-profile`.
  - Tooling: `aa-genprof`, `aa-logprof` to iterate.
16.3 Lab-"MAC Per Workload"¶
- On RHEL/Fedora: write a custom SELinux policy module for one service. Test enforcement.
- On Ubuntu/Debian: write an AppArmor profile for the same service. Test enforcement.
- Document the comparative effort and expressivity.
16.4 Hardening Drill¶
- Verify your hosts run with the LSM in enforcing mode (`getenforce`, `aa-status`). Permissive is for development only.
16.5 Production Readiness Slice¶
- Wire LSM denial alerts: ship `audit.log` (SELinux) or `kern.log` (AppArmor) to your central log aggregator; alert on AVCs from prod containers.
Month 4 Capstone Deliverable¶
A container-security/ workspace:
1. image-audit/ - week 13's audit findings + remediation.
2. cap-diet/ - week 14's capability matrix per service.
3. seccomp-profiles/ - week 15's per-service profiles, CI-validated.
4. lsm-profiles/ - week 16's SELinux + AppArmor profiles.
A THREAT_MODEL.md covering: what the model isolates against, what it doesn't, what to use instead for the gaps.
Month 5-Container Supply Chain: SBOM, Vulnerability Scanning, Signing, SLSA¶
Goal: by the end of week 20 you can (a) generate accurate SBOMs (Syft), (b) scan for CVEs (Grype, Trivy) and triage findings, (c) sign images and verify with cosign, and (d) target SLSA Level 3 in your build pipeline.
Weeks¶
- Week 17 - Software Bill of Materials (SBOM)
- Week 18 - Vulnerability Scanning: Grype, Trivy, Clair
- Week 19 - Signing and Verification: Cosign, Sigstore
- Week 20 - SLSA, Provenance, and Reproducibility
Week 17 - Software Bill of Materials (SBOM)¶
17.1 Conceptual Core¶
- An SBOM is a structured manifest of every component, dependency, and license in a software artifact. Two dominant formats: SPDX and CycloneDX. Both have JSON serializations and are interchangeable for most uses.
- Three levels of SBOM accuracy:
  - Source SBOM - generated from `go.mod`, `package.json`, etc., before the build.
  - Build SBOM - generated by the build tool (`buildah`, `goreleaser`, `docker buildx`).
  - Image SBOM - generated from a built image (Syft, Trivy). May differ from the source SBOM if the build introduces or strips dependencies.
- For compliance, ship the image SBOM attached to the image (as an attestation in the registry) and verify on consumption.
17.2 Mechanical Detail¶
- Syft (Anchore): `syft <image> -o spdx-json > sbom.json`. Inspects layers, identifies packages by ecosystem (apk, dpkg, rpm, npm, gomod, pip, gem, cargo, ...).
- Trivy also generates SBOMs (`trivy image --format cyclonedx`), with overlapping but slightly different package detection.
- OCI image artifacts - SBOMs can be attached to the image in the registry as separate artifacts via the OCI Reference Types specification. `cosign attach sbom` is the canonical command.
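The image-SBOM-vs-source-SBOM comparison (and the week's diff tooling) boils down to set arithmetic over `name@version` pairs. A Go sketch of the core, assuming package maps have already been extracted from the two SBOMs (the `DiffPackages` helper is hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// DiffPackages compares two SBOM package sets (name -> version) and reports
// what was added, removed, or changed between image versions.
func DiffPackages(prev, cur map[string]string) (added, removed, changed []string) {
	for name, v := range cur {
		if prevV, ok := prev[name]; !ok {
			added = append(added, name+"@"+v)
		} else if prevV != v {
			changed = append(changed, fmt.Sprintf("%s: %s -> %s", name, prevV, v))
		}
	}
	for name, v := range prev {
		if _, ok := cur[name]; !ok {
			removed = append(removed, name+"@"+v)
		}
	}
	// Sort for a stable, diff-friendly report.
	sort.Strings(added)
	sort.Strings(removed)
	sort.Strings(changed)
	return
}

func main() {
	prev := map[string]string{"openssl": "3.0.8", "zlib": "1.2.13"}
	cur := map[string]string{"openssl": "3.0.13", "busybox": "1.36.1"}
	a, r, c := DiffPackages(prev, cur)
	fmt.Println("added:", a)
	fmt.Println("removed:", r)
	fmt.Println("changed:", c)
}
```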
17.3 Lab-"SBOM Pipeline"¶
- Generate an SBOM for one of your images with Syft (SPDX) and Trivy (CycloneDX). Diff the two - note where they disagree.
- Attach the SBOM to the image with `cosign attach sbom`.
- From a downstream consumer, retrieve and parse the SBOM with `cosign download sbom`.
- Add a CI step that fails the build if the SBOM contains a known-bad license (e.g., AGPL in a closed-source project).
17.4 Hardening Drill¶
- Build a script that diffs SBOMs between two image versions and produces a "what changed" report. Use it in PR review for base-image bumps.
17.5 Production Readiness Slice¶
- Require an attached SBOM for every image promoted to a `prod/` registry path. Verify in CI before promotion.
Week 18 - Vulnerability Scanning: Grype, Trivy, Clair¶
18.1 Conceptual Core¶
- Scanners cross-reference image contents (or SBOMs) against vulnerability databases (NVD, distro-specific advisories, GitHub Security Advisories). They emit findings with CVE IDs, severities, and (sometimes) fixed versions.
- The discipline is triage, not zero-CVE. A `Critical` CVE in a package you don't actually exercise is still a finding, but lower priority than a `High` in the request-handling path.
- Tools:
  - Grype (Anchore) - SBOM-friendly; pairs with Syft.
  - Trivy (Aqua) - fast, broad ecosystem coverage; also handles config (Kubernetes YAML, Terraform).
  - Clair (Quay) - registry-side scanning; powers the scan UIs in Quay and Harbor.
18.2 Mechanical Detail¶
- Severity classifications: NVD CVSS v3 score → Critical (≥9.0), High (7.0–8.9), Medium (4.0–6.9), Low (<4.0). Project-specific scores may differ.
- Vulnerability Exploitability eXchange (VEX) - declares whether a CVE is actually exploitable in your context: `affected`, `not_affected`, `fixed`, `under_investigation`. Use OpenVEX or CSAF VEX to suppress non-exploitable findings without hiding them.
- Allowlist / ignore files - `.trivyignore`, `.grype.yaml`. Use sparingly; document each entry's rationale.
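The score-to-severity mapping just described is mechanical and worth encoding once. A Go sketch of the NVD CVSS v3 thresholds (`Severity` is an illustrative helper name):

```go
package main

import "fmt"

// Severity maps a CVSS v3 base score to the NVD qualitative rating.
// 0.0 is "None"; project-specific scores may differ from NVD's.
func Severity(score float64) string {
	switch {
	case score >= 9.0:
		return "Critical"
	case score >= 7.0:
		return "High"
	case score >= 4.0:
		return "Medium"
	case score > 0:
		return "Low"
	default:
		return "None"
	}
}

func main() {
	for _, s := range []float64{9.8, 7.5, 5.3, 3.1, 0} {
		fmt.Printf("%.1f -> %s\n", s, Severity(s))
	}
}
```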
18.3 Lab-"Triage in CI"¶
- Run Trivy on an image; produce a SARIF report. Upload to GitHub Code Scanning (or your scanner of choice).
- Pick three findings; for each, write a one-paragraph triage decision: fix, accept, or VEX-suppress.
- Author the VEX statement using `vexctl` (OpenVEX). Attach it to the image.
- Re-scan - verify the suppressed findings are now flagged as "not exploitable" rather than disappearing entirely.
18.4 Hardening Drill¶
- Set CI policy: builds fail on `Critical` or `High` vulns with available fixes. Builds warn (do not fail) on findings without fixes - but require a VEX statement within 7 days.
18.5 Production Readiness Slice¶
- Set up continuous re-scanning: nightly scans of all production images against the latest vulnerability database. New critical CVEs page on-call.
Week 19 - Signing and Verification: Cosign, Sigstore¶
19.1 Conceptual Core¶
- Sigstore is a free public-good infrastructure for software signing: keyless signatures via OIDC (your GitHub/Google identity becomes the signer), a transparency log (Rekor), and a CA (Fulcio).
- Cosign is the CLI: sign images, verify signatures, attach attestations (SBOMs, VEX, SLSA provenance), all backed by Sigstore by default-or a private key for offline use.
- The promise: every artifact you publish has a verifiable link back to who built it, what SBOM was attached, when, with cryptographic proof recorded in a public ledger.
19.2 Mechanical Detail¶
- `cosign sign <image>` - keyless signing via OIDC; opens a browser for auth; uploads a short-lived cert + signature to Rekor.
- `cosign verify --certificate-identity user@example --certificate-oidc-issuer https://accounts.google.com` - verifies signer identity and issuer.
- For private/offline use: `cosign generate-key-pair`; `cosign sign --key cosign.key`; `cosign verify --key cosign.pub`.
- Attestations - signed statements about an artifact: `cosign attest --predicate sbom.json --type spdx <image>`. The same mechanism carries the full SLSA provenance flow.
- Policy verification - `cosign verify` with policy: only specific signers, only via specific CI workflows (GitHub Actions OIDC subject `repo:org/repo:ref:refs/heads/main`).
19.3 Lab-"Signing Pipeline"¶
- Sign an image with cosign keyless (GitHub OIDC). Verify.
- Attach SBOM and VEX as attestations.
- Configure `policy-controller` (Sigstore's Kubernetes admission controller) to require a valid signature from your CI's OIDC subject before allowing deploys.
- Try to deploy an unsigned image - observe the rejection.
19.4 Hardening Drill¶
- Set registry retention policy: signed images permanent; unsigned images garbage-collected after 7 days. Forces a signing-or-discard discipline.
19.5 Production Readiness Slice¶
- Wire `cosign verify` into the `skopeo` `policy.json` - transparent verification on every pull. Document the disaster-recovery flow if your signing identity is compromised.
Week 20 - SLSA, Provenance, and Reproducibility¶
20.1 Conceptual Core¶
- SLSA (Supply chain Levels for Software Artifacts) is a graduated maturity model for build-pipeline integrity. Levels 1–4:
- L1: Build process documented; provenance recorded.
- L2: Hosted build service; signed provenance.
- L3: Hardened, isolated builds; non-falsifiable provenance.
- L4: Two-party review; reproducible.
- Most production CI/CD pipelines reach L2 with effort, L3 with discipline. L4 is rare.
- Provenance-a signed attestation describing how an artifact was built: source repo, commit, builder identity, build invocation, dependencies. The SLSA Provenance v1.0 schema is the standard.
20.2 Mechanical Detail¶
- GitHub Actions has a built-in OIDC token that includes the workflow's repo + ref + sha. `slsa-github-generator` consumes this to produce SLSA L3 provenance for releases.
- The `cosign attest --type slsaprovenance` flow attaches the provenance to the image.
- Reproducibility - ideally every commit produces a byte-identical artifact. `goreleaser` supports this; `buildah --source-date-epoch` plus pinned base images plus deterministic file ordering plus no embedded build-host info makes it possible.
20.3 Lab-"SLSA L3 in CI"¶
- Set up a GitHub Actions workflow that builds, scans, signs, and produces SLSA L3 provenance for an image on every release tag.
- Verify end-to-end: pull the image, retrieve its attestations, validate the provenance points back to the correct commit and CI run.
- Reproducibility: rebuild the same tag from a fresh runner; verify image digest stability.
20.4 Hardening Drill¶
- Document the kill chain: which exact component of your pipeline does an attacker compromise, and what does each SLSA level actually mitigate? Be concrete.
20.5 Production Readiness Slice¶
- Promotion gate: a Kubernetes admission policy (Kyverno, OPA, or `policy-controller`) that requires SLSA L3 provenance from your CI's identity for any production deploy.
Month 5 Capstone Deliverable¶
A supply-chain/ workspace:
1. sbom-pipeline/ - week 17 SBOM generation + attachment + diff tooling.
2. vuln-triage/ - week 18 scanner config + VEX statements.
3. cosign-flow/ - week 19 signing + admission verification.
4. slsa-l3/ - week 20 reproducible build with verified provenance.
A SUPPLY_CHAIN.md documenting the full provenance flow from source commit to running container.
Month 6-Build Your Own: Mini-Docker From Scratch¶
Goal: by the end of week 24 you have implemented a working "container runner" in Go or Rust, demonstrating manual orchestration of namespaces, cgroups, OverlayFS, and a small subset of OCI runtime spec. The artifact is a portfolio piece you can defend in any senior containers interview.
Weeks¶
- Week 21 - Scaffolding: Project Setup, OCI Bundle Reading
- Week 22 - Namespaces and Process Isolation
- Week 23 - Cgroups v2, Capabilities, Seccomp, OverlayFS
- Week 24 - Polish, Defense, Distribution
Week 21 - Scaffolding: Project Setup, OCI Bundle Reading¶
21.1 Conceptual Core¶
- The mini-Docker takes an OCI runtime bundle (a directory with `config.json` and `rootfs/`), sets up the appropriate kernel features, executes the configured command, and supervises until exit.
- Scope: the project will not implement all of the OCI runtime spec - focus on the core: namespaces, capabilities, mounts, cgroups v2 (memory + cpu + pids), seccomp.
- Two language tracks:
  - Go - leverages `runc`/`libcontainer` learnings, `golang.org/x/sys/unix` for syscalls. Closer to runc.
  - Rust - leverages the `nix` crate for syscalls; closer to youki. Stronger memory safety; uses `unsafe` sparingly.
21.2 Mechanical Detail¶
- Project layout (Go example):

    minidocker/
      cmd/minidocker/main.go   # CLI: create, start, run, kill, delete
      internal/
        bundle/    # parse config.json
        ns/        # namespace setup
        mount/     # rootfs mount, masked paths
        cgroup/    # cgroup v2 limits
        seccomp/   # filter compilation
        cap/       # capability dropping
        runtime/   # the orchestrator
      examples/
        bundle-alpine/
          config.json
          rootfs/  # umoci-extracted Alpine
- Subcommands:
  - `minidocker run` - create + start in one step (foreground).
  - `minidocker create <id>` / `start <id>` / `delete <id>` - split lifecycle.
  - `minidocker state <id>` - print state.
21.3 Lab-"Parse and Run"¶
- Implement `config.json` parsing (the `runtime-spec` repo has a Go reference type definition).
- Implement a no-isolation mode: just `chdir(rootfs)`, `chroot(rootfs)`, `execve`. Verify it runs.
- Add command-line plumbing for the lifecycle subcommands.
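The `config.json` parsing step can be sketched with a handful of struct types. This keeps only a minimal subset of the spec (the `runtime-spec` repo's `specs-go` package has the complete reference types; `LoadSpec` is an illustrative helper):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of the OCI runtime-spec config.json for week 21.
type Process struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

type Root struct {
	Path     string `json:"path"`
	Readonly bool   `json:"readonly"`
}

type Spec struct {
	OCIVersion string  `json:"ociVersion"`
	Process    Process `json:"process"`
	Root       Root    `json:"root"`
}

// LoadSpec parses a config.json and rejects bundles missing the two
// fields nothing can run without.
func LoadSpec(data []byte) (*Spec, error) {
	var s Spec
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	if s.Root.Path == "" || len(s.Process.Args) == 0 {
		return nil, fmt.Errorf("bundle config missing root.path or process.args")
	}
	return &s, nil
}

func main() {
	cfg := []byte(`{"ociVersion":"1.0.2","process":{"args":["/bin/sh"],"cwd":"/"},"root":{"path":"rootfs"}}`)
	s, err := LoadSpec(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.OCIVersion, s.Process.Args[0], s.Root.Path)
}
```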
21.4 Hardening Drill¶
- Validate `config.json` against the spec's JSON schema. Reject malformed bundles before any syscall.
21.5 Production Readiness Slice¶
- Add unit tests with a synthetic bundle. CI runs them on every commit.
Week 22 - Namespaces and Process Isolation¶
22.1 Conceptual Core¶
- The runtime needs to `clone` (or `fork`+`unshare`) into the configured namespaces, set up UID/GID maps for user namespaces, configure the UTS hostname, and `pivot_root` into the rootfs.
- The classic two-process pattern: the parent forks the child with `CLONE_NEWPID | CLONE_NEWNS | ...`; the parent writes UID/GID maps for the child; the child waits on a pipe for the parent to finish setup; the child performs final setup (mount `/proc`, `pivot_root`); the child execs.
22.2 Mechanical Detail¶
- In Go, `golang.org/x/sys/unix` has no direct `Clone` wrapper; use `syscall.SysProcAttr{Cloneflags: ...}` on an `exec.Cmd`, or the lower-level `syscall.Syscall(SYS_CLONE, ...)`.
- The `runc` "init" pattern: a self-re-exec into the binary with a sentinel argument signaling "I'm the container init." The first invocation does the setup; the re-exec performs `pivot_root` and the final `execve`. Read `runc/libcontainer/standard_init_linux.go`.
- UID/GID mapping: write to `/proc/<child-pid>/uid_map` and `gid_map`. For non-root parents, write `deny` to `setgroups` first.
- `pivot_root` requires a `mount(MS_PRIVATE)` of the parent mount before the call (to avoid leaking mounts back to the host).
22.3 Lab-"Namespaces Working"¶
- Implement the parent/child fork with clone flags. Verify `lsns -p <pid>` shows new namespaces.
- Implement `pivot_root` into the rootfs. Verify `/` inside the container is the bundle's `rootfs/`.
- Mount `/proc` inside the new PID namespace. Verify `ps` shows only the container's processes.
- Implement UID/GID mapping for user-namespaced runs.
22.4 Hardening Drill¶
- Mask `/proc/kcore`, `/proc/keys`, `/proc/timer_list`, `/proc/sched_debug`, `/proc/scsi`, `/sys/firmware`. Make `/proc/asound`, `/proc/bus`, `/proc/fs`, `/proc/irq`, `/proc/sys`, `/proc/sysrq-trigger` read-only. (Same as the runtime-spec's `maskedPaths` and `readonlyPaths`.)
22.5 Production Readiness Slice¶
- Run `runc`'s integration tests against your runtime if feasible (they're spec-compliance tests). At minimum, run a representative subset of the OCI runtime test suite.
Week 23 - Cgroups v2, Capabilities, Seccomp, OverlayFS¶
23.1 Conceptual Core¶
- The remaining isolation layers: cgroups for resource limits, capabilities for privilege restriction, seccomp for syscall filtering, OverlayFS for the rootfs (if not already prepared by `umoci`).
- Each layer applies at a specific lifecycle moment. Get the order wrong and the container starts but isolation is incomplete.
23.2 Mechanical Detail¶
- Cgroups v2:
  - Create `/sys/fs/cgroup/<container-id>/`.
  - Write `+memory +cpu +pids` to the parent's `cgroup.subtree_control`.
  - Write the child `<pid>` to the child cgroup's `cgroup.procs` after fork, before exec.
  - Set limits: `memory.max`, `memory.high`, `cpu.max` (`<quota> <period>`), `pids.max`.
- Capabilities: drop via `cap_set_proc` (libcap) or `prctl(PR_CAPBSET_DROP)` for the bounding set + `capset(2)` for the effective set. Apply after the setup syscalls that need them, before `execve`.
- Seccomp: compile the OCI seccomp profile to a BPF program (the `libseccomp` library does this). Apply with `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)`. Requires `no_new_privs` (`prctl(PR_SET_NO_NEW_PRIVS, 1)`).
- OverlayFS: if the bundle has separate lower/upper/work dirs, mount an overlay. Otherwise the rootfs is a single dir already - just bind-mount it.
23.3 Lab-"All The Layers"¶
- Implement cgroup v2 setup. Verify `memory.max=64M` actually limits the container.
- Implement capability dropping. Verify with `capsh --print` inside the container.
- Implement seccomp filter loading. Verify a denied syscall fails.
- (Optional) Implement OverlayFS rootfs construction from a multi-layer image.
23.4 Hardening Drill¶
- Port the default seccomp profile from `containers/common/pkg/seccomp` into your runtime as the default.
23.5 Production Readiness Slice¶
- Add chaos tests: deliberately malformed configs, runaway memory in the container, seccomp-blocked syscalls. Verify the runtime's error paths are clean.
Week 24 - Polish, Defense, Distribution¶
24.1 Conceptual Core¶
The final week is polish. Integration tests, documentation, performance profiling, and a publishable release.
24.2 Mechanical Detail-Polish Checklist¶
- All OCI lifecycle commands implemented (`create`, `start`, `state`, `kill`, `delete`).
- State persistence: `/run/minidocker/<id>/state.json` so `state` works across a supervisor restart.
- Hooks: `prestart`, `poststart`, `poststop` from the OCI spec (at least skeletal support).
- Console / TTY support if the spec sets `terminal: true`.
- Cleanup on error: cgroups removed, mounts unmounted, namespaces released.
24.3 Lab-"Defend the Project"¶
Schedule a 45-minute mock review:
- Live demo: build, run a container, exec into it, observe isolation.
- Walk through the lifecycle code with the OCI spec open beside it.
- Demo a hardened run (cgroups + caps + seccomp + LSM) and verify isolation.
- Compare with runc/crun: what's missing? What's different? Why is your design simpler?
24.4 Hardening Drill¶
- Run your runtime against `runc`'s integration test suite (where applicable). Document which subset passes; explain the gaps.
24.5 Production Readiness Slice¶
- Tag `v0.1.0`. Generate a release artifact (`goreleaser` for Go; `cargo dist` for Rust). Sign it with cosign. Publish.
Month 6 Deliverable¶
The mini-Docker, plus the aggregated container-mastery/ repo containing every prior month's deliverable.
Appendix A-Container Hardening Reference¶
Cumulative hardening checklist. By week 24 the reader's container-baseline/ template should encode every section.
A.1 Image Hardening¶
- Use distroless or scratch base images.
- Pin base by digest, not tag.
- Multi-stage build; runtime image has no compiler, no shell (unless required).
- Run as a non-root user (UID > 0); set an explicit `USER`.
- No secrets baked into the image; verify with `trivy image --scanners secret`.
- Every installed package has a justification.
- OCI annotations: `source`, `revision`, `created`, `licenses`.
- Reproducible: identical commit → identical digest.
- SBOM attached as attestation.
- Signed with cosign.
- CI gate: vulnerabilities triaged or VEX'd.
A.2 Runtime Hardening (per docker run / podman run / Kubernetes pod)¶
- `--read-only` rootfs.
- `tmpfs` for `/tmp`, `/run`.
- `--cap-drop=ALL`; add only what's needed.
- `--security-opt=no-new-privileges`.
- Custom seccomp profile (not the default if you can do better).
- LSM profile (SELinux: `container_t`; AppArmor: `runtime/default` or custom).
- Resource limits: `--memory`, `--cpus`, `--pids-limit`.
- Network: don't use `--network=host`. Use bridges or CNI.
- No `--privileged`. No `--pid=host`. No `--ipc=host`.
- No Docker socket mounted into the container.
- Volume mounts: only what's needed; consider `:ro`, `:Z` (SELinux relabel).
- Run the container in a user namespace (`--userns=auto` in podman; rootless by default).
A.3 Daemon / Host Hardening¶
- Rootless runtime (`podman` rootless mode).
- If using a daemon: `dockerd` or `containerd` running as a hardened systemd unit (apply the `Linux/05_security_and_hardening.md` directives).
- LSM enforcing on the host.
- Modern kernel (≥ 5.10 for full rootless features).
- OverlayFS instead of `vfs`.
- Disk quotas (`xfs_quota`) or `--storage-opt overlay.size=` for runaway-image protection.
A.4 Sandbox Tier (when needed)¶
For untrusted code:
- gVisor (`runsc`): registered as a containerd runtime. Mark untrusted workloads with `runtimeClassName: gvisor` (Kubernetes) or `--runtime=runsc` (nerdctl).
- Kata Containers: per-container micro-VM. Higher overhead, stronger isolation.
- Firecracker: micro-VM as a runtime. Used by AWS Lambda.
Decision matrix:

| Workload | Runtime |
|---|---|
| Trusted internal services | runc / crun |
| Customer-supplied code | runsc (gVisor) |
| Regulated / multi-tenant | Kata |
| Serverless, very short-lived | Firecracker / runsc |
A.5 Supply-Chain Hardening¶
- All images signed with cosign.
- All images have attached SBOMs.
- All images have SLSA L3 (or higher) provenance.
- Admission policy verifies signatures at pull time.
- Vulnerability scanning is continuous; new critical findings page the on-call.
- VEX statements for non-exploitable findings.
- Registry retention: signed images retained, unsigned discarded.
A.6 The container-baseline/ Template¶
container-baseline/
  Dockerfile.template    # multi-stage, distroless final
  build.sh               # buildah-based reproducible build
  cosign-policy.yaml     # admission policy (signature + SLSA)
  seccomp/
    default.json         # tighter than runtime/default
  apparmor/              # per-service profiles
  selinux/               # type-enforcement modules
  ci/
    build.yml            # build + scan + sign + attest
    promote.yml          # verify + tag + push to prod registry
  scripts/
    audit-image.sh       # trivy + grype + syft, formatted
  RUNBOOK.md
  THREAT_MODEL.md
Every image you ship after week 24 should be built using this template.
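As a sketch of the shape (not a production profile), `seccomp/default.json` follows the standard OCI seccomp format; the syscall allowlist here is illustrative and far too short for a real workload:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat", "mmap",
                "mprotect", "brk", "futex", "nanosleep", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

With `SCMP_ACT_ERRNO`, a denied syscall returns an error instead of killing the process, which makes the profile debuggable while you tighten it.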
Appendix B-Reference Patterns¶
Reference recipes for the patterns you'll reach for repeatedly.
B.1 Multi-Stage Distroless (Go)¶
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/root/.cache/go-build \
--mount=type=cache,target=/go/pkg/mod \
go mod download
COPY . .
RUN --mount=type=cache,target=/root/.cache/go-build \
CGO_ENABLED=0 go build -trimpath -ldflags="-s -w" -o /out/app ./cmd/app
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /out/app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]
B.2 Multi-Stage Distroless (Rust)¶
FROM rust:1.78 AS build
WORKDIR /src
COPY . .
RUN --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/src/target \
    cargo build --release && mkdir -p /out && cp target/release/app /out/app
FROM gcr.io/distroless/cc-debian12:nonroot
COPY --from=build /out/app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]
For static Rust (musl), use gcr.io/distroless/static.
B.3 Multi-Stage Distroless (Python)¶
FROM python:3.12-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --target=/install -r requirements.txt
COPY . .
FROM gcr.io/distroless/python3-debian12:nonroot
COPY --from=build /install /pkg
COPY --from=build /app /app
ENV PYTHONPATH=/pkg
USER nonroot:nonroot
ENTRYPOINT ["python", "/app/main.py"]
B.4 Multi-Arch Build with buildah¶
#!/usr/bin/env bash
set -euo pipefail
IMAGE=registry.local/myapp
TAG=v1.0.0
for arch in amd64 arm64; do
buildah build --arch=$arch --manifest $IMAGE:$TAG -t $IMAGE:$TAG-$arch .
done
buildah manifest push --all $IMAGE:$TAG docker://$IMAGE:$TAG
B.5 Reproducible Build¶
SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct)
buildah build \
--timestamp $SOURCE_DATE_EPOCH \
--pull-never \
--layers=false \
-t myapp:$TAG .
Avoid a bare `RUN apt-get update` in the Dockerfile; point apt at a frozen package mirror so the same commit always resolves to the same package versions.
B.6 Rootless Podman + Systemd¶
podman run --name myapp --rm -d ...
podman generate systemd --new --name myapp > ~/.config/systemd/user/myapp.service
systemctl --user daemon-reload
systemctl --user enable --now myapp.service
loginctl enable-linger
B.7 OCI Hooks¶
{
"hooks": {
"prestart": [
{"path": "/usr/local/bin/network-setup", "args": ["network-setup"], "timeout": 5}
],
"poststop": [
{"path": "/usr/local/bin/cleanup", "args": ["cleanup"]}
]
}
}
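Per the OCI runtime spec, each hook receives the container's state JSON (`id`, `pid`, `status`, `bundle`) on stdin. A minimal sketch of what a hook binary does with it (a hypothetical logging hook, not the `network-setup` binary above; in a real hook the sample would come from `sys.stdin`):

```python
# Minimal OCI-hook sketch: the runtime pipes the container's state JSON
# to the hook on stdin. A real prestart/createRuntime hook would use
# state["pid"] to enter the container's namespaces; here we just report.
import json

def describe(raw: str) -> str:
    state = json.loads(raw)  # fields defined by the OCI runtime spec
    return f"container {state['id']} (pid {state['pid']}) is {state['status']}"

sample = json.dumps({"ociVersion": "1.0.2", "id": "c1",
                     "status": "created", "pid": 4242,
                     "bundle": "/run/bundle"})
print(describe(sample))  # container c1 (pid 4242) is created
```

Note that newer spec revisions prefer `createRuntime`/`createContainer` over the older `prestart` name; the stdin contract is the same.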
B.8 CI/CD Skeleton (GitHub Actions)¶
name: build
on:
  push:
    branches: [main]
    tags: ['v*']
permissions:
  id-token: write   # required for cosign keyless
  contents: read
  packages: write
jobs:
  build-scan-sign:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: build
        uses: docker/build-push-action@v6
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          provenance: mode=max
          sbom: true
      - name: Scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/${{ github.repository }}:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1
      - uses: sigstore/cosign-installer@v3
      - name: Sign
        run: cosign sign --yes ghcr.io/${{ github.repository }}@${{ steps.build.outputs.digest }}
      - name: Attest SBOM
        # assumes an earlier step produced sbom.json (e.g. with syft)
        run: cosign attest --yes --type spdx --predicate sbom.json ghcr.io/${{ github.repository }}@${{ steps.build.outputs.digest }}
B.9 Debugging Recipes¶
- No shell in the image? `kubectl debug -it <pod> --image=busybox --target=<container>` adds an ephemeral debug container in the same namespaces.
- Inspect a layer? `skopeo inspect docker://image:tag` for the manifest; `skopeo copy docker://image:tag oci:./local:tag` to dump everything.
- What's running inside? From the host: `nsenter -t <pid> -a /bin/sh` enters all namespaces of the target.
- Why is it slow? `nsenter -t <pid> -a perf top` profiles inside the container.
- What syscalls is it making? `strace -f -p <host-pid>` works through namespaces.
Appendix C-Contributing to the Container Ecosystem¶
The container ecosystem is in a few major repos, all on GitHub, all with active maintainer communities.
C.1 The Project Map¶
| Project | Language | Scope | Difficulty | Notes |
|---|---|---|---|---|
| `opencontainers/runc` | Go | Reference OCI runtime | Hard | Slower review; correctness-first |
| `opencontainers/runtime-spec` | Markdown | Spec | Medium | Doc PRs welcome; behavior changes need TOB consensus |
| `opencontainers/image-spec` | Markdown | Spec | Medium | Same as above |
| `containers/podman` | Go | Daemonless container manager | Medium | Active, friendly |
| `containers/buildah` | Go | Image build library/CLI | Medium | Active, friendly |
| `containers/skopeo` | Go | Image transport tool | Easy–Medium | Smaller surface, good first patches |
| `containerd/containerd` | Go | Container daemon | Hard | Large; subsystem-specific reviewers |
| `cri-o/cri-o` | Go | Kubernetes CRI runtime | Medium | Friendly to newcomers |
| `containers/crun` | C | Fast OCI runtime | Hard | Single maintainer historically; high bar |
| `containers/youki` | Rust | Rust OCI runtime | Medium | Smaller, more approachable than runc |
| `sigstore/cosign` | Go | Image signing | Medium | Active, growing |
| `anchore/syft` & `anchore/grype` | Go | SBOM + vuln scan | Medium | Friendly |
| `aquasecurity/trivy` | Go | Vuln + config scan | Medium | High velocity |
| `google/go-containerregistry` | Go | OCI registry library | Easy–Medium | Underused gem; good starter |
C.2 First-Issue On-Ramps¶
Easy¶
- `skopeo`: documentation fixes, new transport options, bug reports with reproductions.
- `go-containerregistry`: examples, error-message improvements, new registry-mirror tests.
Medium¶
- `podman`: bug fixes for edge cases (look at the `kind/bug` + `good-first-issue` labels).
- `buildah`: feature-parity gaps with `docker build` / BuildKit features.
- `cri-o`: small CRI-protocol corrections.
- `syft`: new package-format detectors (e.g., a niche language ecosystem).
- `cosign`: documentation, integration examples, smaller bug fixes.
Hard¶
- `runc`: anything in `libcontainer/` is high-stakes and security-critical.
- `containerd`: snapshotters, the CRI plugin, runtime shims.
- `youki`: larger features, as the project still has gaps vs runc.
- `crun`: C, low-level, single-maintainer ergonomics.
C.3 The Workflow (typical GitHub project)¶
1. File or claim an issue. Read the contributing guide first; some projects require pre-discussion.
2. Fork, branch, code.
3. Sign off your commits. All container projects require DCO (`git commit -s`).
4. Run the test suite locally: `make test` or the per-project equivalent.
5. Open a PR. Reference the issue. Describe the change and the testing.
6. Address review. Most projects use squash-merge; rebase your branch on main as needed.
7. Merge. Maintainer approves; CI green; merged.
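The DCO sign-off step is worth seeing once in a throwaway repo (names and messages here are placeholders):

```shell
#!/usr/bin/env bash
# `git commit -s` appends a Signed-off-by trailer built from
# user.name/user.email -- the DCO attestation these projects require.
set -euo pipefail
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jane Hacker"
git config user.email "jane@example.com"
echo fix > patch.txt
git add patch.txt
git commit -qs -m "runtime: fix error path"
git log -1 --format=%B
# Last line of output: Signed-off-by: Jane Hacker <jane@example.com>
```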
Cycle time varies: skopeo / cosign / syft are fast (days). runc / containerd are slow (weeks).
C.4 The OCI Process (spec changes)¶
Spec changes are governed by the OCI Technical Oversight Board. The process:

1. Open an issue describing the proposed change.
2. Build consensus on the issue thread (this is the slowest step).
3. Open a PR with the spec change.
4. The TOB reviews; super-maintainers approve.
5. Merge; the spec is released on the next cadence.
This is not a fast path. Reserve it for genuine spec-shaped problems; everything else fits in a tool repo.
C.5 Calibration¶
A reasonable goal for a curriculum graduate:
- By end of week 23: a PR open against `skopeo`, `syft`, `cosign`, or another approachable repo (a doc fix or small bug fix is sufficient).
- By end of capstone: that PR merged.
- 6 months post-curriculum: a substantive contribution: a new transport in skopeo, a new package detector in syft, a new feature in podman.
The container ecosystem is genuinely welcoming to newcomers. The path from "user" to "contributor" is shorter than in most kernel-adjacent projects.
Capstone Projects-Three Tracks, One Choice¶
Pick one. The work performed here is what you describe in interviews.
Track 1-Mini-Docker (the curriculum's default)¶
Outcome: a from-scratch container runner in Go or Rust, demonstrating manual orchestration of namespaces, cgroups v2, OverlayFS, capabilities, seccomp, and a working subset of OCI runtime spec.
Functional spec¶
- Read an OCI runtime bundle (`config.json` + `rootfs/`).
- Lifecycle: `create`, `start`, `state`, `kill`, `delete`, plus `run` (create + start).
- Apply: PID/NET/MNT/UTS/IPC/USER/CGROUP namespaces, capability dropping, seccomp filter, cgroups-v2 controllers (memory/cpu/pids), `pivot_root`, masked/readonly paths.
- Spec compliance: passes a meaningful subset of the OCI runtime spec validator.
Non-functional spec¶
- Implemented in <2,500 lines of code (excluding tests).
- Memory footprint <20 MB resident.
- Container-start latency <100 ms on a baseline workload (within 2× of `runc`).
Architecture sketch¶
- Top-level CLI parses the lifecycle subcommand.
- Runtime orchestrator: state machine over the bundle's lifecycle.
- Init re-exec pattern (like `runc`'s "init"): the binary re-execs itself as the container's PID 1 supervisor, performs final mount + capability + seccomp setup, then execs the user command.
- State persistence in `/run/<runtime-name>/<id>/state.json`.
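The orchestrator's state machine reduces to a table of legal transitions. A sketch (operation and status names from the OCI runtime spec; Python for brevity, though the capstone itself is Go or Rust):

```python
# Which lifecycle operations are legal from which container status.
# Mirrors the OCI runtime spec: start only from "created"; delete only
# once the container process has stopped.
VALID_FROM = {
    "start":  {"created"},
    "kill":   {"created", "running"},
    "delete": {"stopped"},
}

def allowed(op: str, status: str) -> bool:
    return status in VALID_FROM.get(op, set())

print(allowed("start", "created"))   # True
print(allowed("delete", "running"))  # False: kill it first
```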
Test rigor¶
- Unit tests for: bundle parsing, namespace setup, cgroup writes, seccomp compilation.
- Integration tests: run real bundles (Alpine, BusyBox, a small Go binary) and assert behavior.
- Spec compliance test against (a subset of) `runtime-tools/validation`.
- Chaos: malformed configs, missing rootfs, OOM, exhausted PIDs.
Hardening pass¶
- Default seccomp profile equivalent to `containers/common/pkg/seccomp`.
- Default capability set: `CAP_NET_BIND_SERVICE` only.
- Read-only rootfs unless explicitly opted in.
- Rootless support (user namespace path tested).
Acceptance criteria¶
- Public repo, documented architecture.
- A README walkthrough: from `runc spec` to running `nginx` under your runtime.
- Integration tests in CI, passing.
- A short paper (3–5 pages) comparing your runtime to `runc`/`crun`/`youki`: what's similar, what's different, what you skipped and why.
Skills exercised¶
- All months, but heaviest on Months 1 (OCI specs), 2 (filesystems), 3 (runtimes), and 4 (security primitives).
Track 2-Image Scanning & Signing Service¶
Outcome: an HTTP service that ingests OCI image references, runs Syft + Grype + Trivy, attaches signed SBOMs and VEX statements, and exposes a policy-gated promotion API.
Functional spec¶
- `POST /scan {image}`: scan, return findings.
- `POST /promote {image, target}`: verify the image meets policy (signature, SBOM, vuln thresholds, SLSA provenance) before copying to a higher-trust registry.
- `GET /audit/{image}`: full triage report, including VEX state per finding.
- Policy as YAML; reload without restart.
- Plugin model: scanner backends and signature verifiers loaded at startup.
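The promotion decision at the core of `POST /promote` can be sketched as a pure function (hypothetical policy schema; field names are illustrative, not a fixed API):

```python
# Promotion gate: signed + SBOM present + no un-VEX'd findings above
# the policy's severity ceiling. Deterministic given its inputs.
SEVERITY = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def promotable(image: dict, policy: dict) -> bool:
    if policy.get("require_signature") and not image.get("signed"):
        return False
    if policy.get("require_sbom") and not image.get("sbom"):
        return False
    ceiling = SEVERITY[policy.get("max_severity", "medium")]
    return all(
        SEVERITY[f["severity"]] <= ceiling or f.get("vex") == "not_affected"
        for f in image.get("findings", [])
    )

image = {"signed": True, "sbom": True,
         "findings": [{"severity": "critical", "vex": "not_affected"},
                      {"severity": "medium"}]}
policy = {"require_signature": True, "require_sbom": True,
          "max_severity": "high"}
print(promotable(image, policy))  # True: the critical is VEX'd away
```

Keeping the decision a pure function over pre-fetched attestations is also what makes sub-second policy evaluation realistic.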
Non-functional spec¶
- 50 concurrent scans without degradation.
- Sub-second policy evaluation given pre-fetched attestations.
- Admission API compatible with Kubernetes ImagePolicyWebhook.
Architecture sketch¶
- Workers consuming a scan queue.
- `skopeo` for image fetching; `syft` and `trivy` shelled out to (or embedded as Go libraries).
- `cosign` Go SDK for signature verification.
- Postgres for findings storage; Prometheus for metrics; OTel for traces.
Test rigor¶
- Unit tests for policy evaluation.
- Integration tests against a local registry with known-good and known-bad images.
- Property tests: policy decisions are deterministic given inputs.
Hardening pass¶
- Service runs rootless in its own container.
- mTLS between worker and registries.
- Findings DB encrypted at rest.
Acceptance criteria¶
- Public repo, README with end-to-end demo.
- Demonstrate full flow: image with critical CVE → scan → signed VEX → policy decision → promotion gated correctly.
Skills exercised¶
- Months 5 (supply chain) heavily; Months 1, 4 supporting.
Track 3-Custom Runtime Fork¶
Outcome: a fork of runc (or youki) adding one substantial feature: gVisor-style sandbox, a custom seccomp generator, or eBPF-based per-container observability.
Suggested scopes¶
- `runc-trace`: a runc fork that, before `execve`, attaches an eBPF program tracing syscalls and emitting per-container telemetry to a userspace consumer. Useful for forensic environments.
- `runc-autoseccomp`: generates a tight seccomp profile during a "learning" run, then enforces it. Removes the manual seccomp-authoring burden.
- `runc-cap`: stricter capability defaults; introspects the rootfs and disables capability sets the binary doesn't appear to need.
Acceptance¶
- Forked from upstream at a tagged commit; documented merge plan to maintain rebase against upstream.
- The new feature is opt-in via `config.json` annotations or a CLI flag.
- Test coverage equivalent to upstream's for the touched code.
- A short blog post explaining the feature, the design, and the upstream-contribution plan.
Skills exercised¶
- All months. Track 3 is for the candidate who wants to contribute upstream eventually.
Cross-Track Requirements¶
- `container-baseline/` template (Appendix A) integrated.
- ADRs (≥3).
- `THREAT_MODEL.md`.
- Defense readiness: a 45-minute walkthrough.
Worked example - Week 14: Linux capabilities in a real Dockerfile¶
Companion to Container Internals → Month 04 → Week 14: Capabilities. The week explains the Linux capability model and the difference between root inside the container and root with everything. This page walks one Dockerfile through dropping capabilities one at a time so you can see which knobs do what.
Start from a normal container¶
# v0 - the default
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
curl iproute2 libcap2-bin && rm -rf /var/lib/apt/lists/*
CMD ["sleep", "infinity"]
Build and run:
$ docker build -t cap-demo .
$ docker run --rm -d --name c cap-demo
$ docker exec c capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,
cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
14 capabilities granted by default - the Docker default capability set. This is not "full root." A real root user on the host has ~40 capabilities (run capsh --print outside the container). Docker already drops ~26 of them by default, including the dangerous ones (CAP_SYS_ADMIN, CAP_SYS_MODULE, CAP_SYS_PTRACE).
But 14 is still too many. Let's see what each does and drop the ones we don't need.
What's actually in use¶
The CMD is `sleep infinity`; it needs no capabilities at all, and indeed it runs unchanged under `--cap-drop=ALL`.
But most real containers do something. Suppose we add a tiny HTTP server:
# v1 - a real workload
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 libcap2-bin && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY server.py .
CMD ["python3", "-m", "http.server", "8080"]
Built and run with `--cap-drop=ALL`, it still works: Python's HTTP server needs no capabilities because port 8080 is unprivileged (≥ 1024).
Try port 80:
$ docker run --rm -p 80:80 --cap-drop=ALL cap-demo python3 -m http.server 80
PermissionError: [Errno 13] Permission denied
Binding to ports below 1024 requires CAP_NET_BIND_SERVICE. Add it back, drop everything else:
$ docker run --rm -p 80:80 \
--cap-drop=ALL --cap-add=NET_BIND_SERVICE \
cap-demo python3 -m http.server 80
Works. We're now running with exactly one capability, the minimum to do the job.
Map the capabilities to attacks¶
This is the part most tutorials skip. Here's why each Docker-default capability is actually dangerous if you don't need it:
| Capability | What it allows | If an attacker gets RCE in the container |
|---|---|---|
| `CAP_CHOWN` | Change file ownership | Take over files in mounted volumes the container shouldn't own. |
| `CAP_DAC_OVERRIDE` | Bypass file read/write/execute permission checks | Read any file in mounted volumes, regardless of permissions. |
| `CAP_FOWNER` | Bypass permission checks on operations that normally require the file owner | Modify metadata on mounted files. |
| `CAP_FSETID` | Set setuid/setgid bits | Create privilege-escalation backdoors in writable mounted paths. |
| `CAP_KILL` | Send signals to any process | Kill other workloads sharing the same PID namespace. |
| `CAP_SETUID`/`CAP_SETGID` | Change process UID/GID | Pivot to other user contexts within the container. |
| `CAP_SETPCAP` | Change capability sets | Re-add dropped capabilities (when combined with namespaces). |
| `CAP_NET_BIND_SERVICE` | Bind to ports <1024 | Squat on a privileged port to MITM. |
| `CAP_NET_RAW` | Open raw sockets, send crafted packets | ARP spoofing, packet sniffing if not isolated by the network namespace. |
| `CAP_SYS_CHROOT` | Use `chroot()` | Limited escape vectors. |
| `CAP_MKNOD` | Create device files | Create `/dev/sda` and read the raw disk if the filesystem isn't sealed. |
| `CAP_AUDIT_WRITE` | Write to the kernel audit log | Spam audit logs to hide other activity. |
| `CAP_SETFCAP` | Set file capabilities | Persist privileges on dropped binaries. |
CAP_NET_RAW and CAP_MKNOD are the ones most production guides specifically call out to drop.
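The kernel reports these sets as hex bitmasks (`CapEff` etc. in `/proc/<pid>/status`). A small decoder sketch, covering only the low bits relevant here (bit numbers from `linux/capability.h`):

```python
# Decode a CapEff-style hex mask into capability names. Only the first
# 14 bits are listed -- enough to cover Docker's default set.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw",
]

def decode(mask_hex: str) -> list[str]:
    mask = int(mask_hex, 16)
    return [n for bit, n in enumerate(CAP_NAMES) if mask & (1 << bit)]

# NET_BIND_SERVICE is bit 10, so a one-capability container's
# effective set decodes to exactly:
print(decode("0000000000000400"))  # ['cap_net_bind_service']
```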
The drop-and-add pattern¶
Production-grade Dockerfile + run command:
FROM debian:bookworm-slim
RUN useradd -r -u 10001 app
USER 10001
COPY --chown=10001:10001 server.py /app/server.py
CMD ["python3", "/app/server.py"]
$ docker run --rm \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
--read-only \
--tmpfs /tmp \
--security-opt=no-new-privileges \
-p 80:80 \
your-image
That's the minimum-privilege starting point. Walk the flags:
- `--cap-drop=ALL --cap-add=NET_BIND_SERVICE`: exactly one capability.
- `--read-only`: the root filesystem is read-only; defeats most persistence.
- `--tmpfs /tmp`: gives the app a writable scratch space (it needs somewhere to write).
- `--security-opt=no-new-privileges`: sets the `NoNewPrivs` bit; even setuid binaries can't gain capabilities now.
- `USER 10001`: non-root user inside the container. Capabilities are bounded by both the kernel ruleset and the UID; this defense-in-depth matters.

One subtlety: when the process runs as a non-root `USER`, `--cap-add` keeps the capability in the bounding set, but the process won't have it in its effective set unless the binary carries file capabilities (`setcap 'cap_net_bind_service=+ep'`) or the kernel's `net.ipv4.ip_unprivileged_port_start` is lowered.
The trap¶
Most production teams set `--cap-drop=ALL` for "their" containers and then leave third-party sidecars (logging agents, service-mesh proxies) with default capabilities. Those containers are just as exploitable and often expose a larger attack surface (network exposure, mounted secrets). Audit them too.
The other trap: capability dropping is enforced by the same kernel the container talks to, so it does not defend against container-escape vulnerabilities (e.g. runc CVEs). For those you still want additional barriers: a seccomp profile (next week), AppArmor/SELinux, or gVisor/Kata for high-isolation needs.
Exercise¶
- Take any container image you currently run. Inspect what capabilities it requests (`docker inspect | jq '.[].HostConfig.CapAdd'`) and what it actually uses (`grep Cap /proc/<pid>/status` from inside).
- Write the smallest possible `--cap-drop`/`--cap-add` set that lets it still work. Document what each kept capability is for.
- Repeat for a Kubernetes Pod. The same fields live in `securityContext.capabilities`, per container.
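The `grep` from step 1 can also be done programmatically. A sketch that reads this process's own capability sets (Linux only; run it inside the container to see the reduced masks):

```python
# Parse the Cap* lines from a /proc/<pid>/status file into int masks.
def read_caps(path: str = "/proc/self/status") -> dict[str, int]:
    caps = {}
    with open(path) as f:
        for line in f:
            if line.startswith("Cap"):
                key, _, val = line.partition(":")
                caps[key] = int(val.strip(), 16)
    return caps

print(sorted(read_caps()))  # ['CapAmb', 'CapBnd', 'CapEff', 'CapInh', 'CapPrm']
```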
Related reading¶
- The main Week 14 chapter covers the underlying kernel model.
- The Week 15 chapter on seccomp is the syscall-level companion to capability dropping.
- Kubernetes Mastery - security context covers applying these in Pod specs.
- Glossary: Capability, Seccomp, NoNewPrivs in the main glossary.