Container Internals¶
OCI, filesystems, runtimes, security, supply chain.
Container Internals & Runtimes-A 24-Week Mastery Roadmap¶
Authoring lens: Senior Container Architect.
Target outcome: A graduate of this curriculum can (a) build, run, and inspect containers without a Docker daemon-using runc, skopeo, buildah, and crun directly, (b) reason from OCI specs to wire-level container behavior, (c) ship hardened images with reproducible builds, SBOMs, and signed provenance, and (d) implement a "mini-Docker" demonstrating manual orchestration of namespaces, cgroups, and rootfs.
This is not "Docker in a week." It assumes the reader has used containers and is ready to read the OCI specs and runc source as primary literature.
Repository Layout¶
| File | Purpose |
|---|---|
| `00_PRELUDE_AND_PHILOSOPHY.md` | What containers actually are (and aren't); the shape of the OCI ecosystem. |
| `01_MONTH_OCI_FOUNDATIONS.md` | Weeks 1–4. OCI image + runtime specs, runc, crun, skopeo. |
| `02_MONTH_FILESYSTEMS_AND_BUILDS.md` | Weeks 5–8. OverlayFS, image layers, buildah, multi-stage, distroless. |
| `03_MONTH_RUNTIMES_AND_DAEMONS.md` | Weeks 9–12. containerd, CRI-O, podman, the no-daemon model, rootless. |
| `04_MONTH_SECURITY.md` | Weeks 13–16. Capabilities, seccomp, AppArmor/SELinux for containers, user namespaces. |
| `05_MONTH_SUPPLY_CHAIN.md` | Weeks 17–20. SBOM (Syft), vuln scanning (Grype/Trivy), signing (cosign), SLSA. |
| `06_MONTH_BUILD_YOUR_OWN.md` | Weeks 21–24. Mini-Docker capstone: Go or Rust implementation. |
| `APPENDIX_A_HARDENING.md` | Image hardening, runtime hardening, gVisor/Kata, rootless patterns. |
| `APPENDIX_B_REFERENCE_PATTERNS.md` | Common image patterns, multi-arch builds, debugging, CI/CD recipes. |
| `APPENDIX_C_CONTRIBUTING.md` | Contribution paths to runc, containerd, podman, buildah. |
| `CAPSTONE_PROJECTS.md` | Three tracks: mini-Docker, image scanning service, runtime fork. |
How Each Week Is Structured¶
- Conceptual Core-the why, with a mental model.
- Mechanical Detail-the how, down to spec section and source location.
- Lab-a hands-on exercise.
- Hardening Drill-a security-relevant micro-task that compounds.
- Production Readiness Slice-a CI/CD, registry, signing, or scanning task that builds a publishable template.
Each week is sized for ~12–16 focused hours.
Progression Strategy¶
```
OCI Foundations ──► Filesystems & Builds ──► Runtimes & Daemons
       │                     │                        │
       └──────────┬──────────┴────────────────────────┘
                  ▼
               Security
                  │
                  ▼
             Supply Chain
                  │
                  ▼
            Build Your Own
```
Prerequisites¶
- Comfortable on a Linux command line.
- Familiar with namespaces and cgroups at a basic level (see the Linux curriculum for the deep version).
- Reading-comfortable in C, Go, or Rust-the capstone language choice depends on this.
Capstone Tracks (pick one in Month 6)¶
- Mini-Docker-a from-scratch container runner in Go or Rust implementing namespaces, cgroups, OverlayFS, and a small subset of OCI spec.
- Image Scanning & Signing Service-an HTTP service that ingests images, runs Syft + Grype + Trivy, attaches signed SBOMs, gates promotion via cosign-based policy.
- Custom Runtime-fork `runc` (or write a `crun`-equivalent) adding one feature: a gVisor-style sandbox, a custom seccomp generator, or eBPF-based observability.
Details in CAPSTONE_PROJECTS.md.
Prelude-What Containers Actually Are¶
Sit with this document for an evening before week 1.
1. There Is No Such Thing As a "Container"¶
The kernel has no concept of "container." There is no `struct container` in `/proc`. What people call a container is a bundle of kernel features applied together to a process:
- One or more namespaces (PID, NET, MNT, UTS, IPC, USER, CGROUP) for isolation.
- One or more cgroups v2 for resource limits.
- A rootfs mounted as the process's `/`, usually via `pivot_root` and an OverlayFS stack.
- A seccomp filter restricting syscalls.
- An LSM label (SELinux/AppArmor) restricting object access.
- A capabilities mask restricting privilege.
A "container runtime" is a program that arranges these things from a specification (OCI runtime config), then `execve`s the user's command. That's it. Docker is not the OS. Containers are not VMs. There is no hypervisor-equivalent.
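A minimal illustration of that claim, using nothing but `unshare(1)` from util-linux-no runtime, no image (a sketch; the rootless `--map-root-user` variant assumes user namespaces are enabled on your kernel):

```shell
# A "container" with no container runtime: kernel features applied to a process.
# A user namespace maps your UID to 0, a UTS namespace isolates the hostname.
unshare --user --map-root-user --uts sh -c '
  hostname minicontainer   # visible only inside this namespace
  hostname                 # prints: minicontainer
  id -u                    # prints: 0 (mapped, not real root)
'
hostname                   # the host hostname is untouched
```

A real runtime does the same thing with more namespaces, cgroups, a pivoted rootfs, seccomp, and capabilities-but the mechanism is identical.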
If you internalize this, the rest of the curriculum is bookkeeping.
2. The OCI Layer Cake¶
The Open Container Initiative defines three specifications that everything in the ecosystem implements:
- Image Spec-what a container image is: a manifest, a config, a stack of layer tarballs, all addressable by content (SHA-256).
- Runtime Spec-what a runtime configuration is: a `config.json` + a rootfs directory.
- Distribution Spec-what a registry is: an HTTP API for pushing/pulling content-addressed blobs and manifests.
docker pull (image + distribution), docker run (runtime), docker build (image), docker push (distribution)-all four operations are OCI-spec'd. Once you can do them with runc, skopeo, and buildah, you understand the ecosystem.
3. The Tooling Map¶
| Concern | OCI-spec tool | Daemon-based equivalent |
|---|---|---|
| Build images | `buildah` | `docker build` |
| Pull / push / inspect images | `skopeo` | `docker pull/push/inspect` |
| Run containers (low-level) | `runc` / `crun` / `youki` | hidden under `dockerd` |
| Run containers (high-level) | `podman`, `nerdctl` | `docker` CLI |
| Container daemon | `containerd`, CRI-O | `dockerd` |
By month 4 you should never type docker again for any task in this curriculum, except to demonstrate equivalence with the daemon-based world.
4. Cost Model¶
A working container engineer reasons along five axes:
| Axis | Question |
|---|---|
| Image | What's in this image? What's its provenance, vulnerability surface, layer count, total size? |
| Runtime | What namespaces, cgroups, seccomp, capabilities, LSM are applied? |
| Filesystem | What's the storage driver (overlay2, fuse-overlayfs, btrfs)? Is the layered FS the bottleneck? |
| Network | What CNI? Bridge, macvlan, host-mode? What's the per-packet overhead? |
| Supply chain | Is it signed? Is the SBOM accurate? What's the SLSA provenance level? |
Beginner courses teach axis 1 only.
5. The Reading List¶
Primary
- The OCI specs themselves (opencontainers/image-spec, opencontainers/runtime-spec, opencontainers/distribution-spec). Each is short-read all three before week 1 ends.
- runc source (opencontainers/runc), particularly libcontainer/.
- containerd architecture docs (containerd/containerd/docs/).
- buildah and podman documentation.
- Container Security (Liz Rice). Best concise text on the security model.
Secondary
- Linux in Action (David Clinton)-chapters 8–10 if you want a softer on-ramp.
- The CNCF Cloud Native Glossary-terminology calibration.
- Aleksa Sarai's blog (cyphar.com)-runc maintainer; deep posts on rootless, user namespaces.
Adjacent (you must know)
- The Linux curriculum's namespaces and cgroups chapters. If you skip them, this curriculum will not stick.
6. Curriculum Philosophy¶
- Spec first, tool second. Whenever a behavior surprises you, the OCI spec is the source of truth. `docker run` with no flags hides ~50 default values; `runc` exposes them.
- Daemonless by default. All weekly labs target the daemonless toolchain (`buildah`, `skopeo`, `podman`, `runc`). Learn the canonical primitives; the daemon-based version is an ergonomic skin on top.
- Rootless by default once feasible. Modern Linux supports rootless containers via user namespaces. Practice it from week 9 onward.
7. What Containers Are Not For¶
- Hard isolation. A container is a process with namespaces. Kernel exploits cross containers. For untrusted multi-tenant code, use a VM-class isolation layer (gVisor, Kata, Firecracker)-covered in `APPENDIX_A`.
- Stateful systems with strict durability. Volume management adds complexity; production databases benefit from running outside containers (or with mature operators in K8s; see the Kubernetes curriculum).
- GUI applications. Possible (X11/Wayland forwarding) but rarely the right tool.
You are now ready for Week 1. Open 01_MONTH_OCI_FOUNDATIONS.md.
Month 1-OCI Foundations: Specs, runc, skopeo, crun¶
Goal: by the end of week 4 you can (a) hand-author an OCI runtime config.json and run a container with runc directly, (b) push, pull, and copy images between registries with skopeo without ever touching a daemon, (c) read an image manifest, layer hashes, and config blob, and (d) explain the difference between runc and crun and pick one for a workload.
Weeks¶
- Week 1 - The OCI Image Spec
- Week 2 - The OCI Runtime Spec, `runc`, and `crun`
- Week 3 - `skopeo` Deep Dive: Multi-Arch, Signing, Sync
- Week 4 - Image Internals: Manifest Lists, Index, Annotations, Sparse Pulls
Week 1 - The OCI Image Spec¶
1.1 Conceptual Core¶
- An OCI image is content-addressed: every blob (layer, config, manifest) is named by `sha256:<digest>`. Identity = content. Immutable.
- Top-level: a manifest lists the config and layers for one platform. An index lists multiple manifests for multi-platform images (`linux/amd64`, `linux/arm64`, etc.).
- Layers are tar archives representing filesystem changesets, often gzip-compressed (`application/vnd.oci.image.layer.v1.tar+gzip`).
- Configs are JSON documents describing entrypoint, env, working dir, exposed ports, volumes, labels.
1.2 Mechanical Detail¶
- The manifest schema (`image-spec/specs-go/v1/manifest.go` in the spec repo). Keys: `mediaType`, `schemaVersion`, `config` (descriptor), `layers` ([]descriptor), `annotations`.
- Descriptor structure: `mediaType`, `digest`, `size`, optional `urls` and `annotations`.
- The OCI layout on local disk: a directory with `oci-layout`, `index.json`, and `blobs/sha256/<digest>` files. Use `skopeo copy docker://nginx:latest oci:./nginx-layout:latest` and inspect.
- Distinguish the OCI mediaType from the older Docker v2.2 mediaType-they're nearly isomorphic but not identical. `skopeo` and modern registries handle both.
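The content-addressing rule is easy to verify by hand. A sketch (directory names follow the OCI layout convention; the fabricated config blob is illustrative, and a real layout also needs `oci-layout` and `index.json`):

```shell
# Every blob in an OCI layout is stored under its own SHA-256 digest.
mkdir -p layout/blobs/sha256
printf '{"architecture":"amd64","os":"linux"}' > config.json
digest=$(sha256sum config.json | cut -d' ' -f1)
mv config.json "layout/blobs/sha256/${digest}"

# Identity = content: re-hashing the blob reproduces its filename.
test "$(sha256sum "layout/blobs/sha256/${digest}" | cut -d' ' -f1)" = "$digest" \
  && echo "content-addressed: OK"
```

This is the whole integrity model: any tool that pulls a blob can (and must) verify it the same way.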
1.3 Lab-"An Image Without Docker"¶
- `skopeo copy docker://alpine:3.19 oci:./alpine-layout:3.19`. Inspect the layout. Read `index.json`, the manifest blob, the config blob.
- Find a layer blob, decompress, list its contents (`tar tzf <blob>`).
- Compute one of the layer digests yourself (`sha256sum`) and verify.
- Modify the config (e.g., change the entrypoint) by writing a new config blob, generating a new manifest, and updating `index.json`. Verify with `skopeo inspect oci:./alpine-layout:3.19`.
1.4 Hardening Drill¶
- Read CVE history of registry-side spec misinterpretations (e.g., the 2018 layer-extraction symlink attacks). Internalize that any tool processing untrusted images must validate paths during extraction.
1.5 Production Readiness Slice¶
- Spin up a local registry: `docker run -d --rm -p 5000:5000 registry:2` (or, true to the spirit of the curriculum, run it under `podman`). `skopeo copy oci:./alpine-layout:3.19 docker://localhost:5000/alpine:3.19`. You now have a registry you control.
Week 2 - The OCI Runtime Spec, runc, and crun¶
2.1 Conceptual Core¶
- A runtime bundle = a directory containing `config.json` (the runtime spec) + `rootfs/` (the filesystem to chroot/pivot_root into).
- `runc create <id>` reads `config.json`, sets up namespaces, cgroups, mounts, seccomp, capabilities, then waits. `runc start <id>` runs the configured command. `runc state <id>` shows status. `runc kill <id> SIGTERM` signals. `runc delete <id>` cleans up.
- Three production runtimes implement the OCI runtime spec: `runc` (Go, the reference; what Docker / containerd use by default), `crun` (C, faster startup, lower memory, default in Podman on RHEL/Fedora), and `youki` (Rust, gaining ground; the primary Rust implementation).
2.2 Mechanical Detail¶
- `config.json` schema (`runtime-spec/config.md`). Major sections:
  - `process` - args, env, user, capabilities, rlimits.
  - `root` - path to rootfs, `readonly` flag.
  - `mounts` - the list of mounts.
  - `linux.namespaces`, `linux.uidMappings`, `linux.gidMappings` - isolation.
  - `linux.resources` - cgroups settings.
  - `linux.seccomp` - the full seccomp filter.
  - `linux.maskedPaths`, `linux.readonlyPaths` - host-leakage hardening.
  - `hooks` - pre/post container lifecycle.
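A pared-down sketch of those sections (values are illustrative only-real `runc spec` output carries many more defaults; `jq` is assumed for the sanity check):

```shell
# A minimal, hand-readable config.json showing the major spec sections.
cat > config.json <<'EOF'
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["sh"],
    "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
    "user": { "uid": 0, "gid": 0 }
  },
  "root": { "path": "rootfs", "readonly": true },
  "linux": {
    "namespaces": [ { "type": "pid" }, { "type": "mount" }, { "type": "uts" } ],
    "maskedPaths": [ "/proc/kcore" ],
    "readonlyPaths": [ "/proc/sys" ]
  }
}
EOF
# Quick structural sanity check before handing it to a runtime.
jq -e '.root.readonly and (.linux.namespaces | length) == 3' config.json
```

Diffing a file like this against full `runc spec` output is the fastest way to learn which defaults the daemon-based world hides.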
2.3 Lab-"Run a Container Without Docker"¶
- Generate a default config: `runc spec` produces `config.json`.
- Build a rootfs: `mkdir rootfs && skopeo copy docker://alpine:3.19 oci:./alpine && umoci unpack --image ./alpine:3.19 ./bundle` (`umoci` gives you both rootfs + config in one step). Or do it manually.
- Run: `sudo runc run mycontainer`. You're inside the container.
- Modify the config to: drop all capabilities except `CAP_NET_BIND_SERVICE`, set a memory limit of 64M, mask `/proc/sys`. Re-run; verify with `cat /proc/self/status | grep Cap` and pressure tests.
- Repeat with `crun`. Time the startup difference (`time runc run` vs `time crun run`)-`crun` is typically 2–5× faster.
2.4 Hardening Drill¶
- Read the default seccomp profile in `runc`'s `libcontainer/seccomp/seccomp_default.go` (the equivalent profile is shipped with Docker as `default.json`). Note which syscalls it blocks. Review the spec's `linux.seccomp` schema and write a tighter custom profile for a specific service.
2.5 Production Readiness Slice¶
- Add an automated CI step that lints any custom `config.json` against the OCI spec schema. Use `runc spec --rootless` and study the differences vs the privileged config-this is the foundation for Month 3's rootless work.
Week 3 - skopeo Deep Dive: Multi-Arch, Signing, Sync¶
3.1 Conceptual Core¶
- `skopeo` is the image-manipulation tool that doesn't require a daemon or storage backend. It can copy between any of: `docker://` (registry), `oci:` (local OCI layout), `dir:` (raw blob dir), `containers-storage:` (local CRI-style store), `oci-archive:`, `docker-archive:`.
- `skopeo` is also how you do registry maintenance: mirror, sync, prune, and inspect manifests without pulling layers.
3.2 Mechanical Detail¶
- `skopeo inspect` - show a manifest without downloading layers. `--raw` gives the manifest as bytes, `--config` the config blob, `--format` takes Go templates for scripting.
- `skopeo copy --all` - for multi-platform images, copy the entire index (all platforms). Without `--all`, `skopeo` selects the running platform's manifest.
- `skopeo sync` - mirror a registry/repo to another registry or to a local OCI dir. The reference tool for air-gapped ops.
- `skopeo login`, `skopeo logout` - credentials in `${XDG_RUNTIME_DIR}/containers/auth.json`.
3.3 Lab-"A Daemonless Image Pipeline"¶
- Pull a multi-arch image as an OCI index. Inspect each per-platform manifest.
- Write a script that, given an image reference, prints a table of platforms, layer counts, total compressed/uncompressed sizes, and labels.
- Use `skopeo sync` to mirror three images into your local registry. Verify by pulling the mirrored versions.
- Compare `skopeo copy` of a 1-GB image with and without `--multi-arch index-only` on the destination side.
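The table-printing exercise can start from the raw index that `skopeo inspect --raw docker://<image>` emits. A sketch of the parsing step, run here against a fabricated two-platform index so it is self-contained:

```shell
# Stand-in for `skopeo inspect --raw <ref> > index.json` (fabricated data).
cat > index.json <<'EOF'
{
  "schemaVersion": 2,
  "manifests": [
    { "digest": "sha256:1111", "size": 529,
      "platform": { "os": "linux", "architecture": "amd64" } },
    { "digest": "sha256:2222", "size": 529,
      "platform": { "os": "linux", "architecture": "arm64", "variant": "v8" } }
  ]
}
EOF
# One row per platform: os, arch, digest.
jq -r '.manifests[] | [.platform.os, .platform.architecture, .digest] | @tsv' index.json
```

Layer counts and sizes come from a second `skopeo inspect` per-platform pass; the jq pattern is the same.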
3.4 Hardening Drill¶
- Configure `skopeo` to verify signatures on copy (via a `policy.json`, passed with `--policy`). The default policy is `insecureAcceptAnything`-change this in production.
3.5 Production Readiness Slice¶
- Build a CI job: on every release tag, copy the image from a "staging" registry path to a "production" path only after a Cosign signature is verified. Implementation in week 19 (`cosign verify`).
Week 4 - Image Internals: Manifest Lists, Index, Annotations, Sparse Pulls¶
4.1 Conceptual Core¶
- A manifest list / index points to per-platform manifests. The runtime selects the matching one. This is how `docker pull nginx` works on both ARM and x86.
- Annotations are a key/value sidecar on manifests, configs, and layers. Standardized keys: `org.opencontainers.image.source`, `.revision`, `.created`, `.licenses`, `.description`. Use them; downstream tools read them.
- Sparse / lazy pulls-eStargz and zstd:chunked formats let containers start before all layers are fully transferred. `containerd` snapshotters (`stargz-snapshotter`) implement this.
4.2 Mechanical Detail¶
- The index spec is in `image-spec/image-index.md`. Key field: `manifests[]` with `platform` descriptors (`os`, `architecture`, optional `variant`, `os.version`).
- Annotations propagate through: build → manifest → registry → consumer. `buildah` and `podman` set them automatically when given the right flags.
buildahandpodmanset them automatically when given the right flags. - eStargz: a TAR-compatible format with a footer containing per-file offsets. The snapshotter pulls only the metadata initially and fetches files on access.
4.3 Lab-"Build a Multi-Arch Image By Hand"¶
- Build an image for `linux/amd64` and `linux/arm64` separately (use `buildah --arch=` or `docker buildx`).
- Use `skopeo` to assemble a manifest list pointing to both.
- Push to your local registry.
- Pull from each architecture; verify the right manifest is selected.
- Add OCI annotations (`source`, `revision`, `created`); verify they survive the pipeline.
4.4 Hardening Drill¶
- Annotate every built image with provenance: source repo URL + commit SHA. This is the precursor to SLSA (week 19).
4.5 Production Readiness Slice¶
- Configure `containerd` (week 9) to use the `stargz-snapshotter`; measure container startup time for a large image (1+ GB) with vs without lazy pulling.
Month 1 Capstone Deliverable¶
An `oci-foundations/` workspace:
1. `runc-bundle/` - week 2's hand-rolled runtime bundle with hardening.
2. `daemonless-pipeline/` - `skopeo`-based image-handling scripts.
3. `multiarch-build/` - week 4's hand-assembled multi-arch image with annotations.
4. A `RUNBOOK.md` covering: registry setup, image inspection, signature verification flow.
Month 2-Filesystems and Image Builds¶
Goal: by the end of week 8 you can (a) construct an OverlayFS filesystem by hand and explain copy-up, (b) build images with buildah directly (no Dockerfile required), (c) author multi-stage Dockerfiles that produce minimal distroless images, and (d) reason about layer caching, build context, and reproducible builds.
Weeks¶
- Week 5 - OverlayFS and Storage Drivers
- Week 6 - `buildah`: Building Images Without Dockerfiles
- Week 7 - Multi-Stage Builds, Distroless, Minimal Images
- Week 8 - Layer Caching, Build Context, Reproducibility
Week 5 - OverlayFS and Storage Drivers¶
5.1 Conceptual Core¶
- A container's rootfs is built by stacking image layers via a union filesystem. The dominant driver on Linux is OverlayFS (in tree since 3.18). Layers are read-only lower dirs; the container's writable space is the upper dir; the visible merged view is the mount target.
- On any write to a file in a lower layer, the file is copied up to the upper layer first (copy-on-write). This is what makes layered images fast to start but slow to write large files modified from a lower layer.
- Other drivers: `aufs` (legacy), `btrfs` (snapshots), `zfs` (heavy), `devicemapper` (deprecated), `vfs` (no CoW; ultra-portable, ultra-slow).
5.2 Mechanical Detail¶
- `mount -t overlay overlay -o lowerdir=A:B:C,upperdir=U,workdir=W /merged`. `workdir` is required for OverlayFS bookkeeping; it must be on the same filesystem as `upperdir`.
- Whiteouts: a file deleted in the upper relative to the lower is represented by a char 0,0 device file. Listing/diff operations interpret these.
- Opaque directories: the `trusted.overlay.opaque="y"` xattr marks a dir whose lower contents should be hidden.
- The `containerd` snapshotter abstraction: each driver implements a `Snapshotter` interface; the snapshotter manages active and committed snapshots, writable layers, etc.
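The mount invocation above can be run end-to-end as follows (a sketch: needs root, or a user namespace on kernels that allow unprivileged overlay, roughly 5.11+; paths are illustrative):

```shell
mkdir -p lower upper work merged
echo "from lower" > lower/a.txt

sudo mount -t overlay overlay \
  -o lowerdir=lower,upperdir=upper,workdir=work merged

cat merged/a.txt        # served from the read-only lower dir
echo "edited" > merged/a.txt
ls upper/               # a.txt appeared here: copy-up on first write
rm merged/a.txt
ls -l upper/a.txt       # now a character 0,0 device: the whiteout
sudo umount merged
```

Every image layer your containers use is exactly a `lowerdir` entry in a stack like this.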
5.3 Lab-"OverlayFS By Hand"¶
- Create three lower dirs with different files. Mount as overlay. Verify merged view.
- Modify a file from the lower; observe copy-up in the upper.
- Delete a lower file from the merged view; observe the whiteout in the upper.
- Reproduce a "container layer": treat your container's tarball-extracted contents as a lower; create a fresh upper; mount; modify; tar up the upper to produce a new layer.
5.4 Hardening Drill¶
- Audit OverlayFS CVEs. The class of "container escape via crafted file in lower layer" has been exploited. Mitigations: rootless mode + user namespaces, or a sandbox layer (gVisor, Kata).
5.5 Production Readiness Slice¶
- Compare OverlayFS, fuse-overlayfs (rootless default), and the kernel's native rootless overlay (since 5.13) for a representative workload. Measure layer-creation, file-write, and read-many performance.
Week 6 - buildah: Building Images Without Dockerfiles¶
6.1 Conceptual Core¶
- A Dockerfile is one DSL for building images. It is not the only one. `buildah` exposes the underlying primitives: create a working container, run commands in it, copy files, set config, commit to an image.
- CI systems can build images without a privileged daemon.
- You can build images programmatically (e.g., from a Go program).
- You can construct images with stricter properties (provenance, reproducibility) than Dockerfiles natively allow.
6.2 Mechanical Detail¶
- The `buildah` API has Dockerfile-equivalent commands plus richer ones:
  - `buildah from` - start a working container from a base.
  - `buildah run` - run a command inside.
  - `buildah copy` - copy files in.
  - `buildah config --entrypoint='["..."]'` - set config.
  - `buildah commit` - produce an image.
  - `buildah unshare` - enter a user namespace; lets you operate on storage as "root" without being host-root. Foundation for rootless builds.
- `buildah build` (alias `buildah bud`) reads a Dockerfile and uses the same primitives.
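The primitives compose into a script like the following (a sketch, assuming `buildah` is installed; the base image, label, and paths are illustrative):

```shell
#!/bin/sh -e
# "Image as a shell script": one buildah primitive per Dockerfile verb.
ctr=$(buildah from docker.io/library/alpine:3.19)         # ~ FROM
buildah run "$ctr" -- apk add --no-cache ca-certificates  # ~ RUN
buildah copy "$ctr" ./app /usr/local/bin/app              # ~ COPY
buildah config --entrypoint '["/usr/local/bin/app"]' \
  --label org.opencontainers.image.source=https://example.com/repo "$ctr"
buildah commit "$ctr" localhost/app:dev                   # produce the image
buildah rm "$ctr"                                         # discard the working container
```

Because this is ordinary shell, you can loop, branch, and template image construction in ways a Dockerfile cannot.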
6.3 Lab-"Image as a Shell Script"¶
- Write a shell script that uses `buildah from`, `run`, `copy`, `config`, `commit` to produce a small Go-binary-on-`alpine` image. No Dockerfile.
- Add reproducibility flags: `--source-date-epoch`, `--timestamp`, the `SOURCE_DATE_EPOCH` env var. Build twice; verify hashes match.
- Build the same image with `buildah bud -f Dockerfile`. Compare hashes-they should be identical when both are reproducible.
6.4 Hardening Drill¶
- Build everything as a non-root user (`buildah unshare`, rootless mode). Confirm the storage location is in `~/.local/share/containers/`, not `/var/lib/containers`.
6.5 Production Readiness Slice¶
- Wire `buildah` into a CI job that targets `linux/amd64` and `linux/arm64` from the same x86 runner using `qemu-user-static`. Document the multi-arch build contract.
Week 7 - Multi-Stage Builds, Distroless, Minimal Images¶
7.1 Conceptual Core¶
- The point of a build image is to not be the runtime image. A modern image pipeline:
- Stage 1 (build): full build environment (compiler, headers, dev tools).
- Stage 2 (test): the build artifacts plus test runners.
- Stage 3 (runtime): a minimal image with just the artifact.
- Distroless images (Google's `gcr.io/distroless/*`) contain only the runtime dependencies-no shell, no package manager, no `cat`. Smaller attack surface, smaller image, faster startup.
- Static binaries (Go with `CGO_ENABLED=0`, Rust with musl, Java GraalVM native-image) can run on `scratch` (the empty base image): typically <20 MB total.
7.2 Mechanical Detail¶
- Multi-stage Dockerfile: each `FROM` opens a new stage; `COPY --from=<stage>` pulls artifacts forward; only the final stage ends up in the shipped image.
- Distroless variants: `static`, `base`, `cc`, `python3`, `java`, etc. Pick the smallest that works.
- The `nonroot` tag ensures the default user is UID 65532-never root.
- The `:debug` tag adds busybox for emergency debugging-use only for one-off triage in dev.
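The multi-stage pattern above can be sketched as an illustrative Dockerfile (base image tags, stage names, and paths are examples, not prescribed by this curriculum):

```dockerfile
# Stage 1: full toolchain; none of it ships.
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app .

# Final stage: only the static binary, running as UID 65532.
FROM gcr.io/distroless/static:nonroot
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
```

Everything in the `build` stage-compiler, module cache, source-is absent from the final image's layers.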
7.3 Lab-"Three Image Diet"¶
Take a Go (or Rust, or Python) service and produce three images:
1. Naive: FROM ubuntu, build inline. Measure size.
2. Distroless: multi-stage with gcr.io/distroless/static. Measure size.
3. Scratch: static build, FROM scratch. Measure size.
Document the size delta and any operational tradeoffs (e.g., scratch has no ca-certificates - `tls.Config` failures unless you `COPY --from=alpine /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/`).
7.4 Hardening Drill¶
- Run `docker scout cves` (or `trivy image`) on each variant; observe that scratch and distroless have ~zero CVEs from the base, while ubuntu/alpine have many. The CVEs aren't gone-the attack surface is reduced. Internalize the difference.
7.5 Production Readiness Slice¶
- Configure your CI to fail builds whose image grows by >5% vs the baseline. This forces conscious deltas; surprise growth is often a leaked dev tool.
Week 8 - Layer Caching, Build Context, Reproducibility¶
8.1 Conceptual Core¶
- Image-build performance is dominated by layer cache hit rate. A miss invalidates every subsequent layer; a hit reuses upstream work.
- The cache key is determined by: the parent layer's digest + the operation (the exact command, copy contents, build args). Order operations from least-frequently-changing to most-frequently-changing.
- Reproducible builds = byte-identical outputs from identical inputs. Requires: pinned base images (by digest, not tag), `SOURCE_DATE_EPOCH`, deterministic file ordering (`tar --sort=name`), no embedded build-host info.
8.2 Mechanical Detail¶
- `COPY` ordering: copy `go.mod`/`package.json`/`Cargo.lock` first, run dep install (cached on subsequent unrelated changes), then copy source. Saves dep-install time on every code-only change.
- BuildKit cache mounts (`RUN --mount=type=cache,target=/root/.cache/go-build`): persist a build directory across image builds, even when the surrounding layer is invalidated. Massive speedup for compiled-language workflows.
- `.dockerignore`: every byte sent to the daemon contributes to context size and may invalidate caches. Pattern after `.gitignore`.
- Pin base images by digest: `FROM golang:1.22@sha256:abc...`. Tag-based pins drift silently.
8.3 Lab-"Cache and Reproducibility"¶
- Take a non-trivial image; measure clean-build time and incremental-build time (single source change). Reorder Dockerfile to maximize cache hits; re-measure.
- Enable BuildKit cache mounts; measure again.
- Build the same image on two machines with `SOURCE_DATE_EPOCH` set; verify the digests match.
8.4 Hardening Drill¶
- Pin every base image by digest. Document a refresh policy (e.g., monthly digest-bump PRs reviewed for security advisories).
8.5 Production Readiness Slice¶
- Add a CI job that builds the image twice in fresh runners and asserts `digest_run1 == digest_run2`. Reproducibility regressions become P1 issues.
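Sketched as a CI step (assumes `buildah`; `--timestamp` pins file and image timestamps-verify the flag against your buildah version):

```shell
#!/bin/sh -e
# Build twice from scratch; identical inputs must yield identical image IDs.
export SOURCE_DATE_EPOCH=1700000000
id1=$(buildah bud -q --timestamp "$SOURCE_DATE_EPOCH" --no-cache -f Dockerfile .)
id2=$(buildah bud -q --timestamp "$SOURCE_DATE_EPOCH" --no-cache -f Dockerfile .)
if [ "$id1" = "$id2" ]; then
  echo "reproducible: $id1"
else
  echo "digest drift: $id1 != $id2" >&2
  exit 1
fi
```

`--no-cache` matters: a cache hit would trivially reproduce the digest without proving the build itself is deterministic.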
Month 2 Capstone Deliverable¶
A `filesystems-and-builds/` workspace:
1. `overlayfs-by-hand/` - week 5 lab.
2. `buildah-pipeline/` - week 6 daemonless build pipeline.
3. `three-image-diet/` - week 7 size comparison + tradeoff analysis.
4. `reproducible-build/` - week 8 with hash-equivalence CI gate.
Month 3-Runtimes and Daemons: containerd, CRI-O, podman, Rootless¶
Goal: by the end of week 12 you can (a) deploy and operate containerd directly, (b) explain the CRI (Container Runtime Interface) and how Kubernetes drives it, (c) run rootless containers fluently with podman, and (d) reason about runtime choices (runc vs crun vs gVisor vs Kata) for a workload.
Weeks¶
- Week 9 - `containerd` Architecture
- Week 10 - CRI-O and the Kubernetes CRI
- Week 11 - `podman` and the Rootless Model
- Week 12 - Sandboxed Runtimes: gVisor and Kata Containers
Week 9 - containerd Architecture¶
9.1 Conceptual Core¶
- `containerd` is a container daemon that manages: image pull/push, content storage, layered snapshotters, runtime invocation (via OCI-spec runtimes like `runc`/`crun`), and task/process management.
- It is not a monolithic daemon. Plugins (snapshotter, runtime, content store) are pluggable.
- `containerd` is what `dockerd` actually calls underneath; it is also the default runtime daemon in Kubernetes since 1.24.
9.2 Mechanical Detail¶
- Architecture (read
`containerd/containerd/docs/architecture.md`):
- Content store-content-addressed blob storage.
- Image store-refs and image metadata.
- Snapshotters-
`overlayfs`, `btrfs`, `zfs`, `stargz`, `devmapper`. Pluggable.
shimmodel (one shim process per container, decouples container lifecycle from daemon restart). - Tasks API-manage processes inside containers.
- Events API-pub/sub for lifecycle events.
- The
ctrCLI is a debugging tool, not a user-facing CLI. For users:nerdctl(Docker-compatible),crictl(CRI-level), or higher-level (Kubernetes, Buildah). - The shim (
containerd-shim-runc-v2) keeps the container alive acrosscontainerddaemon restarts. Each container has its own shim.
9.3 Lab-"containerd Without Kubernetes"¶
- Install
containerdandnerdctl. Configure/etc/containerd/config.toml. - Pull, run, exec, kill containers entirely via
nerdctl. Confirmdockerdis not running. - Enable the
stargz-snapshotter. Pull a large image with eStargz layers. Measure first-run startup time vs cold pull. - Use
ctrto inspect tasks, snapshots, and content blobs at the daemon level.
9.4 Hardening Drill¶
- Configure
containerdto use a custom seccomp profile and AppArmor (or SELinux) profile by default. Updateconfig.toml's[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]section.
9.5 Production Readiness Slice¶
- Wire
`containerd` metrics to Prometheus (`metrics.address` in config). Plot container start latency, image pull bytes, snapshotter ops/sec.
Week 10 - CRI-O and the Kubernetes CRI¶
10.1 Conceptual Core¶
- CRI (Container Runtime Interface) is Kubernetes's gRPC API for talking to a container runtime. Two implementations dominate:
`containerd` (with its CRI plugin) and CRI-O (built specifically for Kubernetes, Red Hat's choice).
- CRI is narrower than full container management: it covers what the kubelet needs and nothing else. No image-build, no high-level operations.
- CRI-O philosophy: minimum daemon surface, OCI-spec-only, no Docker compatibility.
10.2 Mechanical Detail¶
- CRI services:
`RuntimeService` (containers, sandboxes, exec/attach/portforward) + `ImageService` (pull, list, remove). Defined in `cri-api/pkg/apis/runtime/v1alpha2/api.proto`.
- The pause container (every k8s pod has one): holds the network namespace alive while application containers are restarted. CRI-O and containerd both use this pattern.
- `crictl` - a direct CRI client for debugging. `crictl ps`, `crictl images`, `crictl inspect`. Different from `kubectl`; it talks to the kubelet's runtime, not to the API server.
10.3 Lab-"CRI Direct"¶
- Install
CRI-O on a clean machine (or use `containerd` with its CRI plugin). - Use
`crictl runp` (run pod), `crictl create`, `crictl start` to manually launch a pod-equivalent without Kubernetes. Inspect with `crictl inspect`. - Add an OCI hook (e.g., a pre-start hook that logs every container) by configuring CRI-O's
`hooks_dir`.
10.4 Hardening Drill¶
- Set CRI-O's default seccomp to
`runtime/default`, AppArmor to `runtime/default` (Ubuntu) or SELinux to `container_t` (RHEL). These are the same defaults Kubernetes applies; understanding them at the runtime level demystifies pod-level errors.
10.5 Production Readiness Slice¶
- Set up
`cri-tools`/`critest` against your runtime; the suite verifies CRI compliance. A non-compliant runtime is a recipe for kubelet bugs in production.
Week 11 - podman and the Rootless Model¶
11.1 Conceptual Core¶
- `podman` is a daemonless, drop-in `docker` replacement. It runs containers as the calling user, with no long-running daemon. Each `podman run` is a fork-exec into a `conmon` supervisor + `runc`/`crun`.
- Rootless containers run as a non-root user on the host, with a user namespace mapping a host UID to UID 0 inside the container. The biggest single security win in the modern container ecosystem.
- Rootless is now mature: works with overlay (since kernel 5.11), works with networking via
`slirp4netns` (slow) or `pasta` (fast, kernel 6.0+).
11.2 Mechanical Detail¶
- Rootless storage:
`~/.local/share/containers/storage/`. Image and container state per-user. - Rootless networking:
`slirp4netns` - a userspace TCP/IP stack; works everywhere, slow (~1 Gbps). `pasta` - newer, kernel-bypass via vsock-like tricks; faster (~10 Gbps).
- `subuid`/`subgid` files (`/etc/subuid`, `/etc/subgid`) define the host UID range mapped into the user namespace. Default: 65536 IDs per user.
- `podman generate systemd` - produce systemd unit files for rootless containers. The recommended path for "always-on" rootless services.
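The mapping is easy to inspect directly (a sketch; `podman unshare cat /proc/self/uid_map` shows the same view for podman's own namespace):

```shell
# Show the subordinate-ID grant for the current user (format: user:start:count).
grep "^$USER:" /etc/subuid

# Enter a user namespace mapping this UID to 0, print the kernel's view of it.
unshare --user --map-root-user cat /proc/self/uid_map
# Typical single-line output: "0 <your-uid> 1" -
# UID 0 inside the namespace maps to your unprivileged UID outside.
```

With the subuid range configured, podman extends this single-entry map to the full 65536-ID range, which is what lets container images with many UIDs unpack rootlessly.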
11.3 Lab-"Rootless Production"¶
- As a non-root user, install
`podman`. Configure `/etc/subuid`, `/etc/subgid`.
podman play kube(Kubernetes-YAML-as-podman-input). - Generate systemd units; install with - -user
. The service starts at user login and persists across reboots (withloginctl enable-linger`). - Compare
slirp4netnsvspastanetworking throughput withiperf3.
11.4 Hardening Drill¶
- Confirm rootless containers cannot escape: try mounting host paths, accessing host devices, inspecting host processes. Each should fail (or be remapped innocuously via the user namespace).
11.5 Production Readiness Slice¶
- Convert one production-ish workload from rootful Docker/`runc` to rootless `podman`. Document the operational deltas (e.g., binding ports < 1024 needs `CAP_NET_BIND_SERVICE` or `sysctl net.ipv4.ip_unprivileged_port_start=80`).
Week 12 - Sandboxed Runtimes: gVisor and Kata Containers¶
12.1 Conceptual Core¶
- For untrusted workloads (multi-tenant SaaS, untrusted code execution), namespaces+cgroups+seccomp are not enough. The kernel attack surface is too large.
- Two production-grade alternatives:
  - gVisor (`runsc`) - a userspace kernel that intercepts syscalls. Lower overhead than VMs; more compatible than seccomp-based sandboxes. Used in App Engine and Cloud Run.
  - Kata Containers - runs each container (or pod) in a lightweight VM. Hardware-accelerated isolation; higher overhead but stronger guarantees. Used by Confidential Containers and Alibaba Cloud.
- Both are OCI-spec runtimes - drop-in replacements for `runc` in containerd/CRI-O. The OCI spec abstraction is what makes this possible.
12.2 Mechanical Detail¶
- gVisor (`runsc`):
  - The Sentry component implements a Linux-compatible kernel in user space.
  - The Gofer component proxies file I/O.
  - Configure via `runtimeClassName` in Kubernetes; configure containerd to register `runsc` as an additional runtime.
  - Performance: I/O-bound workloads suffer most (the gofer hop); CPU-bound workloads run near-native.
- Kata:
- Each container/pod gets its own micro-VM (Firecracker, Cloud Hypervisor, or QEMU).
- The kata-agent runs inside the VM; kata-runtime on the host orchestrates.
- Performance: ~10–20% overhead vs runc; sub-second VM boot via Firecracker.
12.3 Lab-"Two Sandboxes"¶
- Install gVisor. Register it as a containerd runtime. Run `nerdctl --runtime runsc` against a test workload.
- Install Kata. Register it as a containerd runtime. Run the same workload.
- Benchmark both vs `runc` for: startup time, a syscall-heavy workload (e.g., `find /usr -type f`), and a CPU-bound workload (e.g., `sysbench cpu`).
- Document the tradeoffs in a markdown matrix.
12.4 Hardening Drill¶
- Read the gVisor security model; identify the syscalls it does not implement (and would refuse). Compare to a default seccomp profile-gVisor is strictly stronger.
12.5 Production Readiness Slice¶
- Choose the right runtime for the right workload. Document a per-workload decision matrix in your team's runbook: trusted internal services → `runc`/`crun`, customer-supplied code → `runsc`, regulated/PCI workloads → Kata.
Month 3 Capstone Deliverable¶
A runtimes-and-daemons/ workspace:
1. containerd-direct/ - week 9 setup + Prometheus wiring.
2. crio-no-k8s/ - week 10 manual pod operations.
3. rootless-systemd/ - week 11 podman + systemd setup.
4. sandbox-bench/ - week 12 runc/runsc/kata comparison report.
Month 4-Container Security¶
Goal: by the end of week 16 you can (a) author seccomp profiles tailored to a service, (b) decompose Linux capabilities and assign minimum sets, (c) configure SELinux and AppArmor for containerized workloads, and (d) run rootless, user-namespaced containers as the default.
Weeks¶
- Week 13 - The Default Threat Model
- Week 14 - Capabilities for Containers
- Week 15 - Seccomp Profiles for Containers
- Week 16 - LSM for Containers: SELinux and AppArmor
Week 13 - The Default Threat Model¶
13.1 Conceptual Core¶
- The container default threat model is not "isolation comparable to a VM." It is "isolation comparable to a chroot with namespaces, seccomp, and capability dropping." That is good for most use cases but breaks down for:
- Untrusted code execution.
- Multi-tenant SaaS where tenants can submit code.
- Workloads that hold secrets cross-cutting tenants.
- For those, use a sandboxed runtime (Month 3 week 12).
- The default configuration in legacy Docker installs - rootful, capability-rich, weak seccomp - is the worst-case starting point. Modern tooling (rootless podman, distroless images, Kata) flips the defaults.
13.2 Mechanical Detail-Threat Surfaces¶
- Container escape via kernel exploit. Mitigation: keep kernel patched; gVisor/Kata for high-value targets.
- Container escape via misconfiguration. Most common: `--privileged`, a mounted Docker socket, host PID/NET namespaces, writable host paths.
- Image supply-chain attack. Mitigation: SBOM, signing, allowlisted base images. Month 5.
- Lateral movement via shared resources. Mitigation: PodSecurity policies, network policies, secret scoping.
- Resource exhaustion (DoS). Mitigation: cgroups v2 limits.
13.3 Lab-"Audit a Real Image"¶
- Pick a popular base image (`nginx`, `redis`). Scan with `docker scout cves` and `trivy image`. Record findings.
- Run with defaults; identify how many capabilities it has via `capsh --print`.
- Re-run with `--cap-drop=ALL`, `--security-opt=no-new-privileges`, and a read-only rootfs. Identify what breaks. Fix what's needed.
- Document the minimum config to run safely.
13.4 Hardening Drill¶
- Establish a baseline `docker run` (or `podman run`) policy: read-only rootfs, `--cap-drop=ALL`, `no-new-privileges`, default seccomp, non-root user, tmpfs for `/tmp`, memory and CPU limits.
13.5 Production Readiness Slice¶
- Author a `policy.json` for `skopeo`/`podman` that requires signed images for any registry path containing `prod`. Verification will be wired in week 19.
Week 14 - Capabilities for Containers¶
14.1 Conceptual Core¶
Linux capabilities subdivide root privilege into ~40 named caps (CAP_NET_ADMIN, CAP_SYS_PTRACE, etc.). Container runtimes apply a bounding set before exec - the container's processes can never gain a capability outside this set, regardless of UID 0.
The default Docker bounding set is ~14 caps: CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW, SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE, SYS_CHROOT, KILL, AUDIT_WRITE. Each enables a specific class of syscalls.
The discipline: most workloads need zero capabilities. Drop everything (--cap-drop=ALL); add back only what testing proves you need.
14.2 Mechanical Detail¶
The capabilities you'll meet most often, with what they unlock:
- `CAP_NET_BIND_SERVICE` - bind to ports < 1024. Common for legacy services; modern apps run on 8080+ and skip this.
- `CAP_NET_ADMIN` - configure interfaces, iptables, routing. Network plugins (CNI) and service meshes need it. App containers should not.
- `CAP_NET_RAW` - open raw/packet sockets (ICMP, `ping`). Often dropped: most services don't need to ping anything from inside.
- `CAP_SYS_ADMIN` - the kitchen sink. ~40 different operations gated by it. Avoid at all costs; equivalent to root in many threat models.
- `CAP_SYS_PTRACE` - attach to other processes (debuggers, `strace`). Confine to debug containers only.
- `CAP_DAC_OVERRIDE` - bypass file permission checks. Almost always a sign of bad file ownership; fix the ownership instead.
Read what a running container actually has: `capsh --decode=$(grep CapEff /proc/<pid>/status | awk '{print $2}')`.
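The `capsh --decode` step can also be done by hand, which makes the bitmask nature of `CapEff` concrete. A Go sketch (the `DecodeCapMask` helper and its partial name table are illustrative, not libcap's API):

```go
package main

import (
	"fmt"
	"strconv"
)

// capNames maps capability bit positions to names (a subset; the full list
// lives in linux/capability.h and capabilities(7)).
var capNames = map[uint]string{
	0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 2: "CAP_DAC_READ_SEARCH",
	3: "CAP_FOWNER", 4: "CAP_FSETID", 5: "CAP_KILL",
	6: "CAP_SETGID", 7: "CAP_SETUID", 8: "CAP_SETPCAP",
	10: "CAP_NET_BIND_SERVICE", 12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
	16: "CAP_SYS_MODULE", 18: "CAP_SYS_CHROOT", 19: "CAP_SYS_PTRACE",
	21: "CAP_SYS_ADMIN", 27: "CAP_MKNOD", 29: "CAP_AUDIT_WRITE",
	31: "CAP_SETFCAP",
}

// DecodeCapMask turns a CapEff hex string (as found in /proc/<pid>/status)
// into the list of set capability names.
func DecodeCapMask(hexMask string) ([]string, error) {
	mask, err := strconv.ParseUint(hexMask, 16, 64)
	if err != nil {
		return nil, err
	}
	var caps []string
	for bit := uint(0); bit < 64; bit++ {
		if mask&(1<<bit) != 0 {
			name, ok := capNames[bit]
			if !ok {
				name = fmt.Sprintf("CAP_%d", bit)
			}
			caps = append(caps, name)
		}
	}
	return caps, nil
}

func main() {
	// 00000000a80425fb is the classic 14-cap Docker default bounding set.
	caps, _ := DecodeCapMask("00000000a80425fb")
	for _, c := range caps {
		fmt.Println(c)
	}
}
```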
Apply in Kubernetes via pod spec:
securityContext:
capabilities:
drop: ["ALL"]
add: ["NET_BIND_SERVICE"] # only if you actually need it
The trap
Granting CAP_SYS_ADMIN because "the container needs to mount something." 90% of the time the actual need is for CAP_SYS_CHROOT or a specific filesystem-related cap. SYS_ADMIN opens ~40 unrelated operations and is the single most-abused capability in misconfigured containers.
14.3 Lab - "Capability Diet"¶
- For three services (e.g., a Go HTTP server, an Nginx reverse proxy, a Node.js app), run each with `--cap-drop=ALL`. Identify what fails (the error usually names the blocked syscall - map it back to a capability via `capabilities(7)`).
- Add back capabilities one at a time. Document the minimum set per service.
- Configure your container runtime (podman, containerd) or pod-security policy to apply this minimum by default.
14.4 Hardening Drill¶
For any service requiring more than 3 caps, write a one-paragraph justification. If you can't justify, you don't need it. Common offenders worth re-auditing: anything inheriting from an old base image, anything that runs as root inside the container.
14.5 Production Readiness Slice¶
Add a CI step that fails any new image whose declared capability set exceeds the team's allowlist. Trivy, Kyverno, OPA, and Pod Security Standards (Kubernetes restricted profile) all support this check. The right gate: PR-time, not runtime - runtime is too late.
Week 15 - Seccomp Profiles for Containers¶
15.1 Conceptual Core¶
A seccomp profile is a JSON document describing per-syscall actions: allow, log, errno (return an error), or kill (terminate the process). The container runtime compiles the JSON to a BPF filter and applies it via `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filter)`. Once installed, the filter cannot be loosened - only tightened.
The default Docker profile allows ~310 syscalls and blocks ~50 (the ones rarely needed by app containers but useful to attackers - keyctl, kexec_load, umount, etc.). Multiple recent kernel CVEs have been blocked entirely by the default profile, even on unpatched hosts. Tighter custom profiles per service reduce attack surface further.
15.2 Mechanical Detail¶
Profile structure: a defaultAction plus per-syscall rules, with optional argument-value matching.
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{ "names": ["read", "write", "exit_group", "futex", "mmap"],
"action": "SCMP_ACT_ALLOW" },
{ "names": ["openat"],
"action": "SCMP_ACT_ALLOW",
"args": [{"index": 2, "value": 0, "op": "SCMP_CMP_MASKED_EQ", "valueTwo": 2}] }
]
}
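A runtime consumes a profile like the one above by unmarshalling it into typed structs before compiling the BPF filter. A Go sketch using a minimal subset of the OCI seccomp schema (struct names are illustrative; the runtime-spec `specs-go` package has the canonical types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal structs mirroring the OCI runtime-spec seccomp fields used above.
type SeccompArg struct {
	Index    uint   `json:"index"`
	Value    uint64 `json:"value"`
	ValueTwo uint64 `json:"valueTwo,omitempty"`
	Op       string `json:"op"`
}

type SeccompSyscall struct {
	Names  []string     `json:"names"`
	Action string       `json:"action"`
	Args   []SeccompArg `json:"args,omitempty"`
}

type SeccompProfile struct {
	DefaultAction string           `json:"defaultAction"`
	Syscalls      []SeccompSyscall `json:"syscalls"`
}

// ParseProfile unmarshals a profile; a real runtime would next hand each
// rule to libseccomp to build the BPF program.
func ParseProfile(data []byte) (*SeccompProfile, error) {
	var p SeccompProfile
	if err := json.Unmarshal(data, &p); err != nil {
		return nil, err
	}
	return &p, nil
}

func main() {
	profile := []byte(`{
	  "defaultAction": "SCMP_ACT_ERRNO",
	  "syscalls": [
	    {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
	  ]
	}`)
	p, err := ParseProfile(profile)
	if err != nil {
		panic(err)
	}
	fmt.Println(p.DefaultAction, len(p.Syscalls))
}
```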
Generating profiles for a specific service:
- `oci-seccomp-bpf-hook` (Red Hat) - attach to a container, record every syscall it makes during a representative workload, emit a JSON profile. The right tool for "what does this app actually need?"
- `falcoctl` / Falco artifacts - newer; supports community-shared profiles.
- Manual: `strace -c -ff -o trace.out <cmd>` - enumerate syscalls under load, then deny everything else. Slower but no extra dependency.
Apply:
- Docker / podman: `--security-opt seccomp=profile.json`.
- Kubernetes pod spec: `securityContext.seccompProfile.type: Localhost` + `localhostProfile: profiles/myapp.json` (the kubelet looks under `/var/lib/kubelet/seccomp/`).
The trap
Recording a seccomp profile from a happy-path workload only. Edge cases (error handling, log rotation, graceful shutdown) need different syscalls; the profile blocks them in production and the service crashes mysteriously hours later. Always run the recorder through your full integration-test suite, not just the smoke test.
15.3 Lab - "Custom Seccomp"¶
- Run a service under `oci-seccomp-bpf-hook` (or `strace -ff`) and exercise it with your integration tests.
- Generate a tight profile (default-deny + only the recorded syscalls).
- Run with the profile; verify the service works under load.
- Invoke a "test" syscall (e.g., `setns`, `unshare`, or `mount`) the service doesn't legitimately use; verify it's blocked at runtime.
15.4 Hardening Drill¶
For long-running services, ship the custom seccomp profile alongside the image (e.g., as /seccomp/profile.json baked in, or as a ConfigMap mounted into /var/lib/kubelet/seccomp/). Reference it in deployment configs. Version it with the code - a profile that goes stale relative to its app is worse than no profile.
15.5 Production Readiness Slice¶
Document a process: every new service must ship with a seccomp profile generated from a representative load test, reviewed by a peer, committed to the repo. Pre-prod CI: run with the profile in audit-only mode (SCMP_ACT_LOG), collect any unexpected syscalls, fail the build if there are deltas from the committed profile.
Week 16 - LSM for Containers: SELinux and AppArmor¶
16.1 Conceptual Core¶
- SELinux and AppArmor add a third confinement layer (after capabilities and seccomp): what objects (files, sockets) the process may access.
- Most container runtimes apply a generic profile by default:
  - SELinux: the `container_t` type, with category-based isolation (multi-category security, MCS) so two containers on the same host can't access each other's files even if they share a UID.
  - AppArmor: `runtime/default` (a path-based profile generated by Docker/podman).
- Custom profiles tighten further; this is what hardened multi-tenant container hosts (OpenShift, Bottlerocket) ship.
16.2 Mechanical Detail¶
- SELinux for containers:
  - Each container gets a unique MCS category, applied via the `:s0:cN,cM` suffix on the type.
  - Policy: `container.te` from the `container-selinux` package.
  - Volume mounts must be relabeled (`:Z` mount option) or marked `:z` (shared label) - otherwise SELinux denies access.
  - Debugging: `ausearch -m AVC -ts recent`.
- AppArmor for containers:
  - Profile in `/etc/apparmor.d/`. Load with `apparmor_parser -r`.
  - Apply with `--security-opt apparmor=my-profile`.
  - Tooling: `aa-genprof`, `aa-logprof` to iterate.
16.3 Lab-"MAC Per Workload"¶
- On RHEL/Fedora: write a custom SELinux policy module for one service. Test enforcement.
- On Ubuntu/Debian: write an AppArmor profile for the same service. Test enforcement.
- Document the comparative effort and expressivity.
16.4 Hardening Drill¶
- Verify your hosts run with the LSM in enforcing mode (`getenforce`, `aa-status`). Permissive is for development only.
16.5 Production Readiness Slice¶
- Wire LSM denial alerts: ship `audit.log` (SELinux) or `kern.log` (AppArmor) to your central log aggregator; alert on AVCs from prod containers.
Month 4 Capstone Deliverable¶
A container-security/ workspace:
1. image-audit/ - week 13's audit findings + remediation.
2. cap-diet/ - week 14's capability matrix per service.
3. seccomp-profiles/ - week 15's per-service profiles, CI-validated.
4. lsm-profiles/ - week 16's SELinux + AppArmor profiles.
A THREAT_MODEL.md covering: what the model isolates against, what it doesn't, what to use instead for the gaps.
Month 5-Container Supply Chain: SBOM, Vulnerability Scanning, Signing, SLSA¶
Goal: by the end of week 20 you can (a) generate accurate SBOMs (Syft), (b) scan for CVEs (Grype, Trivy) and triage findings, (c) sign images and verify with cosign, and (d) target SLSA Level 3 in your build pipeline.
Weeks¶
- Week 17 - Software Bill of Materials (SBOM)
- Week 18 - Vulnerability Scanning: Grype, Trivy, Clair
- Week 19 - Signing and Verification: Cosign, Sigstore
- Week 20 - SLSA, Provenance, and Reproducibility
Week 17 - Software Bill of Materials (SBOM)¶
17.1 Conceptual Core¶
- An SBOM is a structured manifest of every component, dependency, and license in a software artifact. Two dominant formats: SPDX and CycloneDX. Both have JSON serializations and are interchangeable for most uses.
- Three levels of SBOM accuracy:
  - Source SBOM - generated from `go.mod`, `package.json`, etc., before the build.
  - Build SBOM - generated by the build tool (`buildah`, `goreleaser`, `docker buildx`).
  - Image SBOM - generated from a built image (Syft, Trivy). May differ from the source SBOM if the build introduces or strips dependencies.
- For compliance, ship the image SBOM attached to the image (as an attestation in the registry) and verify on consumption.
17.2 Mechanical Detail¶
- Syft (Anchore): `syft <image> -o spdx-json > sbom.json`. Inspects layers, identifies packages by ecosystem (apk, dpkg, rpm, npm, gomod, pip, gem, cargo, ...).
- Trivy also generates SBOMs (`trivy image --format cyclonedx`), with overlapping but slightly different package detection.
- OCI image artifacts - SBOMs can be attached to the image in the registry as separate artifacts via the OCI Reference Types specification. `cosign attach sbom` is the canonical command.
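The image-SBOM-vs-source-SBOM comparison (and the week's diff tooling) boils down to set arithmetic over `name@version` pairs. A Go sketch of the core, assuming package maps have already been extracted from the two SBOMs (the `DiffPackages` helper is hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// DiffPackages compares two SBOM package sets (name -> version) and reports
// what was added, removed, or changed between image versions.
func DiffPackages(prev, cur map[string]string) (added, removed, changed []string) {
	for name, v := range cur {
		if prevV, ok := prev[name]; !ok {
			added = append(added, name+"@"+v)
		} else if prevV != v {
			changed = append(changed, fmt.Sprintf("%s: %s -> %s", name, prevV, v))
		}
	}
	for name, v := range prev {
		if _, ok := cur[name]; !ok {
			removed = append(removed, name+"@"+v)
		}
	}
	// Sort for a stable, diff-friendly report.
	sort.Strings(added)
	sort.Strings(removed)
	sort.Strings(changed)
	return
}

func main() {
	prev := map[string]string{"openssl": "3.0.8", "zlib": "1.2.13"}
	cur := map[string]string{"openssl": "3.0.13", "busybox": "1.36.1"}
	a, r, c := DiffPackages(prev, cur)
	fmt.Println("added:", a)
	fmt.Println("removed:", r)
	fmt.Println("changed:", c)
}
```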
17.3 Lab-"SBOM Pipeline"¶
- Generate an SBOM for one of your images with Syft (SPDX) and Trivy (CycloneDX). Diff the two - note where they disagree.
- Attach the SBOM to the image with `cosign attach sbom`.
- From a downstream consumer, retrieve and parse the SBOM with `cosign download sbom`.
- Add a CI step that fails the build if the SBOM contains a known-bad license (e.g., AGPL in a closed-source project).
17.4 Hardening Drill¶
- Build a script that diffs SBOMs between two image versions and produces a "what changed" report. Use it in PR review for base-image bumps.
17.5 Production Readiness Slice¶
- Require an attached SBOM for every image promoted to a `prod/` registry path. Verify in CI before promotion.
Week 18 - Vulnerability Scanning: Grype, Trivy, Clair¶
18.1 Conceptual Core¶
- Scanners cross-reference image contents (or SBOMs) against vulnerability databases (NVD, distro-specific advisories, GitHub Security Advisories). They emit findings with CVE IDs, severities, and (sometimes) fixed versions.
- The discipline is triage, not zero-CVE. A `Critical` CVE in a package you don't actually exercise is still a finding, but lower priority than a `High` in the request-handling path.
- Tools:
  - Grype (Anchore) - SBOM-friendly; pairs with Syft.
  - Trivy (Aqua) - fast, broad ecosystem coverage; also handles config (Kubernetes YAML, Terraform).
  - Clair (Quay) - registry-side scanning; powers the scan UIs in Quay and Harbor.
18.2 Mechanical Detail¶
- Severity classifications: NVD CVSS v3 score → Critical (≥9.0), High (7.0–8.9), Medium (4.0–6.9), Low (<4.0). Project-specific scores may differ.
- Vulnerability Exploitability eXchange (VEX) - declares whether a CVE is actually exploitable in your context: `affected`, `not_affected`, `fixed`, `under_investigation`. Use OpenVEX or CSAF VEX to suppress non-exploitable findings without hiding them.
- Allowlist / ignore files - `.trivyignore`, `.grype.yaml`. Use sparingly; document each entry's rationale.
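The score-to-severity mapping just described is mechanical and worth encoding once. A Go sketch of the NVD CVSS v3 thresholds (`Severity` is an illustrative helper name):

```go
package main

import "fmt"

// Severity maps a CVSS v3 base score to the NVD qualitative rating.
// 0.0 is "None"; project-specific scores may differ from NVD's.
func Severity(score float64) string {
	switch {
	case score >= 9.0:
		return "Critical"
	case score >= 7.0:
		return "High"
	case score >= 4.0:
		return "Medium"
	case score > 0:
		return "Low"
	default:
		return "None"
	}
}

func main() {
	for _, s := range []float64{9.8, 7.5, 5.3, 3.1, 0} {
		fmt.Printf("%.1f -> %s\n", s, Severity(s))
	}
}
```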
18.3 Lab-"Triage in CI"¶
- Run Trivy on an image; produce a SARIF report. Upload to GitHub Code Scanning (or your scanner of choice).
- Pick three findings; for each, write a one-paragraph triage decision: fix, accept, or VEX-suppress.
- Author the VEX statement using `vexctl` (OpenVEX). Attach it to the image.
- Re-scan - verify the suppressed findings are now flagged as "not exploitable" rather than disappearing entirely.
18.4 Hardening Drill¶
- Set CI policy: builds fail on `Critical` or `High` vulns with available fixes. Builds warn (do not fail) on findings without fixes - but require a VEX statement within 7 days.
18.5 Production Readiness Slice¶
- Set up continuous re-scanning: nightly scans of all production images against the latest vulnerability database. New critical CVEs page on-call.
Week 19 - Signing and Verification: Cosign, Sigstore¶
19.1 Conceptual Core¶
- Sigstore is a free public-good infrastructure for software signing: keyless signatures via OIDC (your GitHub/Google identity becomes the signer), a transparency log (Rekor), and a CA (Fulcio).
- Cosign is the CLI: sign images, verify signatures, attach attestations (SBOMs, VEX, SLSA provenance), all backed by Sigstore by default-or a private key for offline use.
- The promise: every artifact you publish has a verifiable link back to who built it, what SBOM was attached, when, with cryptographic proof recorded in a public ledger.
19.2 Mechanical Detail¶
- `cosign sign <image>` - keyless signing via OIDC; opens a browser for auth; uploads a short-lived cert + signature to Rekor.
- `cosign verify --certificate-identity user@example --certificate-oidc-issuer https://accounts.google.com` - verifies signer identity and issuer.
- For private/offline use: `cosign generate-key-pair`; `cosign sign --key cosign.key`; `cosign verify --key cosign.pub`.
- Attestations - signed statements about an artifact: `cosign attest --predicate sbom.json --type spdx <image>`. The same mechanism carries the full SLSA provenance flow.
- Policy verification - `cosign verify` with policy: only specific signers, only via specific CI workflows (GitHub Actions OIDC subject `repo:org/repo:ref:refs/heads/main`).
19.3 Lab-"Signing Pipeline"¶
- Sign an image with cosign keyless (GitHub OIDC). Verify.
- Attach SBOM and VEX as attestations.
- Configure `policy-controller` (Sigstore's Kubernetes admission controller) to require a valid signature from your CI's OIDC subject before allowing deploys.
- Try to deploy an unsigned image - observe the rejection.
19.4 Hardening Drill¶
- Set registry retention policy: signed images permanent; unsigned images garbage-collected after 7 days. Forces a signing-or-discard discipline.
19.5 Production Readiness Slice¶
- Wire `cosign verify` into the `skopeo` `policy.json` - transparent verification on every pull. Document the disaster-recovery flow if your signing identity is compromised.
Week 20 - SLSA, Provenance, and Reproducibility¶
20.1 Conceptual Core¶
- SLSA (Supply chain Levels for Software Artifacts) is a graduated maturity model for build-pipeline integrity. Levels 1–4:
- L1: Build process documented; provenance recorded.
- L2: Hosted build service; signed provenance.
- L3: Hardened, isolated builds; non-falsifiable provenance.
- L4: Two-party review; reproducible.
- Most production CI/CD pipelines reach L2 with effort, L3 with discipline. L4 is rare.
- Provenance-a signed attestation describing how an artifact was built: source repo, commit, builder identity, build invocation, dependencies. The SLSA Provenance v1.0 schema is the standard.
20.2 Mechanical Detail¶
- GitHub Actions has a built-in OIDC token that includes the workflow's repo + ref + sha. `slsa-github-generator` consumes this to produce SLSA L3 provenance for releases.
- The `cosign attest --type slsaprovenance` flow attaches the provenance to the image.
- Reproducibility - ideally every commit produces a byte-identical artifact. `goreleaser` supports this; `buildah --source-date-epoch` plus pinned base images plus deterministic file ordering plus no embedded build-host info makes it possible.
20.3 Lab-"SLSA L3 in CI"¶
- Set up a GitHub Actions workflow that builds, scans, signs, and produces SLSA L3 provenance for an image on every release tag.
- Verify end-to-end: pull the image, retrieve its attestations, validate the provenance points back to the correct commit and CI run.
- Reproducibility: rebuild the same tag from a fresh runner; verify image digest stability.
20.4 Hardening Drill¶
- Document the kill chain: which exact component of your pipeline does an attacker compromise, and what does each SLSA level actually mitigate? Be concrete.
20.5 Production Readiness Slice¶
- Promotion gate: a Kubernetes admission policy (Kyverno, OPA, or `policy-controller`) that requires SLSA L3 provenance from your CI's identity for any production deploy.
Month 5 Capstone Deliverable¶
A supply-chain/ workspace:
1. sbom-pipeline/ - week 17 SBOM generation + attachment + diff tooling.
2. vuln-triage/ - week 18 scanner config + VEX statements.
3. cosign-flow/ - week 19 signing + admission verification.
4. slsa-l3/ - week 20 reproducible build with verified provenance.
A SUPPLY_CHAIN.md documenting the full provenance flow from source commit to running container.
Month 6-Build Your Own: Mini-Docker From Scratch¶
Goal: by the end of week 24 you have implemented a working "container runner" in Go or Rust, demonstrating manual orchestration of namespaces, cgroups, OverlayFS, and a small subset of OCI runtime spec. The artifact is a portfolio piece you can defend in any senior containers interview.
Weeks¶
- Week 21 - Scaffolding: Project Setup, OCI Bundle Reading
- Week 22 - Namespaces and Process Isolation
- Week 23 - Cgroups v2, Capabilities, Seccomp, OverlayFS
- Week 24 - Polish, Defense, Distribution
Week 21 - Scaffolding: Project Setup, OCI Bundle Reading¶
21.1 Conceptual Core¶
- The mini-Docker takes an OCI runtime bundle (a directory with `config.json` and `rootfs/`), sets up the appropriate kernel features, executes the configured command, and supervises until exit.
- Scope: the project will not implement all of the OCI runtime spec - focus on the core: namespaces, capabilities, mounts, cgroups v2 (memory + cpu + pids), seccomp.
- Two language tracks:
  - Go - leverages `runc`/`libcontainer` learnings, `golang.org/x/sys/unix` for syscalls. Closer to runc.
  - Rust - leverages the `nix` crate for syscalls; closer to youki. Stronger memory safety; uses `unsafe` sparingly.
21.2 Mechanical Detail¶
- Project layout (Go example):

    minidocker/
      cmd/minidocker/main.go   # CLI: create, start, run, kill, delete
      internal/
        bundle/    # parse config.json
        ns/        # namespace setup
        mount/     # rootfs mount, masked paths
        cgroup/    # cgroup v2 limits
        seccomp/   # filter compilation
        cap/       # capability dropping
        runtime/   # the orchestrator
      examples/
        bundle-alpine/
          config.json
          rootfs/  # umoci-extracted Alpine
- Subcommands:
  - `minidocker run` - create + start in one step (foreground).
  - `minidocker create <id>` / `start <id>` / `delete <id>` - split lifecycle.
  - `minidocker state <id>` - print state.
21.3 Lab-"Parse and Run"¶
- Implement `config.json` parsing (the `runtime-spec` repo has a Go reference type definition).
- Implement a no-isolation mode: just `chdir(rootfs)`, `chroot(rootfs)`, `execve`. Verify it runs.
- Add command-line plumbing for the lifecycle subcommands.
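The `config.json` parsing step can be sketched with a handful of struct types. This keeps only a minimal subset of the spec (the `runtime-spec` repo's `specs-go` package has the complete reference types; `LoadSpec` is an illustrative helper):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of the OCI runtime-spec config.json for week 21.
type Process struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

type Root struct {
	Path     string `json:"path"`
	Readonly bool   `json:"readonly"`
}

type Spec struct {
	OCIVersion string  `json:"ociVersion"`
	Process    Process `json:"process"`
	Root       Root    `json:"root"`
}

// LoadSpec parses a config.json and rejects bundles missing the two
// fields nothing can run without.
func LoadSpec(data []byte) (*Spec, error) {
	var s Spec
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	if s.Root.Path == "" || len(s.Process.Args) == 0 {
		return nil, fmt.Errorf("bundle config missing root.path or process.args")
	}
	return &s, nil
}

func main() {
	cfg := []byte(`{"ociVersion":"1.0.2","process":{"args":["/bin/sh"],"cwd":"/"},"root":{"path":"rootfs"}}`)
	s, err := LoadSpec(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.OCIVersion, s.Process.Args[0], s.Root.Path)
}
```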
21.4 Hardening Drill¶
- Validate `config.json` against the spec's JSON schema. Reject malformed bundles before any syscall.
21.5 Production Readiness Slice¶
- Add unit tests with a synthetic bundle. CI runs them on every commit.
Week 22 - Namespaces and Process Isolation¶
22.1 Conceptual Core¶
- The runtime needs to `clone` (or `fork`+`unshare`) into the configured namespaces, set up UID/GID maps for user namespaces, configure the UTS hostname, and `pivot_root` into the rootfs.
- The classic two-process pattern: the parent forks the child with `CLONE_NEWPID | CLONE_NEWNS | ...`; the parent writes UID/GID maps for the child; the child waits on a pipe for the parent to finish setup; the child performs final setup (mount `/proc`, `pivot_root`); the child execs.
22.2 Mechanical Detail¶
- In Go, `golang.org/x/sys/unix` has no direct `Clone` wrapper; use `syscall.SysProcAttr{Cloneflags: ...}` on an `exec.Cmd`, or the lower-level `syscall.Syscall(SYS_CLONE, ...)`.
- The `runc` "init" pattern: a self-re-exec into the binary with a sentinel argument signaling "I'm the container init." The first invocation does the setup; the re-exec performs `pivot_root` and the final `execve`. Read `runc/libcontainer/standard_init_linux.go`.
- UID/GID mapping: write to `/proc/<child-pid>/uid_map` and `gid_map`. For non-root parents, write `deny` to `setgroups` first.
- `pivot_root` requires a `mount(MS_PRIVATE)` of the parent mount before the call (to avoid leaking mounts back to the host).
22.3 Lab-"Namespaces Working"¶
- Implement the parent/child fork with clone flags. Verify `lsns -p <pid>` shows new namespaces.
- Implement `pivot_root` into the rootfs. Verify `/` inside the container is the bundle's `rootfs/`.
- Mount `/proc` inside the new PID namespace. Verify `ps` shows only the container's processes.
- Implement UID/GID mapping for user-namespaced runs.
22.4 Hardening Drill¶
- Mask `/proc/kcore`, `/proc/keys`, `/proc/timer_list`, `/proc/sched_debug`, `/proc/scsi`, `/sys/firmware`. Make `/proc/asound`, `/proc/bus`, `/proc/fs`, `/proc/irq`, `/proc/sys`, `/proc/sysrq-trigger` read-only. (Same as the runtime-spec's `maskedPaths` and `readonlyPaths`.)
22.5 Production Readiness Slice¶
- Run `runc`'s integration tests against your runtime if feasible (they're spec-compliance tests). At minimum, run a representative subset of the OCI runtime test suite.
Week 23 - Cgroups v2, Capabilities, Seccomp, OverlayFS¶
23.1 Conceptual Core¶
- The remaining isolation layers: cgroups for resource limits, capabilities for privilege restriction, seccomp for syscall filtering, OverlayFS for the rootfs (if not already prepared by `umoci`).
- Each layer applies at a specific lifecycle moment. Get the order wrong and the container starts but isolation is incomplete.
23.2 Mechanical Detail¶
- Cgroups v2:
  - Create `/sys/fs/cgroup/<container-id>/`.
  - Write `+memory +cpu +pids` to the parent's `cgroup.subtree_control`.
  - Write the child `<pid>` to the child cgroup's `cgroup.procs` after fork, before exec.
  - Set limits: `memory.max`, `memory.high`, `cpu.max` (`<quota> <period>`), `pids.max`.
- Capabilities: drop via `cap_set_proc` (libcap) or `prctl(PR_CAPBSET_DROP)` for the bounding set + `capset(2)` for the effective set. Apply after the setup syscalls that need them, before `execve`.
- Seccomp: compile the OCI seccomp profile to a BPF program (the `libseccomp` library does this). Apply with `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)`. Requires `no_new_privs` (`prctl(PR_SET_NO_NEW_PRIVS, 1)`).
- OverlayFS: if the bundle has separate lower/upper/work dirs, mount an overlay. Otherwise the rootfs is a single dir already - just bind-mount it.
23.3 Lab-"All The Layers"¶
- Implement cgroup v2 setup. Verify `memory.max=64M` actually limits the container.
- Implement capability dropping. Verify with `capsh --print` inside the container.
- Implement seccomp filter loading. Verify a denied syscall fails.
- (Optional) Implement OverlayFS rootfs construction from a multi-layer image.
23.4 Hardening Drill¶
- Port the default seccomp profile from `containers/common/pkg/seccomp` into your runtime as the default.
23.5 Production Readiness Slice¶
- Add chaos tests: deliberately malformed configs, runaway memory in the container, seccomp-blocked syscalls. Verify the runtime's error paths are clean.
Week 24 - Polish, Defense, Distribution¶
24.1 Conceptual Core¶
The final week is polish. Integration tests, documentation, performance profiling, and a publishable release.
24.2 Mechanical Detail-Polish Checklist¶
- All OCI lifecycle commands implemented (`create`, `start`, `state`, `kill`, `delete`).
- State persistence: `/run/minidocker/<id>/state.json` so `state` works across a supervisor restart.
- Hooks: `prestart`, `poststart`, `poststop` from the OCI spec (at least skeletal support).
- Console / TTY support if the spec sets `terminal: true`.
- Cleanup on error: cgroups removed, mounts unmounted, namespaces released.
24.3 Lab-"Defend the Project"¶
Schedule a 45-minute mock review:
- Live demo: build, run a container, exec into it, observe isolation.
- Walk through the lifecycle code with the OCI spec open beside it.
- Demo a hardened run (cgroups + caps + seccomp + LSM) and verify isolation.
- Compare with runc/crun: what's missing? What's different? Why is your design simpler?
24.4 Hardening Drill¶
- Run your runtime against `runc`'s integration test suite (where applicable). Document which subset passes; explain the gaps.
24.5 Production Readiness Slice¶
- Tag `v0.1.0`. Generate a release artifact (`goreleaser` for Go; `cargo dist` for Rust). Sign it with cosign. Publish.
Month 6 Deliverable¶
The mini-Docker, plus the aggregated container-mastery/ repo containing every prior month's deliverable.
Appendix A-Container Hardening Reference¶
Cumulative hardening checklist. By week 24 the reader's container-baseline/ template should encode every section.
A.1 Image Hardening¶
- Use distroless or scratch base images.
- Pin base by digest, not tag.
- Multi-stage build; runtime image has no compiler, no shell (unless required).
- Run as a non-root user (UID > 0); set an explicit `USER`.
- No secrets baked into the image; verify with `trivy image --scanners secret`.
- Every installed package has a justification.
- OCI annotations: `source`, `revision`, `created`, `licenses`.
- Reproducible: identical commit → identical digest.
- SBOM attached as attestation.
- Signed with cosign.
- CI gate: vulnerabilities triaged or VEX'd.
A.2 Runtime Hardening (per docker run / podman run / Kubernetes pod)¶
- `--read-only` rootfs.
- `tmpfs` for `/tmp`, `/run`.
- `--cap-drop=ALL`; add only what's needed.
- `--security-opt=no-new-privileges`.
- Custom seccomp profile (not the default if you can do better).
- LSM profile (SELinux: `container_t`; AppArmor: `runtime/default` or custom).
- Resource limits: `--memory`, `--cpus`, `--pids-limit`.
- Network: don't use `--network=host`. Use bridges or CNI.
- No `--privileged`. No `--pid=host`. No `--ipc=host`.
- No Docker socket mounted into the container.
- Volume mounts: only what's needed; consider `:ro`, `:Z` (SELinux relabel).
- Run the container in a user namespace (`--userns=auto` in podman; rootless by default).
A.3 Daemon / Host Hardening¶
- Rootless runtime (`podman` rootless mode).
- If using a daemon: `dockerd` or `containerd` running as a hardened systemd unit (apply the `Linux/05_security_and_hardening.md` directives).
- LSM enforcing on the host.
- Modern kernel (≥ 5.10 for full rootless features).
- OverlayFS instead of `vfs`.
- Disk quotas (`xfs_quota`) or `--storage-opt overlay.size=` for runaway-image protection.
A.4 Sandbox Tier (when needed)¶
For untrusted code:
- gVisor (`runsc`): registered as a containerd runtime. Mark untrusted workloads with `runtimeClassName: gvisor` (Kubernetes) or `--runtime=runsc` (nerdctl).
- Kata Containers: per-container micro-VM. Higher overhead, stronger isolation.
- Firecracker: micro-VM as a runtime. Used by AWS Lambda.
Decision matrix:

| Workload | Runtime |
|---|---|
| Trusted internal services | runc / crun |
| Customer-supplied code | runsc (gVisor) |
| Regulated / multi-tenant | Kata |
| Serverless, very short-lived | Firecracker / runsc |
A.5 Supply-Chain Hardening¶
- All images signed with cosign.
- All images have attached SBOMs.
- All images have SLSA L3 (or higher) provenance.
- Admission policy verifies signatures at pull time.
- Vulnerability scanning is continuous; new critical findings page the on-call.
- VEX statements for non-exploitable findings.
- Registry retention: signed images retained, unsigned discarded.
A.6 The container-baseline/ Template¶
container-baseline/
  Dockerfile.template    # multi-stage, distroless final
  build.sh               # buildah-based reproducible build
  cosign-policy.yaml     # admission policy (signature + SLSA)
  seccomp/
    default.json         # tighter than runtime/default
  apparmor/              # per-service profiles
  selinux/               # type-enforcement modules
  ci/
    build.yml            # build + scan + sign + attest
    promote.yml          # verify + tag + push to prod registry
  scripts/
    audit-image.sh       # trivy + grype + syft, formatted
  RUNBOOK.md
  THREAT_MODEL.md
Every image you ship after week 24 should be built using this template.
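As a sketch of the shape (not a production profile), `seccomp/default.json` follows the standard OCI seccomp format; the syscall allowlist here is illustrative and far too short for a real workload:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat", "mmap",
                "mprotect", "brk", "futex", "nanosleep", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

With `SCMP_ACT_ERRNO`, a denied syscall returns an error instead of killing the process, which makes the profile debuggable while you tighten it.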
Appendix B-Reference Patterns¶
Reference recipes for the patterns you'll reach for repeatedly.
B.1 Multi-Stage Distroless (Go)¶
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/root/.cache/go-build \
--mount=type=cache,target=/go/pkg/mod \
go mod download
COPY . .
RUN --mount=type=cache,target=/root/.cache/go-build \
CGO_ENABLED=0 go build -trimpath -ldflags="-s -w" -o /out/app ./cmd/app
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /out/app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]
B.2 Multi-Stage Distroless (Rust)¶
FROM rust:1.78 AS build
WORKDIR /src
COPY . .
RUN --mount=type=cache,target=/usr/local/cargo/registry \
    --mount=type=cache,target=/src/target \
    cargo build --release && mkdir -p /out && cp target/release/app /out/app
FROM gcr.io/distroless/cc-debian12:nonroot
COPY --from=build /out/app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]
For static Rust (musl), use gcr.io/distroless/static.
B.3 Multi-Stage Distroless (Python)¶
FROM python:3.12-slim AS build
WORKDIR /app
COPY requirements.txt .
RUN pip install --target=/install -r requirements.txt
COPY . .
FROM gcr.io/distroless/python3-debian12:nonroot
COPY --from=build /install /pkg
COPY --from=build /app /app
ENV PYTHONPATH=/pkg
USER nonroot:nonroot
ENTRYPOINT ["python", "/app/main.py"]
B.4 Multi-Arch Build with buildah¶
#!/usr/bin/env bash
set -euo pipefail
IMAGE=registry.local/myapp
TAG=v1.0.0
for arch in amd64 arm64; do
buildah build --arch=$arch --manifest $IMAGE:$TAG -t $IMAGE:$TAG-$arch .
done
buildah manifest push --all $IMAGE:$TAG docker://$IMAGE:$TAG
B.5 Reproducible Build¶
SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct)
buildah build \
--timestamp $SOURCE_DATE_EPOCH \
--pull-never \
--layers=false \
-t myapp:$TAG .
Avoid a bare `RUN apt-get update` in the Dockerfile; point apt at a frozen package mirror so the same commit always resolves to the same package versions.
B.6 Rootless Podman + Systemd¶
podman run --name myapp --rm -d ...
podman generate systemd --new --name myapp > ~/.config/systemd/user/myapp.service
systemctl --user daemon-reload
systemctl --user enable --now myapp.service
loginctl enable-linger
B.7 OCI Hooks¶
{
"hooks": {
"prestart": [
{"path": "/usr/local/bin/network-setup", "args": ["network-setup"], "timeout": 5}
],
"poststop": [
{"path": "/usr/local/bin/cleanup", "args": ["cleanup"]}
]
}
}
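Per the OCI runtime spec, each hook receives the container's state JSON (`id`, `pid`, `status`, `bundle`) on stdin. A minimal sketch of what a hook binary does with it (a hypothetical logging hook, not the `network-setup` binary above; in a real hook the sample would come from `sys.stdin`):

```python
# Minimal OCI-hook sketch: the runtime pipes the container's state JSON
# to the hook on stdin. A real prestart/createRuntime hook would use
# state["pid"] to enter the container's namespaces; here we just report.
import json

def describe(raw: str) -> str:
    state = json.loads(raw)  # fields defined by the OCI runtime spec
    return f"container {state['id']} (pid {state['pid']}) is {state['status']}"

sample = json.dumps({"ociVersion": "1.0.2", "id": "c1",
                     "status": "created", "pid": 4242,
                     "bundle": "/run/bundle"})
print(describe(sample))  # container c1 (pid 4242) is created
```

Note that newer spec revisions prefer `createRuntime`/`createContainer` over the older `prestart` name; the stdin contract is the same.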
B.8 CI/CD Skeleton (GitHub Actions)¶
name: build
on:
  push:
    branches: [main]
    tags: ['v*']
permissions:
  id-token: write   # required for cosign keyless
  contents: read
  packages: write
jobs:
  build-scan-sign:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: build
        uses: docker/build-push-action@v6
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          provenance: mode=max
          sbom: true
      - name: Scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/${{ github.repository }}:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1
      - uses: sigstore/cosign-installer@v3
      - name: Sign
        run: cosign sign --yes ghcr.io/${{ github.repository }}@${{ steps.build.outputs.digest }}
      - name: Attest SBOM
        # assumes an earlier step produced sbom.json (e.g. with syft)
        run: cosign attest --yes --type spdx --predicate sbom.json ghcr.io/${{ github.repository }}@${{ steps.build.outputs.digest }}
B.9 Debugging Recipes¶
- No shell in the image? `kubectl debug -it <pod> --image=busybox --target=<container>` adds an ephemeral debug container in the same namespaces.
- Inspect a layer? `skopeo inspect docker://image:tag` for the manifest; `skopeo copy docker://image:tag oci:./local:tag` to dump everything.
- What's running inside? From the host: `nsenter -t <pid> -a /bin/sh` enters all namespaces of the target.
- Why is it slow? `nsenter -t <pid> -a perf top` profiles inside the container.
- What syscalls is it making? `strace -f -p <host-pid>` works through namespaces.
Appendix C-Contributing to the Container Ecosystem¶
The container ecosystem is in a few major repos, all on GitHub, all with active maintainer communities.
C.1 The Project Map¶
| Project | Language | Scope | Difficulty | Notes |
|---|---|---|---|---|
| `opencontainers/runc` | Go | Reference OCI runtime | Hard | Slower review; correctness-first |
| `opencontainers/runtime-spec` | Markdown | Spec | Medium | Doc PRs welcome; behavior changes need TOB consensus |
| `opencontainers/image-spec` | Markdown | Spec | Medium | Same as above |
| `containers/podman` | Go | Daemonless container manager | Medium | Active, friendly |
| `containers/buildah` | Go | Image build library/CLI | Medium | Active, friendly |
| `containers/skopeo` | Go | Image transport tool | Easy–Medium | Smaller surface, good first patches |
| `containerd/containerd` | Go | Container daemon | Hard | Large; subsystem-specific reviewers |
| `cri-o/cri-o` | Go | Kubernetes CRI runtime | Medium | Friendly to newcomers |
| `containers/crun` | C | Fast OCI runtime | Hard | Single maintainer historically; high bar |
| `containers/youki` | Rust | Rust OCI runtime | Medium | Smaller, more approachable than runc |
| `sigstore/cosign` | Go | Image signing | Medium | Active, growing |
| `anchore/syft` & `anchore/grype` | Go | SBOM + vuln scan | Medium | Friendly |
| `aquasecurity/trivy` | Go | Vuln + config scan | Medium | High velocity |
| `google/go-containerregistry` | Go | OCI registry library | Easy–Medium | Underused gem; good starter |
C.2 First-Issue On-Ramps¶
Easy¶
- `skopeo`: documentation fixes, new transport options, bug reports with reproductions.
- `go-containerregistry`: examples, error-message improvements, new registry-mirror tests.
Medium¶
- `podman`: bug fixes for edge cases (look at the `kind/bug` + `good-first-issue` labels).
- `buildah`: feature-parity gaps with `docker build` / BuildKit features.
- `cri-o`: small CRI-protocol corrections.
- `syft`: new package-format detectors (e.g., a niche language ecosystem).
- `cosign`: documentation, integration examples, smaller bug fixes.
Hard¶
- `runc`: anything in `libcontainer/` is high-stakes and security-critical.
- `containerd`: snapshotters, the CRI plugin, runtime shims.
- `youki`: larger features, as the project still has gaps vs runc.
- `crun`: C, low-level, single-maintainer ergonomics.
C.3 The Workflow (typical GitHub project)¶
1. File or claim an issue. Read the contributing guide first; some projects require pre-discussion.
2. Fork, branch, code.
3. Sign off your commits. All container projects require DCO (`git commit -s`).
4. Run the test suite locally: `make test` or the per-project equivalent.
5. Open a PR. Reference the issue. Describe the change and the testing.
6. Address review. Most projects use squash-merge; rebase your branch on main as needed.
7. Merge. Maintainer approves; CI green; merged.
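The DCO sign-off step is worth seeing once in a throwaway repo (names and messages here are placeholders):

```shell
#!/usr/bin/env bash
# `git commit -s` appends a Signed-off-by trailer built from
# user.name/user.email -- the DCO attestation these projects require.
set -euo pipefail
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jane Hacker"
git config user.email "jane@example.com"
echo fix > patch.txt
git add patch.txt
git commit -qs -m "runtime: fix error path"
git log -1 --format=%B
# Last line of output: Signed-off-by: Jane Hacker <jane@example.com>
```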
Cycle time varies: skopeo / cosign / syft are fast (days). runc / containerd are slow (weeks).
C.4 The OCI Process (spec changes)¶
Spec changes are governed by the OCI Technical Oversight Board. The process:

1. Open an issue describing the proposed change.
2. Build consensus on the issue thread (this is the slowest step).
3. Open a PR with the spec change.
4. The TOB reviews; super-maintainers approve.
5. Merge; the spec is released on the next cadence.
This is not a fast path. Reserve it for genuine spec-shaped problems; everything else fits in a tool repo.
C.5 Calibration¶
A reasonable goal for a curriculum graduate:
- By end of week 23: a PR open against `skopeo`, `syft`, `cosign`, or another approachable repo (a doc fix or small bug fix is sufficient).
- By end of capstone: that PR merged.
- 6 months post-curriculum: a substantive contribution: a new transport in skopeo, a new package detector in syft, a new feature in podman.
The container ecosystem is genuinely welcoming to newcomers. The path from "user" to "contributor" is shorter than in most kernel-adjacent projects.
Capstone Projects-Three Tracks, One Choice¶
Pick one. The work performed here is what you describe in interviews.
Track 1-Mini-Docker (the curriculum's default)¶
Outcome: a from-scratch container runner in Go or Rust, demonstrating manual orchestration of namespaces, cgroups v2, OverlayFS, capabilities, seccomp, and a working subset of OCI runtime spec.
Functional spec¶
- Read an OCI runtime bundle (`config.json` + `rootfs/`).
- Lifecycle: `create`, `start`, `state`, `kill`, `delete`, plus `run` (create + start).
- Apply: PID/NET/MNT/UTS/IPC/USER/CGROUP namespaces, capability dropping, seccomp filter, cgroups-v2 controllers (memory/cpu/pids), `pivot_root`, masked/readonly paths.
- Spec compliance: passes a meaningful subset of the OCI runtime spec validator.
Non-functional spec¶
- Implemented in <2,500 lines of code (excluding tests).
- Memory footprint <20 MB resident.
- Container-start latency <100 ms on a baseline workload (within 2× of `runc`).
Architecture sketch¶
- Top-level CLI parses the lifecycle subcommand.
- Runtime orchestrator: state machine over the bundle's lifecycle.
- Init re-exec pattern (like `runc`'s "init"): the binary re-execs itself as the container's PID 1 supervisor, performs final mount + capability + seccomp setup, then execs the user command.
- State persistence in `/run/<runtime-name>/<id>/state.json`.
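The orchestrator's state machine reduces to a table of legal transitions. A sketch (operation and status names from the OCI runtime spec; Python for brevity, though the capstone itself is Go or Rust):

```python
# Which lifecycle operations are legal from which container status.
# Mirrors the OCI runtime spec: start only from "created"; delete only
# once the container process has stopped.
VALID_FROM = {
    "start":  {"created"},
    "kill":   {"created", "running"},
    "delete": {"stopped"},
}

def allowed(op: str, status: str) -> bool:
    return status in VALID_FROM.get(op, set())

print(allowed("start", "created"))   # True
print(allowed("delete", "running"))  # False: kill it first
```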
Test rigor¶
- Unit tests for: bundle parsing, namespace setup, cgroup writes, seccomp compilation.
- Integration tests: run real bundles (Alpine, BusyBox, a small Go binary) and assert behavior.
- Spec compliance test against (a subset of) `runtime-tools/validation`.
- Chaos: malformed configs, missing rootfs, OOM, exhausted PIDs.
Hardening pass¶
- Default seccomp profile equivalent to `containers/common/pkg/seccomp`.
- Default capability set: `CAP_NET_BIND_SERVICE` only.
- Read-only rootfs unless explicitly opted in.
- Rootless support (user namespace path tested).
Acceptance criteria¶
- Public repo, documented architecture.
- A README walkthrough: from `runc spec` to running `nginx` under your runtime.
- Integration tests in CI, passing.
- A short paper (3–5 pages) comparing your runtime to `runc`/`crun`/`youki`: what's similar, what's different, what you skipped and why.
Skills exercised¶
- All months, but heaviest on Months 1 (OCI specs), 2 (filesystems), 3 (runtimes), and 4 (security primitives).
Track 2-Image Scanning & Signing Service¶
Outcome: an HTTP service that ingests OCI image references, runs Syft + Grype + Trivy, attaches signed SBOMs and VEX statements, and exposes a policy-gated promotion API.
Functional spec¶
- `POST /scan {image}`: scan, return findings.
- `POST /promote {image, target}`: verify the image meets policy (signature, SBOM, vuln thresholds, SLSA provenance) before copying to a higher-trust registry.
- `GET /audit/{image}`: full triage report, including VEX state per finding.
- Policy as YAML; reload without restart.
- Plugin model: scanner backends and signature verifiers loaded at startup.
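The promotion decision at the core of `POST /promote` can be sketched as a pure function (hypothetical policy schema; field names are illustrative, not a fixed API):

```python
# Promotion gate: signed + SBOM present + no un-VEX'd findings above
# the policy's severity ceiling. Deterministic given its inputs.
SEVERITY = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def promotable(image: dict, policy: dict) -> bool:
    if policy.get("require_signature") and not image.get("signed"):
        return False
    if policy.get("require_sbom") and not image.get("sbom"):
        return False
    ceiling = SEVERITY[policy.get("max_severity", "medium")]
    return all(
        SEVERITY[f["severity"]] <= ceiling or f.get("vex") == "not_affected"
        for f in image.get("findings", [])
    )

image = {"signed": True, "sbom": True,
         "findings": [{"severity": "critical", "vex": "not_affected"},
                      {"severity": "medium"}]}
policy = {"require_signature": True, "require_sbom": True,
          "max_severity": "high"}
print(promotable(image, policy))  # True: the critical is VEX'd away
```

Keeping the decision a pure function over pre-fetched attestations is also what makes sub-second policy evaluation realistic.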
Non-functional spec¶
- 50 concurrent scans without degradation.
- Sub-second policy evaluation given pre-fetched attestations.
- Admission API compatible with Kubernetes ImagePolicyWebhook.
Architecture sketch¶
- Workers consuming a scan queue.
- `skopeo` for image fetching; `syft` and `trivy` shelled out to (or embedded as Go libraries).
- `cosign` Go SDK for signature verification.
- Postgres for findings storage; Prometheus for metrics; OTel for traces.
Test rigor¶
- Unit tests for policy evaluation.
- Integration tests against a local registry with known-good and known-bad images.
- Property tests: policy decisions are deterministic given inputs.
Hardening pass¶
- Service runs rootless in its own container.
- mTLS between worker and registries.
- Findings DB encrypted at rest.
Acceptance criteria¶
- Public repo, README with end-to-end demo.
- Demonstrate full flow: image with critical CVE → scan → signed VEX → policy decision → promotion gated correctly.
Skills exercised¶
- Months 5 (supply chain) heavily; Months 1, 4 supporting.
Track 3-Custom Runtime Fork¶
Outcome: a fork of runc (or youki) adding one substantial feature: gVisor-style sandbox, a custom seccomp generator, or eBPF-based per-container observability.
Suggested scopes¶
- `runc-trace`: a runc fork that, before `execve`, attaches an eBPF program tracing syscalls and emitting per-container telemetry to a userspace consumer. Useful for forensic environments.
- `runc-autoseccomp`: generates a tight seccomp profile during a "learning" run, then enforces it. Removes the manual seccomp-authoring burden.
- `runc-cap`: stricter capability defaults; introspects the rootfs and disables capability sets the binary doesn't appear to need.
Acceptance¶
- Forked from upstream at a tagged commit; documented merge plan to maintain rebase against upstream.
- The new feature is opt-in via `config.json` annotations or a CLI flag.
- Test coverage equivalent to upstream's for the touched code.
- A short blog post explaining the feature, the design, and the upstream-contribution plan.
Skills exercised¶
- All months. Track 3 is for the candidate who wants to contribute upstream eventually.
Cross-Track Requirements¶
- `container-baseline/` template (Appendix A) integrated.
- ADRs (≥3).
- `THREAT_MODEL.md`.
- Defense readiness: a 45-minute walkthrough.
Worked example - Week 14: Linux capabilities in a real Dockerfile¶
Companion to Container Internals → Month 04 → Week 14: Capabilities. The week explains the Linux capability model and the difference between root inside the container and root with everything. This page walks one Dockerfile through dropping capabilities one at a time so you can see which knobs do what.
Start from a normal container¶
# v0 - the default
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
curl iproute2 libcap2-bin && rm -rf /var/lib/apt/lists/*
CMD ["sleep", "infinity"]
Build and run:
$ docker build -t cap-demo .
$ docker run --rm -d --name c cap-demo
$ docker exec c capsh --print
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,
cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=ep
14 capabilities granted by default - the Docker default capability set. This is not "full root." A real root user on the host has ~40 capabilities (run capsh --print outside the container). Docker already drops ~26 of them by default, including the dangerous ones (CAP_SYS_ADMIN, CAP_SYS_MODULE, CAP_SYS_PTRACE).
But 14 is still too many. Let's see what each does and drop the ones we don't need.
What's actually in use¶
The CMD is `sleep infinity`; it needs no capabilities at all, and indeed it runs unchanged under `--cap-drop=ALL`.
But most real containers do something. Suppose we add a tiny HTTP server:
# v1 - a real workload
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 libcap2-bin && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY server.py .
CMD ["python3", "-m", "http.server", "8080"]
Built and run with `--cap-drop=ALL`, it still works: Python's HTTP server needs no capabilities because port 8080 is unprivileged (≥ 1024).
Try port 80:
$ docker run --rm -p 80:80 --cap-drop=ALL cap-demo python3 -m http.server 80
PermissionError: [Errno 13] Permission denied
Binding to ports below 1024 requires CAP_NET_BIND_SERVICE. Add it back, drop everything else:
$ docker run --rm -p 80:80 \
--cap-drop=ALL --cap-add=NET_BIND_SERVICE \
cap-demo python3 -m http.server 80
Works. We're now running with exactly one capability, the minimum to do the job.
Map the capabilities to attacks¶
This is the part most tutorials skip. Here's why each Docker-default capability is actually dangerous if you don't need it:
| Capability | What it allows | If an attacker gets RCE in the container |
|---|---|---|
| `CAP_CHOWN` | Change file ownership | Take over files in mounted volumes the container shouldn't own. |
| `CAP_DAC_OVERRIDE` | Bypass file read/write/execute permission checks | Read any file in mounted volumes, regardless of permissions. |
| `CAP_FOWNER` | Bypass permission checks on operations that normally require the file owner | Modify metadata on mounted files. |
| `CAP_FSETID` | Set setuid/setgid bits | Create privilege-escalation backdoors in writable mounted paths. |
| `CAP_KILL` | Send signals to any process | Kill other workloads sharing the same PID namespace. |
| `CAP_SETUID`/`CAP_SETGID` | Change process UID/GID | Pivot to other user contexts within the container. |
| `CAP_SETPCAP` | Change capability sets | Re-add dropped capabilities (when combined with namespaces). |
| `CAP_NET_BIND_SERVICE` | Bind to ports <1024 | Squat on a privileged port to MITM. |
| `CAP_NET_RAW` | Open raw sockets, send crafted packets | ARP spoofing, packet sniffing if not isolated by the network namespace. |
| `CAP_SYS_CHROOT` | Use `chroot()` | Limited escape vectors. |
| `CAP_MKNOD` | Create device files | Create `/dev/sda` and read the raw disk if the filesystem isn't sealed. |
| `CAP_AUDIT_WRITE` | Write to the kernel audit log | Spam audit logs to hide other activity. |
| `CAP_SETFCAP` | Set file capabilities | Persist privileges on dropped binaries. |
CAP_NET_RAW and CAP_MKNOD are the ones most production guides specifically call out to drop.
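The kernel reports these sets as hex bitmasks (`CapEff` etc. in `/proc/<pid>/status`). A small decoder sketch, covering only the low bits relevant here (bit numbers from `linux/capability.h`):

```python
# Decode a CapEff-style hex mask into capability names. Only the first
# 14 bits are listed -- enough to cover Docker's default set.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw",
]

def decode(mask_hex: str) -> list[str]:
    mask = int(mask_hex, 16)
    return [n for bit, n in enumerate(CAP_NAMES) if mask & (1 << bit)]

# NET_BIND_SERVICE is bit 10, so a one-capability container's
# effective set decodes to exactly:
print(decode("0000000000000400"))  # ['cap_net_bind_service']
```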
The drop-and-add pattern¶
Production-grade Dockerfile + run command:
FROM debian:bookworm-slim
RUN useradd -r -u 10001 app
USER 10001
COPY --chown=10001:10001 server.py /app/server.py
CMD ["python3", "/app/server.py"]
$ docker run --rm \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
--read-only \
--tmpfs /tmp \
--security-opt=no-new-privileges \
-p 80:80 \
your-image
That's the minimum-privilege starting point. Walk the flags:
- `--cap-drop=ALL --cap-add=NET_BIND_SERVICE`: exactly one capability.
- `--read-only`: the root filesystem is read-only; defeats most persistence.
- `--tmpfs /tmp`: gives the app a writable scratch space (it needs somewhere to write).
- `--security-opt=no-new-privileges`: sets the `NoNewPrivs` bit; even setuid binaries can't gain capabilities now.
- `USER 10001`: non-root user inside the container. Capabilities are bounded by both the kernel ruleset and the UID; this defense-in-depth matters.

One subtlety: when the process runs as a non-root `USER`, `--cap-add` keeps the capability in the bounding set, but the process won't have it in its effective set unless the binary carries file capabilities (`setcap 'cap_net_bind_service=+ep'`) or the kernel's `net.ipv4.ip_unprivileged_port_start` is lowered.
The trap¶
Most production teams set `--cap-drop=ALL` for "their" containers and then leave third-party sidecars (logging agents, service-mesh proxies) with default capabilities. Those containers are just as exploitable and often expose a larger attack surface (network exposure, mounted secrets). Audit them too.
The other trap: capability dropping is enforced by the same kernel the container talks to, so it does not defend against container-escape vulnerabilities (e.g. runc CVEs). For those you still want additional barriers: a seccomp profile (next week), AppArmor/SELinux, or gVisor/Kata for high-isolation needs.
Exercise¶
- Take any container image you currently run. Inspect what capabilities it requests (`docker inspect | jq '.[].HostConfig.CapAdd'`) and what it actually uses (`grep Cap /proc/<pid>/status` from inside).
- Write the smallest possible `--cap-drop`/`--cap-add` set that lets it still work. Document what each kept capability is for.
- Repeat for a Kubernetes Pod. The same fields live in `securityContext.capabilities`, per container.
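The `grep` from step 1 can also be done programmatically. A sketch that reads this process's own capability sets (Linux only; run it inside the container to see the reduced masks):

```python
# Parse the Cap* lines from a /proc/<pid>/status file into int masks.
def read_caps(path: str = "/proc/self/status") -> dict[str, int]:
    caps = {}
    with open(path) as f:
        for line in f:
            if line.startswith("Cap"):
                key, _, val = line.partition(":")
                caps[key] = int(val.strip(), 16)
    return caps

print(sorted(read_caps()))  # ['CapAmb', 'CapBnd', 'CapEff', 'CapInh', 'CapPrm']
```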
Related reading¶
- The main Week 14 chapter covers the underlying kernel model.
- The Week 15 chapter on seccomp is the syscall-level companion to capability dropping.
- Kubernetes Mastery - security context covers applying these in Pod specs.
- Glossary: Capability, Seccomp, NoNewPrivs in the main glossary.