Capstone Catalog¶
Every path ends with a capstone month. The capstone is one substantial project, designed and defended end-to-end. Pick one whose constraints match what you want to learn -- distributed-systems engineering, ML infrastructure, kernel work, or any of the other paths' terminal tracks.
8 paths with capstones.
Jump to a path¶
- AI Systems
- Container Internals
- Go Mastery
- Java Mastery
- Kubernetes
- Linux Kernel
- Python Mastery
- Rust Mastery
AI Systems¶
Pick one. The work performed here is what you describe in interviews and link from a portfolio.
Track 1-Inference Engine: A Mini-vLLM¶
Outcome: an LLM inference server you wrote, with paged KV-cache, continuous batching, and at least one of (FP8 weights, AWQ INT4, speculative decoding). Benchmarked within 2× of production vLLM on a 7B model.
Functional spec¶
- HTTP API: `POST /v1/completions` and `POST /v1/chat/completions` (a subset of OpenAI's API).
- Server-sent-events streaming output.
- Continuous batching with paged KV-cache.
- One quantization scheme (your choice: AWQ W4A16 with Marlin kernel, or FP8 weights via TransformerEngine).
- Optionally: prefix caching, speculative decoding.
- Health, readiness, metrics endpoints.
Non-functional spec¶
- Throughput within 2× of production vLLM for a 7B model on the same hardware. (vLLM is the bar; matching it is implausible in 24 weeks. Within 2× is achievable and impressive.)
- TTFT p99 < 1s for 1K-token prompts under steady-state load.
- TPOT p99 < 30 ms after first token.
- Memory: stable under 8-hour load; no leaks.
Architecture sketch¶
HTTP server (FastAPI/Axum)
│
▼
Request queue ──► Scheduler (Python or Rust)
│
▼
Block manager (page table, free list)
│
▼
Model runner (Python/Triton/CUDA)
│
▼
Token streamer
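The block manager in the sketch above (page table + free list) is the piece most worth prototyping and unit-testing in isolation. A minimal sketch in Python; the `BlockManager` name and API are illustrative, not vLLM's:

```python
class BlockManager:
    """Toy paged KV-cache allocator: fixed-size blocks, a free list,
    and a per-sequence page table. Illustrative sketch only."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size          # tokens per block
        self.free_list = list(range(num_blocks))
        self.page_table = {}                  # seq_id -> [block ids]

    def allocate(self, seq_id: str, num_tokens: int) -> list[int]:
        # Blocks needed for num_tokens, rounded up to block granularity.
        needed = -(-num_tokens // self.block_size)
        if needed > len(self.free_list):
            raise MemoryError("out of KV-cache blocks: preempt or evict")
        blocks = [self.free_list.pop() for _ in range(needed)]
        self.page_table[seq_id] = blocks
        return blocks

    def free(self, seq_id: str) -> None:
        # Return the finished sequence's blocks to the free list.
        self.free_list.extend(self.page_table.pop(seq_id))

    def leaked(self, total_blocks: int) -> int:
        # Leak check: every block is either free or owned by a sequence.
        owned = sum(len(b) for b in self.page_table.values())
        return total_blocks - len(self.free_list) - owned
```

The `leaked` accounting is exactly what the allocate/free/leak-detection unit tests in the next section exercise.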
Test rigor¶
- Unit tests for the block manager (allocate/free/leak detection).
- Integration: warmup load, sustained load, mixed-prompt-length load.
- Correctness: outputs match a reference HF implementation for greedy decoding.
- Stress: `kill -9` the process under load; restart; verify recovery.
Hardening pass¶
- `pprof`-style metrics; `nsys` profile of one full request lifecycle committed to the repo.
- `ncu` profile of the attention kernel.
- Cost/quality matrix.
Acceptance criteria¶
- Public repo with build + run + benchmark scripts.
- A README with: architecture diagram, benchmark table, profiling artifacts, "what's next" section.
- A blog post explaining one non-obvious decision (e.g., your block size choice, your eviction policy).
Skills exercised¶
- All months. Heaviest on Months 2 (kernels), 3 (framework integration), 5 (serving).
Track 2-Training Systems: FSDP From Scratch¶
Outcome: a working sharded data-parallel training implementation, written from scratch on top of torch.distributed primitives. Trains a small transformer on 4–8 GPUs across 1–2 nodes with documented scaling efficiency.
Functional spec¶
- A `MyFSDP` wrapper that:
  - Shards parameters across ranks.
  - Allgathers parameters before forward; frees after.
  - Reduces gradients via reduce-scatter.
  - Supports activation checkpointing.
  - Mixed-precision (BF16 compute, FP32 master).
  - Gradient accumulation.
  - Resumable checkpoints.
- A reference training script that uses it to train a ~500M parameter transformer on a tokenized corpus.
Non-functional spec¶
- Scaling efficiency ≥85% on 8 GPUs (single node) vs single-GPU baseline.
- Scaling efficiency ≥75% on 16 GPUs across 2 nodes.
- Resumed run produces identical loss (within 1e-4) compared to a continuous run.
- Throughput within 30% of PyTorch's native FSDP-2 on the same workload.
Test rigor¶
- Numerical correctness: 4-rank `MyFSDP` matches a single-rank reference for one full training step (allclose at 1e-3 in BF16).
- Memory measurement: peak HBM matches `model_size / num_ranks + activation_overhead`.
- Failure injection: kill one rank mid-epoch; observe the NCCL timeout; document the recovery path.
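The memory-measurement check reduces to arithmetic you can sanity-check before ever attaching a profiler. A worked example (all numbers illustrative; a real budget also holds gradients, FP32 master weights, and optimizer state):

```python
def expected_peak_gib(param_count: float, bytes_per_param: int,
                      num_ranks: int, activation_overhead_gib: float) -> float:
    """Expected per-rank peak HBM for fully sharded parameters:
    model_size / num_ranks + activation_overhead. Parameters only --
    gradients, master weights, and optimizer state add more."""
    model_gib = param_count * bytes_per_param / 2**30
    return model_gib / num_ranks + activation_overhead_gib

# ~500M params in BF16 (2 bytes/param), sharded over 8 ranks,
# with an assumed ~2 GiB of activation overhead:
peak = expected_peak_gib(500e6, 2, 8, 2.0)
```

If measured peak HBM lands far above this kind of estimate, something (unsharded buffers, retained activations) is not being freed.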
Hardening pass¶
- NCCL tuning (`NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, `NCCL_TOPO_FILE`) documented.
- `nsys` profile showing allgather/reduce-scatter overlap with compute.
- Cost calculation (GPU-hours × $/hr × scaling efficiency).
Acceptance criteria¶
- Public repo with infra-as-code (Terraform/Ansible) for bringing up a 2-node cluster, plus the FSDP code, plus the training script.
- A `SCALING_REPORT.md` with the efficiency numbers, the optimization journey (each tuning step's effect), and a comparison against native FSDP-2.
Skills exercised¶
- All months. Heaviest on Months 3 (framework), 4 (distributed).
Track 3-GPU Kernel Track: A Competitive Fused Attention¶
Outcome: a fused attention kernel in Triton (and optionally CUTLASS), competitive with FlashAttention-2 for at least one common shape regime, complete with profiling, autograd, and a tested PyTorch integration.
Functional spec¶
- A Triton kernel implementing causal flash-attention (forward + backward).
- Configurable for: BF16 / FP16, head dim 64/128, GQA support.
- Drop-in replacement for `F.scaled_dot_product_attention` for the supported shape range.
- Numerically equivalent to the reference (allclose at 1e-3 in BF16).
Non-functional spec¶
- Within 1.5× of FlashAttention-2 forward+backward time at one chosen shape (e.g., `B=4, H=32, S=4096, D=128`).
- Validated on at least two GPU classes (e.g., A100 + H100 if both accessible; A100 + RTX 4090 acceptable).
- Compiles via `torch.compile` without graph breaks when used in a small transformer.
Test rigor¶
- Correctness: random-input testing against reference attention; gradient testing against `torch.autograd.gradcheck` (FP32 reference).
- Perf: `ncu` reports for forward and backward, attached to the repo.
- Edge cases: short sequences, variable-length (padding-aware), large batch.
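The core numerical trick behind a fused flash-attention kernel is the online softmax: a running max and running denominator let one pass over KV blocks accumulate the softmax-weighted sum without ever materializing the full score row. A pure-Python sketch for a single query position:

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax(scores) · values using a running max, the
    streaming accumulation at the heart of flash-attention.
    Scalar values for clarity; real kernels accumulate vectors."""
    m = float("-inf")   # running max of scores seen so far
    d = 0.0             # running softmax denominator
    acc = 0.0           # running weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)          # rescale old accumulators
        d = d * scale + math.exp(s - m_new)
        acc = acc * scale + math.exp(s - m_new) * v
        m = m_new
    return acc / d
```

This is also the identity your random-input correctness tests are implicitly checking: the streamed result must match the two-pass (materialize, normalize, multiply) reference to within the tolerance of the dtype.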
Hardening pass¶
- Autotune configs documented with rationale.
- Performance-regression CI (lock benchmark numbers; alert on >5% regression).
Acceptance criteria¶
- Public repo with the kernel, tests, benchmarks, and a clear README.
- One submitted PR (even if not merged) to a real project: vLLM (as a backend), Liger Kernel, or pytorch/ao.
- A blog post analyzing one design choice in the kernel: block sizes, software pipelining stages, or register-pressure tradeoffs.
Skills exercised¶
- All months, but most concentrated on Months 2 (GPU programming), 3 (framework integration), and the inference math from Month 5.
Cross-Track Requirements¶
Regardless of track:
- `ai-systems-baseline/` template (Appendix A) integrated.
- ADRs: ≥3 for non-obvious decisions.
- Threat model: at minimum, one page covering input attacks, supply-chain risk, and infrastructure failure modes.
- Cost model: per the workload, what's the steady-state $/hour and $/output-unit?
- Defense readiness: a 60-minute walkthrough you can deliver to a peer or hiring manager.
The track choice signals career direction:
- Track 1 (Inference) → inference-engineer roles at frontier labs, latency-sensitive serving teams, model-serving startups.
- Track 2 (Training) → training-infra roles at frontier labs, large-scale-training teams, framework engineering teams.
- Track 3 (Kernels) → GPU performance engineering, compiler/runtime teams (NVIDIA, OpenAI, Meta), specialized inference-accelerator teams.
Pick based on where you want the next interview loop, not on what looks easiest.
Container Internals¶
Pick one. The work performed here is what you describe in interviews.
Track 1-Mini-Docker (the curriculum's default)¶
Outcome: a from-scratch container runner in Go or Rust, demonstrating manual orchestration of namespaces, cgroups v2, OverlayFS, capabilities, seccomp, and a working subset of OCI runtime spec.
Functional spec¶
- Read an OCI runtime bundle (`config.json` + `rootfs/`).
- Lifecycle: `create`, `start`, `state`, `kill`, `delete`, plus `run` (create+start).
- Apply: PID/NET/MNT/UTS/IPC/USER/CGROUP namespaces, capability dropping, seccomp filter, cgroups-v2 (memory/cpu/pids), `pivot_root`, masked/read-only paths.
- Spec compliance: passes a meaningful subset of the OCI runtime spec validator.
Non-functional spec¶
- Implemented in <2,500 lines of code (excluding tests).
- Memory footprint <20 MB resident.
- Container-start latency <100 ms on a baseline workload (matches `runc` within 2×).
Architecture sketch¶
- Top-level CLI parses the lifecycle subcommand.
- Runtime orchestrator: state machine over the bundle's lifecycle.
- Init re-exec pattern (like `runc`'s "init"): the binary re-execs itself as the container's PID 1 supervisor, performs final mount + capability + seccomp setup, then execs the user command.
- State persistence in `/run/<runtime-name>/<id>/state.json`.
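The runtime orchestrator's lifecycle state machine is small enough to pin down as a transition table before writing any namespace code. A simplified sketch (state names follow the OCI runtime spec; the table itself is our simplification and omits `creating` and error states):

```python
# Legal lifecycle transitions: (current_state, operation) -> next_state.
# None stands for "container does not exist". delete only applies to
# stopped containers, per the OCI runtime spec.
TRANSITIONS = {
    (None,      "create"): "created",
    ("created", "start"):  "running",
    ("created", "kill"):   "stopped",
    ("running", "kill"):   "stopped",
    ("stopped", "delete"): None,
}

def advance(state, op):
    """Apply one lifecycle operation, rejecting anything not in the table."""
    try:
        return TRANSITIONS[(state, op)]
    except KeyError:
        raise RuntimeError(f"illegal transition: {op} from {state!r}")
```

Driving every CLI subcommand through one such table makes the "malformed sequence of operations" chaos tests trivial to write.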
Test rigor¶
- Unit tests for: bundle parsing, namespace setup, cgroup writes, seccomp compilation.
- Integration tests: run real bundles (Alpine, BusyBox, a small Go binary) and assert behavior.
- Spec-compliance test against (a subset of) `runtime-tools/validation`.
- Chaos: malformed configs, missing rootfs, OOM, exhausted PIDs.
Hardening pass¶
- Default seccomp profile equivalent to `containers/common/pkg/seccomp`.
- Default capability set: `CAP_NET_BIND_SERVICE` only.
- Read-only rootfs unless explicitly opted in.
- Rootless support (user namespace path tested).
Acceptance criteria¶
- Public repo, documented architecture.
- A README walkthrough: from `runc spec` to running `nginx` under your runtime.
- Integration tests in CI, passing.
- A short paper (3–5 pages) comparing your runtime to `runc`/`crun`/`youki`: what's similar, what's different, what you skipped and why.
Skills exercised¶
- All months, but heaviest on Months 1 (OCI specs), 2 (filesystems), 3 (runtimes), 4 (security primitives).
Track 2-Image Scanning & Signing Service¶
Outcome: an HTTP service that ingests OCI image references, runs Syft + Grype + Trivy, attaches signed SBOMs and VEX statements, and exposes a policy-gated promotion API.
Functional spec¶
- `POST /scan {image}`: scan, return findings.
- `POST /promote {image, target}`: verify the image meets policy (signature, SBOM, vuln thresholds, SLSA provenance) before copying to a higher-trust registry.
- `GET /audit/{image}`: full triage report, including VEX state per finding.
- Policy as YAML; reload without restart.
- Plugin model: scanner backends and signature verifiers loaded at startup.
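The promotion gate is, at heart, a pure function from attestations and findings to a decision, which is exactly what makes the "deterministic given inputs" property test natural. A hedged sketch; the field names and policy schema are invented for illustration:

```python
def promote_decision(image: dict, policy: dict) -> tuple[bool, list[str]]:
    """Return (allowed, reasons). Both arguments are plain dicts; the
    schema is illustrative, not any real scanner's output format."""
    reasons = []
    if policy.get("require_signature") and not image.get("signed"):
        reasons.append("missing signature")
    if policy.get("require_sbom") and not image.get("sbom"):
        reasons.append("missing SBOM")
    # Count only findings not waived by a VEX "not_affected" statement.
    active = [f for f in image.get("findings", [])
              if f.get("vex") != "not_affected"]
    for sev, limit in policy.get("max_findings", {}).items():
        n = sum(1 for f in active if f["severity"] == sev)
        if n > limit:
            reasons.append(f"{n} {sev} findings (limit {limit})")
    return (not reasons, reasons)
```

Keeping the decision pure (no I/O inside) is what lets the property tests assert determinism and lets the same function back both `/promote` and the Kubernetes admission path.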
Non-functional spec¶
- 50 concurrent scans without degradation.
- Sub-second policy evaluation given pre-fetched attestations.
- Admission API compatible with Kubernetes ImagePolicyWebhook.
Architecture sketch¶
- Workers consuming a scan queue.
- `skopeo` for image fetching; `syft` and `trivy` shelled out (or embedded as Go libs).
- `cosign` Go SDK for signature verification.
- Postgres for findings storage; Prometheus for metrics; OTel for traces.
Test rigor¶
- Unit tests for policy evaluation.
- Integration tests against a local registry with known-good and known-bad images.
- Property tests: policy decisions are deterministic given inputs.
Hardening pass¶
- Service runs rootless in its own container.
- mTLS between worker and registries.
- Findings DB encrypted at rest.
Acceptance criteria¶
- Public repo, README with end-to-end demo.
- Demonstrate full flow: image with critical CVE → scan → signed VEX → policy decision → promotion gated correctly.
Skills exercised¶
- Month 5 (supply chain) heavily; Months 1 and 4 supporting.
Track 3-Custom Runtime Fork¶
Outcome: a fork of runc (or youki) adding one substantial feature: gVisor-style sandbox, a custom seccomp generator, or eBPF-based per-container observability.
Suggested scopes¶
- `runc-trace`: a runc fork that, before `execve`, attaches an eBPF program tracing syscalls and emitting per-container telemetry to a userspace consumer. Useful for forensic environments.
- `runc-autoseccomp`: generates a tight seccomp profile during a "learning" run, then enforces it. Removes the manual seccomp-authoring burden.
- `runc-cap`: stricter capability defaults; introspects the rootfs and disables capability sets the binary doesn't appear to need.
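The `runc-autoseccomp` idea fits in a few lines of logic: union the syscalls observed across learning runs, add a margin of syscalls that short runs rarely exercise, and emit an allowlist profile. A sketch; the profile shape loosely follows seccomp JSON but the details here are illustrative:

```python
def build_profile(observed_runs, extra=("rt_sigreturn", "exit_group")):
    """Union the syscalls seen across learning runs into an allowlist
    seccomp profile. `extra` covers syscalls (e.g. clean termination
    paths) that short learning runs often fail to exercise."""
    allowed = set(extra)
    for run in observed_runs:
        allowed.update(run)
    return {
        "defaultAction": "SCMP_ACT_ERRNO",   # deny anything unobserved
        "syscalls": [{
            "names": sorted(allowed),
            "action": "SCMP_ACT_ALLOW",
        }],
    }
```

The hard engineering is in the tracer and in deciding the `extra` margin, not in this assembly step; underestimating the margin turns rare code paths into EPERM crashes in production.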
Acceptance¶
- Forked from upstream at a tagged commit; a documented plan for staying rebased against upstream.
- The new feature is opt-in via `config.json` annotations or a CLI flag.
- Test coverage equivalent to upstream's for the touched code.
- A short blog post explaining the feature, the design, and the upstream-contribution plan.
Skills exercised¶
- All months. Track 3 is for the candidate who wants to contribute upstream eventually.
Cross-Track Requirements¶
- `container-baseline/` template (Appendix A) integrated.
- ADRs (≥3).
- `THREAT_MODEL.md`.
- Defense readiness: 45-minute walkthrough.
Go Mastery¶
The Month 6 capstone is the deliverable that converts this curriculum from study into evidence. Pick one track. The work performed here is what you describe in interviews and link from a portfolio.
Track 1-Distributed Storage: A Raft-Replicated KV Store¶
Outcome: a 3+ node Raft-replicated key-value store with linearizable reads, snapshots, online membership changes, and a Jepsen-style fault-injection harness verifying linearizability.
Functional spec¶
- gRPC API: `Get(key)`, `Put(key, value)`, `Delete(key)`, `Watch(prefix)` (stream).
- Cluster API: `AddNode`, `RemoveNode`, `Leadership`.
- Linearizable reads via read-index.
- Snapshots every N entries (default 10K) with `InstallSnapshot` to recovering followers.
- Persistent WAL via Pebble or BoltDB.
- TLS between nodes; mutual auth via x509.
Non-functional spec¶
- Sustained 50K writes/sec on commodity hardware (3-node, NVMe).
- Sub-10 ms write latency p99 under 50% utilization.
- Recovery time (leader change → fully available) under 1 s for a 3-node cluster.
- Survives a single-node crash without data loss; survives a network partition with a clear majority.
Architecture sketch¶
- One goroutine per node consumes from the `etcd-io/raft` `Ready` channel.
- Apply loop: stream committed entries → state machine → respond to clients.
- Network: gRPC with a long-lived bidi stream per peer pair.
- State machine: a sharded `map[string][]byte` with versioning for `Watch`.
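One invariant in that apply loop worth unit-testing directly: the leader's commit index is the highest log index replicated on a majority of nodes. A sketch (real Raft additionally requires the entry's term to equal the current term before committing):

```python
def commit_index(match_index: dict[str, int], own_last_index: int) -> int:
    """Highest log index stored on a majority of the cluster.
    match_index maps follower id -> highest replicated index;
    the leader's own log counts toward the majority too."""
    indexes = sorted(list(match_index.values()) + [own_last_index])
    n = len(indexes)
    # A majority is n//2 + 1 nodes; the highest index held by at least
    # that many nodes sits at position (n - 1) // 2 of the ascending sort.
    return indexes[(n - 1) // 2]
```

This is the kind of pure helper the log-truncation and membership unit tests can hammer exhaustively, independent of networking.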
Test rigor¶
- Unit: state-machine transitions, log-truncation invariants.
- Integration: 3-node local cluster via `t.Run`; exercise membership changes.
- Fault injection: a "nemesis" goroutine that randomly partitions, pauses, and crashes nodes; the client operation history is fed to a linearizability checker (Knossos via subprocess, or a lightweight Go port).
- Race-clean under sustained load.
Hardening pass¶
- `goreleaser`, `cosign` signing, SBOM via `cyclonedx-gomod`.
- `GOMEMLIMIT` from the cgroup; `automaxprocs`.
- PGO with a representative workload.
- `pprof` + `runtime/trace` capture endpoints.
- OTel traces across the Raft RPC layer (custom interceptor).
- A `RUNBOOK.md` covering: leader-stuck triage, log-corruption recovery, snapshot-restore procedure.
Acceptance criteria¶
- Public repo with all of the above.
- A README that includes: a topology diagram, a load-test latency CDF, a Jepsen-style report.
- Defensible answer to: "What happens during a network partition where a majority can elect a new leader but the old leader is still up?"
Skills exercised¶
- Months 3 (concurrency), 5 (gRPC, observability), 6.21–6.22 (Raft, distributed storage).
Track 2-Service Mesh: A gRPC Microservices Mesh¶
Outcome: a multi-service mesh demonstrating a custom service registry, health checking, deadline propagation, retries, outlier ejection, and end-to-end OTel tracing across at least four interconnected services.
Functional spec¶
- A Registry service: gRPC interface for `Register`, `Deregister`, `Watch`, `LookupHealthy`. Backed by an in-memory store with optional Raft replication (composes with Track 1).
- A Sidecar library that:
  - Resolves service names via the registry (custom gRPC `resolver.Builder`).
  - Implements client-side load balancing with round-robin + outlier ejection.
  - Propagates OTel context, deadlines, and a `request_id`.
  - Adds retry policy via service config.
- Four demo services (e.g., `user`, `order`, `inventory`, `payment`) with a fan-out call graph that exercises retries, timeouts, and partial failures.
- A `mesh-cli` for service inspection and chaos injection.
Non-functional spec¶
- Sub-millisecond p99 sidecar overhead per RPC.
- Outlier ejection within 10 s of an endpoint going bad.
- Deadline propagation: an inbound 1 s deadline must result in downstream calls seeing strictly less than 1 s remaining.
Architecture sketch¶
- Each service runs the sidecar library in-process (no separate sidecar binary; keep it simple and defensible).
- Registry uses `etcd-io/raft` if Track 1 is also chosen; otherwise a single instance with TLS.
- Service discovery uses long-poll `Watch` via gRPC server-streaming.
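Deadline propagation is the mesh behavior most worth pinning with a test: every hop must hand downstream strictly less time than it received. A sketch of the budget arithmetic; the `hop_margin_s` knob is our invention, not a gRPC parameter:

```python
def downstream_deadline(remaining_s: float, hop_margin_s: float = 0.05) -> float:
    """Deadline budget to attach to an outbound call, given the time
    left on the inbound deadline. Reserve hop_margin_s for this hop's
    own work and response path; refuse to fan out on an exhausted
    budget rather than issue a call doomed to be cancelled."""
    budget = remaining_s - hop_margin_s
    if budget <= 0:
        raise TimeoutError("deadline already exhausted; fail fast")
    return budget
```

Chaining this through the fan-out graph is what makes the spec's "inbound 1 s deadline → downstream sees strictly less than 1 s" property mechanically true.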
Test rigor¶
- Unit: resolver, balancer, interceptor stacks.
- Integration: spin up all four services in-process and exercise the call graph, using `testcontainers` for the registry's Postgres if used.
- Chaos: a `chaos-injector` middleware that drops/delays/errors a random percentage of requests.
- Latency tests with `ghz` at multiple QPS levels.
Hardening pass¶
- `pprof` everywhere; OTel everywhere.
- `goleak` per service.
- A reproducible Docker Compose stack and a one-command `make demo` that brings it up with Jaeger and Prometheus.
- Alarms wired: Prometheus rules on per-service error rate, p99 latency, registry watch lag.
Acceptance criteria¶
- All four services deployable with `make demo`.
- A flame graph demonstrating where the sidecar overhead lives.
- A trace screenshot showing deadline-propagated failure across the call chain.
- Defensible answer to: "What happens if the registry leader is down for 30 seconds?"
Skills exercised¶
- Months 3 (concurrency), 5 (gRPC mastery, observability), 6 (capstone defense, performance).
Track 3-Streaming Pipeline: A Kafka-Compatible Ingestion + Stream Processor¶
Outcome: a Kafka-protocol-compatible (subset) broker plus a stream-processing framework, with at-least-once delivery, exactly-once-effective consumer offsets, and replay.
Functional spec¶
- Broker: implements a subset of the Kafka wire protocol (Produce, Fetch, Metadata, ListOffsets, OffsetCommit, OffsetFetch). Disk-backed log per partition; segment + index files.
- Stream processor: a small framework letting users write `func(input Stream[T]) Stream[U]` with operators (`Map`, `Filter`, `Window`, `Aggregate`, `Join`).
- Consumer: offset management, rebalance protocol (subset).
- Producer: idempotent producer (within a session).
- Compatibility: works with `franz-go` (the leading Kafka Go client) for at least Produce/Fetch.
Non-functional spec¶
- 200K msgs/sec sustained on a single partition (commodity NVMe).
- Sub-50 ms producer ack p99 with `acks=all`.
- Replay from an arbitrary offset.
- Crash-recoverable: WAL fsync semantics documented.
Architecture sketch¶
- One goroutine per partition for the disk-write path.
- mmap'd index files; sequential append to log files.
- Replication: Raft per partition (composes with Track 1) or a simpler primary-backup with a documented data-loss window.
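The "segment + index files" design implies one lookup worth getting right early: find the segment whose base offset is the greatest one not exceeding the requested offset. With base offsets kept sorted, that is a single `bisect` call. A sketch:

```python
import bisect

def find_segment(base_offsets: list[int], offset: int) -> int:
    """Index of the log segment containing `offset`, given the sorted
    base offsets of all segments (in Kafka these are derivable from
    file names like 00000000000000314159.log)."""
    i = bisect.bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise KeyError(f"offset {offset} precedes the earliest retained segment")
    return i
```

The same greatest-entry-not-exceeding lookup recurs inside each segment's sparse index, so this helper earns its unit tests twice over.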
Test rigor¶
- Unit: log segment boundary handling, offset arithmetic, index lookup.
- Integration: produce-and-consume tests against `franz-go`.
- Fuzz: protocol parser fuzzed against malformed records.
- Crash test: kill -9 during write; restart; verify WAL recovery.
Hardening pass¶
pproffor the hot path (the produce-write loop must be0 allocs/opper record).- PGO with a sustained-throughput profile.
runtime/traceartifact showing zero scheduler stalls under load.
Acceptance criteria¶
- Public repo, a reference-grade README.
- A throughput/latency benchmark vs. real Kafka on the same hardware.
- A replay demo: rewind a consumer offset to a specific timestamp.
Skills exercised¶
- Months 2 (memory + GC tuning, allocation discipline), 3 (concurrency at 200K msgs/sec), 5 (observability), 6.22 (storage patterns).
Cross-Track Requirements¶
Regardless of track:
- Hardening template integrated. The `hardening/` template from Appendix A applies.
- Architectural Decision Records (ADRs). At least three for the capstone, each ~1 page.
- Threat model. One page minimum, no matter the track.
- Defense readiness. You should be able to walk a reviewer through the code in 45 minutes and answer "what fails first under load / fuzzing / a malicious input / a network partition?"
The track choice signals career direction: Track 1 for distributed-systems infrastructure roles, Track 2 for platform/SRE/networking roles, Track 3 for data-infra/streaming roles. Pick based on where you want the next interview loop, not on what looks easiest.
Java Mastery¶
Three tracks. Pick one. Each is sized for the four-week Month 6 schedule in 06_MONTH_CAPSTONE.md. Each forces the full curriculum into one artifact.
Common requirements across tracks:
- Built on Java 25 LTS, virtual threads + structured concurrency where appropriate.
- gRPC or HTTP/2 over java.net.http for cross-component RPC.
- Full observability: Micrometer Prometheus metrics, OpenTelemetry traces, structured JSON logs with trace-ID correlation.
- Continuous JFR recording, GC logs to disk, heap-dump-on-OOM.
- Testcontainers-based integration tests, jqwik property tests for core invariants, jcstress for custom concurrency.
- One JMH benchmark suite covering at least one hot path.
- Public GitHub repo, README, design doc, runbook, demo script.
Track 1 - Distributed Storage: Raft-Backed Key-Value Store¶
Goal¶
A 3-to-5-node replicated KV store. Linearizable single-key reads/writes. Snapshot/restore. Membership changes (add/remove node) optional but encouraged.
Required reading¶
- Diego Ongaro, In Search of an Understandable Consensus Algorithm (the Raft paper).
- Ongaro's PhD thesis chapters on log compaction and membership.
- One mature Raft implementation for reference: Atomix, jraft (SOFAStack), or HashiCorp's `raft` (Go). Read it; don't copy it.
Core scope (must-have)¶
- Leader election with randomized timeouts.
- Log replication with majority commit.
- Linearizable reads via read-index or leader leases.
- Persistent log on disk (append-only segments + index).
- gRPC for inter-node communication and client API.
- Snapshot + restore.
Stretch scope¶
- Joint-consensus membership changes (JEP-style "stable until I touch it" config).
- Leader leases for fast reads without read-index.
- Multi-Raft (sharding by key range).
- Linearizability checker (Jepsen-style; e.g. Knossos invoked from CI).
Failure scenarios to test¶
- Kill leader mid-write. New leader within timeout. No committed entries lost.
- Network partition isolating the leader. Minority side cannot make progress; majority elects new leader.
- Slow disk on a follower. Throughput degrades to the slowest acknowledged majority; does not stall the cluster.
- Restart an entire node. Recovers from disk log + snapshot.
Why this track¶
Forces you into: virtual threads under load, structured concurrency for fan-out RPCs, careful JMM reasoning for the consensus state machine, gRPC, persistence, JFR-driven tuning, failure injection.
Track 2 - Service Mesh: gRPC Microservice Mesh with Custom Control Plane¶
Goal¶
A working microservices mesh with multiple backend services, a service registry, a load-balancing client-side proxy, deadline propagation, circuit-breaker-driven outlier ejection, and end-to-end OpenTelemetry tracing.
Required reading¶
- The gRPC documentation on name resolution, load balancing, and retries.
- The Envoy/Istio docs on the data-plane / control-plane split (for terminology; you don't need to use Envoy).
- Resilience4j docs.
- Sam Newman, Building Microservices, 2nd ed., chapters on resilience and observability.
Core scope (must-have)¶
- A registry service (in-memory, persisted, or backed by etcd via jetcd) where backends register/deregister with TTL.
- At least three backend services with non-trivial dependencies among them (e.g., `inventory` → `pricing` → `tax`).
- A client-side proxy library (gRPC interceptors) that:
  - Resolves a service name to instances from the registry.
- Resolves a service name to instances from the registry.
- Load-balances across them (round-robin minimum; weighted/pick-the-shortest-queue stretch).
- Propagates deadlines and trace context.
- Circuit-breaks per-instance based on error/latency.
- Ejects outliers from the LB pool.
- Full OpenTelemetry trace per top-level request, end-to-end through all services.
- Prometheus metrics: per-service request count, latency histogram, error rate, circuit-breaker state.
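Success-rate-based outlier ejection (the Envoy algorithm referenced below, heavily simplified) is worth prototyping as a pure function before wiring it into interceptors. A sketch; the thresholds are illustrative:

```python
def eject_outliers(stats: dict[str, tuple[int, int]],
                   min_requests: int = 20,
                   max_error_rate: float = 0.5) -> set[str]:
    """Instances to remove from the LB pool for this window.
    `stats` maps instance id -> (requests, errors). Instances with too
    little traffic are never ejected: not enough evidence either way."""
    ejected = set()
    for instance, (requests, errors) in stats.items():
        if requests >= min_requests and errors / requests > max_error_rate:
            ejected.add(instance)
    return ejected
```

Production implementations add a max-ejection percentage so a correlated failure (e.g. a bad dependency) cannot empty the pool; that guard is a good stretch test case.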
Stretch scope¶
- mTLS between services (use `java.security` and a local CA).
- A "canary" feature: route N% of traffic to a different version of a backend.
- Outlier detection based on per-instance success-rate (Envoy's algorithm is a fine reference).
- Throttling / rate limiting at the proxy.
Failure scenarios to test¶
- Kill one backend instance mid-request. Client retries to another instance; the failed instance is ejected from the LB pool.
- Slow backend (inject 5s sleeps). Deadline propagation cancels the request through the chain.
- Cascading failure: backend B fails; the circuit breaker opens; backend A degrades gracefully (returns cached/partial data rather than stalling).
- Registry restart. Services re-register; client refreshes its view.
Why this track¶
Forces you into: gRPC depth, virtual-thread-friendly RPC patterns, Resilience4j composition, deadline propagation (subtle), OpenTelemetry instrumentation depth, multi-service operations.
Track 3 - Streaming Pipeline: Kafka-Style Ingest with Replay¶
Goal¶
A single-broker (stretch: multi-broker) message-streaming system with a producer API, a consumer API, durable segmented log storage, consumer groups, and at-least-once delivery with replay from offset.
Required reading¶
- Kafka documentation: the log abstraction, segments, indexes, consumer groups, offset commits.
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction.
- LMAX Disruptor paper (for the in-broker hot path inspiration).
Core scope (must-have)¶
- A broker process with TCP wire protocol (custom, or simplified Kafka-compatible).
- Topics with partitions; each partition is a segmented append-only log on disk.
- Producer API: send batched, acknowledged messages.
- Consumer API with offset commit (manual + auto).
- Consumer groups with partition rebalancing on join/leave.
- At-least-once delivery: bounded duplication on consumer restart; no loss.
- Replay from arbitrary offset.
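The simplest rebalancing strategy for the consumer-group requirement above is range assignment: sort the members, then hand out contiguous partition ranges, giving the earliest members one extra partition when the count doesn't divide evenly (this is the idea behind Kafka's RangeAssignor, simplified to a single topic). A sketch:

```python
def range_assign(partitions: int, members: list[str]) -> dict[str, list[int]]:
    """Contiguous range assignment of partition ids to group members.
    Sorting makes the result deterministic across rebalances with the
    same membership."""
    members = sorted(members)
    per, extra = divmod(partitions, len(members))
    assignment, start = {}, 0
    for i, m in enumerate(members):
        count = per + (1 if i < extra else 0)
        assignment[m] = list(range(start, start + count))
        start += count
    return assignment
```

Rerunning this on every join/leave, then diffing old vs. new assignments, gives you the revoke/assign pairs the rebalance protocol must deliver.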
Stretch scope¶
- Multi-broker with leader-per-partition (effectively, Raft per partition - combines with Track 1 ideas).
- Stream processing API (map / filter / window / aggregate) on top of the broker.
- Compaction (latest-value-per-key) topics.
- Schema registry integration (Avro / protobuf).
Failure scenarios to test¶
- Kill a consumer mid-batch. Restart; consumer resumes from last committed offset; no loss; bounded duplication.
- Broker crash. Replay log; producer's unacknowledged sends retried; consumer's view of committed offsets preserved.
- Slow consumer. Lag grows; producer is not blocked (broker-side buffering is bounded; producers acknowledge at-send-time, not at-consumer-receipt).
- Disk full. Backpressure to producers; no corruption.
Why this track¶
Forces you into: performance-sensitive file I/O (segment management, mmap or FileChannel), virtual threads for many concurrent connections, careful JMM reasoning on the hot path, JMH-driven optimization, backpressure design, JFR-driven tuning under sustained load.
Track-Independent Defense Checklist¶
By end of Month 6, regardless of track, you have:
- A public repo with CI green.
- A README that lets a stranger build and run the system in five commands.
- A design doc (3–8 pages) explaining choices and rejected alternatives.
- A runbook (`RUNBOOK.md`) with the top 5 alerts and mitigation steps.
- A postmortem-style writeup (5–15 pages): what you built, what surprised you, what you would do differently.
- A demo script (10–20 min) walking through happy path + at least one chaos scenario, with live dashboards.
- JMH results for at least one critical operation, with notes on what limits throughput.
- A JFR recording from a steady-state load run, committed in `docs/perf/`, with annotations on what to look at.
- A `hardening/` checklist from `APPENDIX_A`, ticked through.
If you finish early¶
Pick one and do it:
- Submit a tiny upstream patch to OpenJDK (see APPENDIX_C) using something you learned during the capstone.
- Write a blog post on one specific surprise from your work (e.g., "what I learned about virtual-thread pinning building a Raft implementation").
- Port the project to GraalVM native-image and document the trade-offs.
- Migrate one component from blocking to reactive (or vice versa) and JMH the difference.
The capstone is the end of the curriculum, not the end of the work. Treat it as the launch pad.
Kubernetes¶
Pick one. The work performed here is what you describe in interviews.
Track 1-Hard Way: A Production-Grade Cluster From Scratch¶
Outcome: a multi-node Kubernetes cluster brought up on bare metal or cloud VMs, with mTLS-everywhere, fine-grained RBAC, multi-tenancy, GitOps-managed workloads, and a documented operational runbook.
Functional spec¶
- 3 control-plane nodes + 3 workers (minimum). HAProxy or cloud LB in front of the apiservers.
- etcd with mTLS, encryption-at-rest, scheduled snapshots to off-cluster storage.
- CNI: Cilium with kube-proxy replacement, Hubble enabled.
- Service mesh: Istio (ambient) or Linkerd, mTLS strict between services.
- Storage: a real CSI driver (local-path for dev; OpenEBS / Longhorn / cloud CSI for "real").
- Observability: Prometheus + Grafana + Loki + Tempo + OTel Collector.
- GitOps: ArgoCD or Flux managing platform addons.
- Policy: Pod Security `restricted`, NetworkPolicy default-deny, Kyverno or Gatekeeper enforcing org rules.
- Backup: Velero scheduled, restore tested.
Non-functional spec¶
- CIS benchmark green (`kube-bench` ≥90% pass).
- Cluster rebuild from scratch in <2 hours via Ansible/Terraform.
- Zero-downtime kubelet upgrades (drain + replace pattern).
- A demo: kill a control-plane node; cluster recovers without intervention.
Acceptance¶
- Public repo with provisioning playbooks and runbooks.
- A 30-minute screencast walking the assessor through bring-up, an incident drill, and a tenant onboarding.
- A `RUNBOOK.md` covering: cluster provisioning, node addition/removal, etcd backup/restore, certificate rotation, the upgrade procedure, and the top 5 incident types with remediation.
Skills exercised¶
- All months, but Months 1, 2, and 6 most heavily.
Track 2-Platform: GitOps Multi-Tenant PaaS¶
Outcome: a self-service developer platform built on Kubernetes that demonstrates onboarding a new team in <30 minutes, with policy guardrails, infra-from-code, and full observability.
Functional spec¶
- Tenant model: each tenant gets a Namespace, ResourceQuota, LimitRange, RBAC bindings, a default NetworkPolicy, monitoring scrape config, and a GitOps Application (Argo) entry, all materialized from a single tenant claim (Crossplane Composition or Helm chart).
- Self-service: developers commit a
manifest.yamlto their repo; ArgoCD/Flux picks it up; their app deploys. - Policy: Kyverno or OPA Gatekeeper enforces: image signatures, no
latesttags, mandatory labels, resource limits, no privileged Pods. - Observability: each tenant's metrics/logs/traces are isolated (via labels and Loki/Prom multi-tenancy); a per-tenant Grafana folder with default dashboards.
- Cost attribution: OpenCost emits per-tenant cost, surfaced in a dashboard.
- Crossplane: a `Database` claim that materializes a real cloud database (or, for the demo, a chart-deployed Postgres).
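The tenant model is a fan-out from one claim to many namespaced objects, which is easy to prototype as a pure function before reaching for Crossplane Compositions. A sketch; the claim schema and the simplified manifest shapes are illustrative, not Crossplane's API:

```python
def materialize_tenant(claim: dict) -> list[dict]:
    """Expand a tenant claim into the namespaced objects the platform
    owns. Returns simplified manifests (kind + metadata + partial spec);
    a real implementation emits full Kubernetes objects."""
    ns = claim["name"]
    labels = {"tenant": ns, "managed-by": "platform"}
    return [
        {"kind": "Namespace",
         "metadata": {"name": ns, "labels": labels}},
        {"kind": "ResourceQuota",
         "metadata": {"name": "quota", "namespace": ns},
         "spec": {"hard": {"limits.cpu": claim["cpu"],
                           "limits.memory": claim["memory"]}}},
        {"kind": "NetworkPolicy",  # default-deny both directions
         "metadata": {"name": "default-deny", "namespace": ns},
         "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}},
        {"kind": "RoleBinding",
         "metadata": {"name": "tenant-admin", "namespace": ns},
         "subjects": [{"kind": "Group", "name": claim["team"]}]},
    ]
```

Whatever tool renders it, keeping the claim-to-objects mapping in one reviewable place is what makes the "<30 minutes from claim PR to running app" target defensible.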
Non-functional spec¶
- Tenant onboarding: from "claim PR opened" to "deployed app reachable" in <30 minutes (target: <5).
- Failure isolation: a tenant exceeding quota does not affect other tenants.
- Compliance: every running Pod can be traced back to a git commit + signature verification.
Acceptance¶
- Public repo with platform manifests + tenant-onboarding template.
- Live demo: onboard a fresh tenant; deploy a sample app; demonstrate observability + policy denial; demonstrate quota enforcement.
- A `PLATFORM.md` describing the contract between the platform team and tenants: versioning, deprecation, support, escalation.
Skills exercised¶
- Months 3 (operators / Crossplane), 5 (GitOps + IaC + autoscaling + admission), 6 (multi-tenancy).
Track 3-Operator: Production-Quality Operator From Scratch¶
Outcome: a non-trivial operator that manages a stateful application or external system, complete with backup/restore, upgrades, observability, and a thoughtful API.
Suggested scopes¶
- `elasticsearch-mini-operator`: manage Elasticsearch clusters with auto-scaling, snapshot lifecycle, index lifecycle policies.
- `postgres-mini-operator`: automatic failover (using the Postgres replication primitives), backup/restore via WAL-G to S3, point-in-time recovery.
- `saas-resource-operator`: manage external SaaS resources via Crossplane composition (e.g., a `GitHubRepo` operator complete with branch protection, secret scanning, codeowners).
Acceptance¶
- Public repo, written with `controller-runtime` + Kubebuilder.
- CRDs versioned (v1alpha1 + v1beta1 + conversion webhook).
- Status conditions, finalizers, owner references -- all idiomatic.
- Comprehensive RBAC (least-privilege, generated from kubebuilder markers).
- Mutating + validating admission webhooks.
- E2E tests (Ginkgo + envtest, plus a kind-based suite).
- Helm chart or kustomize manifests for installation.
- Observability: Prometheus metrics, structured logs (logr), OTel traces.
- Helm-test-style upgrade tests across three operator versions.
- Documentation: design rationale, API reference, examples.
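Underneath all of these acceptance items sits the same level-triggered contract: observe the world, diff it against the desired spec, act, requeue. A framework-free sketch of that diffing step (names and state shapes are illustrative; in `controller-runtime` this logic lives inside your `Reconcile()` method):

```python
# Level-triggered reconciliation in miniature: compute the actions that
# converge observed state toward desired state. Purely illustrative.

def reconcile(desired: dict, observed: dict) -> list[str]:
    """Return the actions needed to make observed match desired."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(f"create {name}")      # missing child object
        elif observed[name] != spec:
            actions.append(f"update {name}")      # drifted child object
    for name in observed:
        if name not in desired:
            actions.append(f"delete {name}")      # orphaned child object
    return actions

acts = reconcile({"pvc": "10Gi", "sts": "3"}, {"sts": "2", "old-cm": "x"})
```

The discipline the operator sections demand (idempotence, no edge-triggered assumptions, status written only after the world changed) all follows from keeping this function pure over observed state.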
Skills exercised¶
- Months 3 (operators), 4 (storage if stateful), 5 (admission), 6 (defense).
Cross-Track Requirements¶
- `cluster-baseline/` template (Appendix A) integrated.
- ADRs (≥3).
- `THREAT_MODEL.md`.
- `RUNBOOK.md`.
- Defense readiness: a 60-minute walkthrough.
The track choice signals career direction: Track 1 for SRE/cluster-operator roles, Track 2 for platform-engineering roles, Track 3 for software-engineering-on-Kubernetes roles. Pick based on where you want the next interview loop.
Linux Kernel¶
Pick one. The work performed here is what you describe in interviews.
Track 1-Kernel Module: An Out-of-Tree LKM¶
Outcome: a non-trivial out-of-tree Linux kernel module, KUnit-tested, sparse-clean, KASAN-clean, with a clear README and a path toward upstream submission (even if you don't take it all the way).
Suggested scopes¶
- A character-device key/value store (week 21 lab, hardened). Adds: `ioctl` for batch ops, an `mmap` interface for zero-copy reads, an RCU-protected reader path.
- A netfilter hook. A small accelerator that, e.g., counts packets matching a configurable BPF filter at the netfilter ingress hook, with stats exposed via a `procfs` entry.
- A custom tracepoint suite. Add tracepoints to a subsystem of your choice (e.g., your `pkv` module from the lab) and write a `bpftrace` consumer.
Acceptance¶
- Loads cleanly on at least two LTS kernels (e.g., 6.6 and 6.12).
- KUnit tests in tree; pass on both kernels.
- KASAN, lockdep, KCSAN warnings: zero across stress-test load.
- Signed for secure boot.
- A `README.md` with build, install, and usage instructions; a `DESIGN.md` documenting locking and memory ownership.
Skills exercised¶
- Months 1 (kernel boundary), 2 (memory + scheduling internals), 6 (LKM development).
Track 2-eBPF Observability Tool¶
Outcome: a production-grade tracing tool comparable in quality to one of Brendan Gregg's BCC tools, with a proper userspace consumer, a Prometheus exporter, and CO-RE portability.
Suggested scopes¶
- `syscallat` -- system-call latency histograms, per-syscall, per-process, with low overhead. Equivalent of `bpftrace`'s `syscount` but production-quality.
- `tcptop` -- top-N connections by bytes/sec, sortable by direction. Cilium's Hubble has equivalents; do this from scratch.
- A profiler-like tool that, given a PID, samples on-CPU stacks at 99 Hz, aggregates with a frequency table, and exposes flamegraph data.
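For orientation, the power-of-two histogram aggregation these tools perform in-kernel (via BPF histogram maps) reduces to a few lines. This is a plain-Python illustration of the bucketing only; `log2_bucket` and the bucket convention are illustrative, not any tool's exact API.

```python
# Log2 bucketing as used by BPF latency tools: each sample lands in the
# bucket [2^(n-1), 2^n). Done in-kernel in the real tool; shown here in
# userspace purely to make the arithmetic concrete.

def log2_bucket(latency_ns: int) -> int:
    """Bucket index: 0 for 0, else floor(log2(n)) + 1."""
    return latency_ns.bit_length()

def histogram(samples: list[int]) -> dict[int, int]:
    hist: dict[int, int] = {}
    for ns in samples:
        b = log2_bucket(ns)
        hist[b] = hist.get(b, 0) + 1
    return hist

# Four latencies spanning ~300 ns to ~70 µs fall into four buckets.
h = histogram([300, 900, 1500, 70_000])
```

Keeping only bucket counters (not raw samples) is what makes the per-event overhead small enough to leave the probe always-on.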
Acceptance¶
- Implemented with `libbpf` + CO-RE.
- Userspace consumer in C or Go (using `cilium/ebpf`).
- Runs on kernels 5.10+ without recompilation.
- Verifier-clean across architectures (x86_64 + aarch64 minimum).
- Prometheus exporter with low-cardinality labels.
- A `bpftrace` equivalent for comparison; document why the production version exists.
- CPU overhead under representative load: < 1%.
Skills exercised¶
- Months 3 (eBPF), 4 (networking, if you pick `tcptop`), 6 (perf tuning).
Track 3-Self-Healing Distributed Service¶
Outcome: a small distributed service (a multi-instance HTTP API, a job runner, a metrics collector) deployed on Linux hosts with a comprehensive self-healing posture.
Suggested scopes¶
- A 3-node deployment of a small HTTP service:
- Each node is a hardened Ubuntu/Debian/Rocky host provisioned by Ansible.
- The service is systemd-managed with watchdog, full hardening directives, cgroups-v2 resource limits.
- On any node failure, the survivors continue serving (use a TCP load balancer + healthcheck, e.g., HAProxy or IPVS).
- Memory pressure (PSI > X%) triggers a soft restart of the worst offender via a cgroup-event watcher.
- Disk pressure triggers log rotation and old-data cleanup.
- A `chaos.sh` script kills random nodes; the cluster recovers without human intervention.
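The PSI-triggered restart in the list above hinges on parsing `/proc/pressure/memory`, whose `some avg10=... avg60=... avg300=... total=...` format is stable. A minimal sketch of the decision step; the threshold and the `should_restart` name are illustrative, and a real watcher would act via systemd or a cgroup event rather than return a bool:

```python
# Sketch of the PSI-driven restart decision. The line format follows
# /proc/pressure/memory; threshold and function names are illustrative.

def parse_psi_some_avg10(line: str) -> float:
    """Extract avg10 (percent of time stalled) from a PSI 'some' line."""
    fields = dict(kv.split("=") for kv in line.split()[1:])
    return float(fields["avg10"])

def should_restart(psi_line: str, threshold: float = 20.0) -> bool:
    """True when recent memory pressure exceeds the configured threshold."""
    return parse_psi_some_avg10(psi_line) > threshold

line = "some avg10=33.21 avg60=12.50 avg300=4.01 total=123456"
```

Using `avg10` rather than `total` keeps the trigger responsive to current pressure instead of accumulated history.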
Acceptance¶
- Reproducible from Ansible: `ansible-playbook site.yml` brings up 3 hosts from blank Ubuntu cloud images.
- Full observability: `node_exporter`, journald, eBPF tools, Prometheus + Grafana.
- A documented threat model and CIS-aligned baseline (`lynis` score).
- A 60-minute "chaos demo": run `chaos.sh`; observe full self-healing; produce a one-page incident report from logs.
- Encryption at rest (LUKS on data volumes); TLS between nodes (`step-ca` or self-signed); auditd shipping logs off-host.
Skills exercised¶
- All months. This is the integrative track -- the right choice if you want operations-engineer breadth.
Cross-Track Requirements¶
- `host-baseline/` template integrated.
- ADRs (≥3).
- `THREAT_MODEL.md`, `RUNBOOK.md`, `RECOVERY.md`.
- Defense readiness: a 45-minute walkthrough with a peer.
Python Mastery¶
Pick one in week 21 and build incrementally through Month 6. Defend in week 24.
Every track must, by week 24, exhibit:
- `pyright --strict` clean.
- `ruff` clean with the curriculum's full rule set.
- `pytest` with ≥85% coverage and a `hypothesis` test suite.
- Structured logs, Prometheus metrics, OpenTelemetry traces, `/healthz` + `/readyz`.
- Containerized, deployed somewhere reachable, with a load test and a postmortem doc.
- A `docs/architecture.md` that another senior engineer could read in 30 minutes.
Track 1 - Production RAG Service¶
Pitch: a multi-tenant retrieval-augmented generation service over a 100k–1M-document corpus with hybrid search, reranking, streaming responses, and an honest eval harness.
Must-have:
- Ingestion pipeline: PDF/HTML/Markdown → chunks → embeddings → pgvector (or qdrant).
- Retrieval: dense + BM25 + RRF, then a cross-encoder reranker.
- Streaming SSE answers with citations linking back to source chunks.
- Per-tenant isolation (row-level filters, separate collections, or both).
- Eval harness (ragas or custom): faithfulness, answer relevance, context precision, retrieval recall@K. CI gate on regressions.
- Cost accounting per request; per-tenant rate limits; cache (prompt prefix + retrieval result).
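The RRF step in the retrieval item above is small enough to state exactly. This is a generic sketch, not any library's API; `k = 60` is the conventional constant from the original RRF formulation, and doc IDs here are placeholders.

```python
# Reciprocal-rank fusion: combine the dense and BM25 ranked lists into
# one ranking before the cross-encoder reranker sees it.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists; a doc scores 1/(k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Docs ranked highly by *both* retrievers float to the top.
fused = rrf([["d1", "d2", "d3"], ["d2", "d1", "d4"]])
```

RRF needs no score calibration between retrievers (only ranks), which is exactly why it is the standard glue between dense and lexical search.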
Stretch:

- Query rewriting (HyDE) and routing (small queries → small model).
- Multimodal: support image-bearing PDFs via VLM-extracted captions.
- Continuous learning: a feedback loop that promotes/demotes chunks based on user signal.
Track 2 - Agent Orchestration Platform¶
Pitch: a platform for running tool-using LLM agents reliably - with durable execution, observability, cost ceilings, and a permissions model.
Must-have:
- Agent definitions as Pydantic schemas: tools, system prompt, model, caps (turns, tokens, cost, wall-time).
- Durable execution: state machine in Postgres; recover after process kill.
- Tool sandbox: at minimum, an e2b-or-Docker-isolated bash tool with allowlist.
- Permissions model: per-agent, per-tenant tool access. Audit log.
- Observability: per-step spans, full trace replay in tests.
- Kill switch: a feature flag that immediately halts execution. End-to-end test for it.
- Replay testing: saved traces become regression tests.
Stretch:

- Multi-agent orchestration (orchestrator + workers).
- Evaluator-optimizer loops with automated prompt revision.
- A small UI (Streamlit or Next.js) for inspecting runs.
Track 3 - Training & Serving Pipeline¶
Pitch: fine-tune a small open-weights model with LoRA, evaluate it, serve it with vLLM behind a FastAPI gateway, with autoscaling and continuous eval.
Must-have:
- Dataset prep: HuggingFace datasets, schema validation with Pydantic, dedup, deterministic train/val/test split with hash-based assignment.
- LoRA fine-tune (peft + trl) on a 7B–8B base. Document VRAM math.
- Offline eval: at minimum, a held-out set with task-appropriate metrics; ideally lm-eval-harness on relevant subsets.
- Serve: vLLM behind FastAPI gateway. Streaming, batching, structured output.
- Routing: cheap queries → small model; complex → large; A/B harness.
- Continuous eval: daily replay of N production samples (PII-scrubbed) against the new checkpoint; block promotion on regression.
- Rollout: shadow → canary 1% → 10% → 50% → 100% via feature flag.
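The deterministic split called for above works by making an example's split a pure function of a stable key, so re-running ingestion never moves a sample between train/val/test. A sketch with illustrative 80/10/10 boundaries and a `seed` salt (bump it to reshuffle deliberately):

```python
# Hash-based deterministic train/val/test assignment. Boundaries and the
# seed string are illustrative; the key must be stable per example.
import hashlib

def assign_split(example_id: str, seed: str = "v1") -> str:
    h = hashlib.sha256(f"{seed}:{example_id}".encode()).digest()
    bucket = int.from_bytes(h[:8], "big") % 100  # uniform in [0, 100)
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "test"

splits = {assign_split(f"doc-{i}") for i in range(1000)}
```

Because assignment depends only on the ID, near-duplicate documents with different IDs can still straddle the split -- which is why dedup must run *before* this step, as the list above orders it.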
Stretch:

- DPO / KTO post-training on preference data.
- Quantization (GPTQ/AWQ) and a serving comparison.
- Multi-GPU serving with tensor parallelism.
Defense (Week 24)¶
Each track defends the same five reviews:
- Architecture review - whiteboard, defend each component.
- Performance review - flame graphs, throughput, p50/p95/p99.
- Eval review - harness, regressions caught, rollout policy.
- Cost review - $/request, $/user, projected $/month at 10x.
- Failure-mode review - provider outage, vector DB down, OOM, runaway agent, prompt injection, tokenizer mismatch.
The bar: every question gets a substantive answer without hand-waving. That is the senior signal.
Rust Mastery¶
The Month 6 capstone is the deliverable that converts this curriculum from study into evidence. Pick one track. The work performed here is the work you describe in interviews and link from a portfolio.
Track 1-Compiler / Tooling¶
Outcome: a merged PR (or one in advanced review) against rust-lang/rust, rust-clippy, rust-analyzer, or cargo.
Suggested scopes (ranked by tractability)¶
- Diagnostic improvement in `rustc`. Pick an `A-diagnostics` issue with a clear reproduction. Improve the error: more accurate span, structured suggestion, better wording. Realistic effort: 20–40 hours including bootstrap, review iterations, UI test churn.
- New clippy lint. The clippy issue tracker maintains a queue of "lint requests." Pick one tagged `good-first-issue`. Implement, test (UI tests + dogfood the lint against the rustc tree), document. Realistic effort: 30–60 hours.
- A new MIR optimization pass (advanced). Choose a narrow, well-bounded transform -- e.g., a peephole simplification of a specific MIR pattern. Profile its impact with `rustc-perf`. Realistic effort: 60–120 hours and substantial reviewer hand-holding; treat as stretch.
- `rust-analyzer` feature. Implement a code action or completion improvement. RA's architecture is extremely well-documented; the PR loop is fast.
Acceptance criteria¶
- A PR exists, is linked from your portfolio, and has at least one round of review feedback addressed.
- A short write-up (`CAPSTONE_NOTES.md`) documenting: what you changed, why, what you learned about the compiler internals, and what reviewers pushed back on.
- Your local fork has a working stage-1 build of rustc with your patch applied.
Skills exercised¶
- Months 4 (macros / unsafe), 6.23 (compiler internals).
- The hardening discipline matters less here; the deliverable is upstream code, not a service.
Track 2-High-Performance Fintech: Limit-Order-Book Matching Engine¶
Outcome: a benchmarked, fuzz-tested matching engine for limit and market orders across multiple symbols, single-process, with sub-microsecond p99 hot-path latency on commodity x86_64.
Functional spec¶
- Order types: limit (GTC, IOC, FOK), market, cancel, modify.
- Matching policy: price-time priority. Partial fills allowed. Self-trade prevention configurable.
- Multi-symbol: an `Engine` owns N independent symbol books; symbols may be sharded across worker threads.
- Wire format: a binary protocol (your design or a subset of FIX/SBE).
- Output: an event stream (`Filled`, `PartiallyFilled`, `Cancelled`, `Rejected`, `BookUpdate`) consumed by downstream feed handlers.
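Before committing to a Rust data layout, it pays to pin down the price-time priority semantics in something disposable. The sketch below is deliberately naive Python -- one side of one book, limit orders only, names illustrative -- but it fixes the two invariants the property tests later check: best price first, FIFO within a price level, partial fills rest on the book.

```python
# Price-time priority matching, semantics-only sketch (ask side).
from collections import deque

class Book:
    def __init__(self):
        self.asks: dict[int, deque] = {}   # price -> FIFO of [order_id, qty]

    def add_ask(self, order_id: str, price: int, qty: int):
        self.asks.setdefault(price, deque()).append([order_id, qty])

    def match_buy(self, price: int, qty: int) -> list[tuple[str, int]]:
        """Match an incoming buy limit: lowest ask first, FIFO per level."""
        fills = []
        for level in sorted(self.asks):        # price priority
            if level > price or qty == 0:
                break
            q = self.asks[level]
            while q and qty:                   # time priority within the level
                oid, avail = q[0]
                take = min(avail, qty)
                fills.append((oid, take))
                qty -= take
                if take == avail:
                    q.popleft()                # fully filled: remove
                else:
                    q[0][1] -= take            # partial fill rests on book
            if not q:
                del self.asks[level]
        return fills

book = Book()
book.add_ask("a1", 100, 5)
book.add_ask("a2", 100, 5)
book.add_ask("a3", 101, 5)
fills = book.match_buy(price=101, qty=8)       # fills a1 fully, a2 partially
```

The "total quantity preserved across fills" property from the test-rigor section falls directly out of this loop: every unit leaves `qty` exactly once, into either a fill or a resting order.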
Non-functional spec¶
- Latency: p50 < 200 ns, p99 < 1 µs for the hot path (order in → match → event out), measured under sustained 1 M orders/sec.
- Throughput: ≥ 1 M orders/sec sustained on a single symbol on a single core.
- Determinism: identical inputs produce identical event sequences. No `HashMap` iteration order in the hot path; use deterministic structures.
- Fault tolerance: panic-safe (`panic = "abort"` is acceptable; document operational implications). Persistent log for replay.
Architecture sketch¶
- Hot path is single-threaded per symbol. SPSC ring buffer on input, SPSC on output. Cross-thread coordination only at session boundaries.
- Order book: paired sorted structures (often `BTreeMap<Price, OrderQueue>` for asks, mirrored for bids). For ultimate latency: array-of-price-levels with a sparse bitmap; this is the rabbit hole Aeron-style designs go down.
- Allocator: `mimalloc` global, plus per-symbol bump arenas for short-lived order metadata.
- Memory layout: `#[repr(C)]` orders, padded to cache-line boundaries; pre-allocated slabs.
- No `async` on the hot path. Async is for session/admin paths only. Mixing the two is the most common architectural mistake in this space.
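The SPSC rings in the sketch share one index discipline worth internalizing: monotonically increasing head/tail counters, capacity a power of two, masking instead of modulo. Python cannot model the atomics and memory ordering that make this safe across threads in Rust (there, the `head` store is a Release and the reader's load an Acquire), so the sketch below shows only the arithmetic; the class and its API are illustrative.

```python
# Index arithmetic of a bounded SPSC ring buffer. Counters never wrap
# logically; only the masked index wraps into the backing array.

class SpscRing:
    def __init__(self, capacity: int):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.buf = [None] * capacity
        self.mask = capacity - 1
        self.head = 0   # written by the producer only
        self.tail = 0   # written by the consumer only

    def push(self, item) -> bool:
        if self.head - self.tail == len(self.buf):    # ring full
            return False
        self.buf[self.head & self.mask] = item
        self.head += 1        # in Rust: Release store publishes the slot
        return True

    def pop(self):
        if self.head == self.tail:                    # ring empty
            return None
        item = self.buf[self.tail & self.mask]
        self.tail += 1        # in Rust: paired Acquire/Release
        return item

r = SpscRing(4)
for i in range(5):
    r.push(i)     # the fifth push fails: the ring holds 4
```

Because each counter has exactly one writer, the full/empty checks need no CAS -- the property that makes SPSC the right primitive for the per-symbol hot path.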
Test rigor¶
- Property tests: `proptest` invariants on the book -- total quantity preserved across fills, no cross of bid > ask except during a match, time priority preserved at a price level.
- Fuzz: `cargo-fuzz` with `arbitrary`-derived order generators. The corpus must include high-volume sequences with cancels and modifies.
- Loom: any cross-thread sync (admin → engine, engine → publisher) must be Loom-verified.
- Bench: `criterion` with regression detection in CI; flamegraphs committed; `perf stat` outputs (cycles, instructions, IPC, cache-misses) tracked over time.
Hardening pass¶
- Fat LTO, `codegen-units = 1`, `panic = "abort"`, `target-cpu=native` (for the deployed-on-known-hardware case).
- PGO with a representative replay workload.
- BOLT post-link.
- Deterministic build via Docker pinned to a content hash.
Acceptance criteria¶
- Public repo with the above.
- A README that includes a flamegraph, a `perf stat` table, and a latency CDF.
- A `THREAT_MODEL.md` covering the inputs you do and do not validate.
Skills exercised¶
- Months 3 (concurrency), 4 (unsafe / FFI for the wire codec), 5 (production architecture, though the hot path skips most of the hexagonal layering), 6.21 (custom data structures), 6.22 (allocators).
Track 3-Kernel: A rust-for-linux Character Device¶
Outcome: a working out-of-tree Rust kernel module implementing a non-trivial character device, with KUnit tests, building cleanly against a recent mainline kernel.
Functional spec¶
- A character device (`/dev/<yourname>`) that exposes an in-kernel ring buffer.
- Operations: `read` (drains the ring), `write` (appends), `ioctl` for resize/clear/stats, `mmap` for zero-copy access (stretch).
- Multi-reader / multi-writer with appropriate kernel synchronization (`SpinLock`, `Mutex` from the `kernel` crate, not `std`).
- Sysfs entries for runtime tuning.
Why this scope¶
- Touches every cross-FFI surface: char device registration, file operations, `copy_from_user`/`copy_to_user`, sysfs, locking.
- Forces you to read kernel-side Rust idioms (`Box::try_new`, fallible alloc, `Pin<&mut Self>` everywhere, `Arc` equivalents).
- The `rust-for-linux` toolchain itself is a learning surface: pinned rustc, custom libcore subset, no `std`.
Build environment¶
- Linux ≥ 6.8 (Rust support is stable enough for out-of-tree work).
- `rustup toolchain link kernel <path>` to point at the kernel-supported rustc.
- A local kernel build with `CONFIG_RUST=y`, `CONFIG_SAMPLES_RUST=y`.
Test rigor¶
- KUnit-based unit tests inside the module.
- `selftest`-style scripts running the device through real read/write/ioctl from userspace.
- Stress test: N concurrent readers and writers with `taskset` pinning; watch for `KASAN`/`KCSAN` reports.
Hardening pass¶
- Kernel-side: `KASAN` (kernel address sanitizer), `KCSAN` (concurrency sanitizer), `lockdep` enabled in your test kernel.
- Module-side: every `unsafe` block carries a `// SAFETY:` comment justifying the kernel invariants.
- A `dmesg`-clean run on insertion, exercise, and removal.
Acceptance criteria¶
- The module builds, loads, exercises end-to-end, and unloads with no `KASAN`/`KCSAN`/`lockdep` warnings.
- A PR-ready patch series formatted with `git format-patch` (even if not submitted upstream).
- A `KERNEL_NOTES.md` describing the locking model, the failure modes you considered, and the explicit reason you chose `SpinLock` vs `Mutex` at each site.
Skills exercised¶
- Months 4 (unsafe + FFI to the kernel C API), 6.22 (`no_std`), 6.23 (compiler internals indirectly via the pinned toolchain).
Cross-Track Requirements¶
Regardless of track:
- Hardening workspace integrated. The `hardening/` template from Appendix A applies.
- Architectural Decision Records (ADRs). At least three for the capstone, each ~1 page.
- Threat model. One page minimum, no matter the track.
- Defense readiness. You should be able to walk a reviewer through the code in 45 minutes and answer "what fails first under load / fuzzing / a malicious input / a pathological kernel state?"
The track choice signals career direction: compiler track for tooling/PL roles, fintech for HFT/exchange/crypto roles, kernel for OS/embedded/security roles. Do not pick based on what looks easiest; pick based on where you want the next interview loop.