Week 24 - Capstone Integration & Defense
24.1 Conceptual Core
The final week is integration, not new content. Bring your chosen capstone (see CAPSTONE_PROJECTS.md) to defensible quality.
24.2 Final Hardening Checklist
- Reproducible training/inference runs: pinned PyTorch/CUDA/driver versions, seed everything, document determinism guarantees.
- Benchmarks: throughput, latency, cost, scaling efficiency. All committed.
- Profiles: `nsys` and `ncu` reports for at least one kernel hot path; flame graphs for at least one end-to-end run.
- Observability: GPU util, memory, communication overhead, request-level metrics, all in Prometheus or equivalent.
- Cost: a documented cost-per-token (inference) or cost-per-step (training).
- Safety (if inference): input/output classification, constrained decoding, audit logging.
- Eval: regression suite that gates merges; baseline + thresholds documented.
- Repro environment: a Dockerfile + a `make demo` target that brings up the artifact end-to-end on a fresh machine.
- Defensible decisions: ADRs (≥3) for the non-obvious choices.
- Threat model: input attacks, infrastructure attacks, supply-chain risk.
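The "regression suite that gates merges" item can be sketched as a small comparison of a current run against a documented baseline. This is a minimal illustration, not a prescribed implementation: the `Threshold` schema, the metric names, and the numbers below are all hypothetical, and it assumes non-zero baseline values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Allowed relative regression for one metric (hypothetical schema)."""
    higher_is_better: bool
    max_regression: float  # 0.02 allows a 2% regression against the baseline


def gate(baseline: dict[str, float],
         current: dict[str, float],
         thresholds: dict[str, Threshold]) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, rule in thresholds.items():
        if name not in current:
            failures.append(f"{name}: missing from current run")
            continue
        base, cur = baseline[name], current[name]
        # Relative change oriented so that a positive value means "got worse".
        # Assumes base != 0.
        if rule.higher_is_better:
            change = (base - cur) / abs(base)
        else:
            change = (cur - base) / abs(base)
        if change > rule.max_regression:
            failures.append(
                f"{name}: regressed {change:.1%} (limit {rule.max_regression:.1%})"
            )
    return failures


# Illustrative numbers only; real values come from your committed baseline.
baseline = {"exact_match": 0.80, "p99_latency_ms": 200.0}
current = {"exact_match": 0.79, "p99_latency_ms": 230.0}
thresholds = {
    "exact_match": Threshold(higher_is_better=True, max_regression=0.02),
    "p99_latency_ms": Threshold(higher_is_better=False, max_regression=0.10),
}

failures = gate(baseline, current, thresholds)
for line in failures:
    print("GATE FAIL:", line)
```

In a CI job, a non-empty `failures` list would typically translate into a non-zero exit code so the merge is blocked; keeping the baseline and thresholds in version control is what makes them "documented".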
24.3 Lab: "Defend the Design"
Schedule a 60-minute mock review (peer or recorded). Walk through:
1. The architecture diagram.
2. The roofline analysis: where does your system sit on the roofline? What's bound by what?
3. One slide per non-obvious decision (e.g., "why FSDP-2 over DeepSpeed Stage-3", "why AWQ over GPTQ", "why your batching policy").
4. A live demo of the end-to-end artifact.
5. A live demo of one production-quality concern: cost, observability, safety, or fault tolerance.
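For the roofline question in step 2, the basic placement can be computed from four numbers: a kernel's FLOP count, the bytes it moves, and the hardware's peak compute and memory bandwidth. The sketch below assumes the standard roofline model (attainable performance is the minimum of peak compute and intensity times bandwidth). The hardware peaks and kernel counts are placeholder values, not the specification of any real GPU; substitute your datasheet figures and profiler measurements.

```python
def roofline(flops: float, bytes_moved: float,
             peak_flops_per_s: float, peak_bytes_per_s: float) -> dict:
    """Place one kernel on the roofline and report which resource bounds it."""
    intensity = flops / bytes_moved              # arithmetic intensity, FLOP/byte
    ridge = peak_flops_per_s / peak_bytes_per_s  # intensity where the two roofs meet
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    return {
        "intensity": intensity,
        "ridge": ridge,
        "attainable_flops_per_s": attainable,
        "bound": "memory" if intensity < ridge else "compute",
    }


# Placeholder hardware: 100 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 2e12

# Placeholder kernel: 2 GFLOP of work while moving 200 MB.
point = roofline(flops=2e9, bytes_moved=2e8,
                 peak_flops_per_s=PEAK_FLOPS, peak_bytes_per_s=PEAK_BW)
print(point)
```

With these placeholder numbers the kernel has an intensity of 10 FLOP/byte against a ridge point of 50, so it is memory-bound and the model caps it at 20 TFLOP/s. Comparing that ceiling with the achieved throughput from your profiles is what answers "what's bound by what".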
The deliverable is the defense, not the slides. If you cannot answer:
- "What is your worst-case tail latency under 10× concurrent load?"
- "What happens when a GPU fails mid-training?"
- "What is your cost per million output tokens?"
- "How would you scale this to 10× the model size?"

...you have not yet finished the curriculum.
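The cost-per-million-output-tokens question reduces to simple arithmetic once you have a sustained throughput measurement. The sketch below assumes cost is dominated by GPU rental at a flat hourly rate and ignores CPU, storage, networking, and idle time; the function name and the example figures are hypothetical.

```python
def cost_per_million_output_tokens(gpu_hourly_usd: float,
                                   num_gpus: int,
                                   output_tokens_per_s: float) -> float:
    """USD per one million generated tokens at a sustained throughput.

    Assumes GPU rental is the only cost and that output_tokens_per_s is the
    aggregate throughput across all num_gpus GPUs.
    """
    if output_tokens_per_s <= 0:
        raise ValueError("output_tokens_per_s must be positive")
    deployment_usd_per_s = gpu_hourly_usd * num_gpus / 3600.0
    usd_per_token = deployment_usd_per_s / output_tokens_per_s
    return usd_per_token * 1_000_000


# Illustrative inputs only: one GPU at $2.00/hour sustaining 1,000 tokens/s.
example_cost = cost_per_million_output_tokens(
    gpu_hourly_usd=2.00, num_gpus=1, output_tokens_per_s=1_000.0
)
print(f"${example_cost:.4f} per million output tokens")
```

The throughput you plug in should be the one measured while meeting your latency target, since peak batch throughput usually overstates what the deployment can serve under its actual constraints.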
24.4 Production Slice
- Tag the capstone repo `v1.0.0`.
- Write a CHANGELOG.
- Write a README aimed at the next engineer who picks it up.
- Write a blog post (publish or shelve) explaining the most interesting technical decision. That blog post is the artifact recruiters and hiring managers actually read.
Month 6 Deliverable
The chosen capstone (per CAPSTONE_PROJECTS.md), running, defensible, observable, cost-attributed.
You are done. The next steps are no longer pedagogical; they are professional.
Recommended Reading Done This Month
- Designing Machine Learning Systems, Chip Huyen.
- Building Machine Learning Powered Applications, Emmanuel Ameisen, for the production framing.
- The KServe and KubeRay docs.
- The OpenAI Evals documentation and Anthropic's published evals documentation.
- Anthropic's Constitutional AI paper (durable framing for safety design).
- The Llama Guard / ShieldGemma papers.