
Week 24 - Capstone Integration & Defense

24.1 Conceptual Core

The final week is integration, not new content. Bring your chosen capstone (see CAPSTONE_PROJECTS.md) to defensible quality.

24.2 Final Hardening Checklist

  • Reproducible training/inference runs: pinned PyTorch/CUDA/driver versions, seed everything, document determinism guarantees.
  • Benchmarks: throughput, latency, cost, scaling efficiency. All committed.
  • Profiles: nsys and ncu reports for at least one kernel hot path; flame graphs for at least one end-to-end run.
  • Observability: GPU utilization, memory, communication overhead, and request-level metrics, all in Prometheus or an equivalent system.
  • Cost: a documented cost-per-token (inference) or cost-per-step (training).
  • Safety (if inference): input/output classification, constrained decoding, audit logging.
  • Eval: regression suite that gates merges; baseline + thresholds documented.
  • Repro environment: a Dockerfile + a make demo target that brings up the artifact end-to-end on a fresh machine.
  • Defensible decisions: ADRs (≥3) for the non-obvious choices.
  • Threat model: input attacks, infrastructure attacks, supply-chain risk.
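The benchmark item in the checklist above can be sketched as a minimal harness. `run_request` is a stand-in for your model's inference call (an assumption); a real run should also record batch size, sequence lengths, and hardware, and should synchronize the GPU before reading the clock.

```python
import statistics
import time

def benchmark(run_request, n_iters: int = 100, warmup: int = 10) -> dict:
    """Return latency percentiles (ms) and throughput (req/s)."""
    for _ in range(warmup):              # warm caches, allocators, compiled paths
        run_request()
    latencies = []
    start = time.perf_counter()
    for _ in range(n_iters):
        t0 = time.perf_counter()
        run_request()
        latencies.append((time.perf_counter() - t0) * 1e3)   # ms
    wall = time.perf_counter() - start
    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * (n_iters - 1))],
        "p99_ms": latencies[int(0.99 * (n_iters - 1))],
        "throughput_rps": n_iters / wall,
    }
```

Committing the output of a harness like this, alongside the exact config that produced it, is what makes the "All committed" requirement meaningful.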

24.3 Lab: "Defend the Design"

Schedule a 60-minute mock review (peer or recorded). Walk through:

  1. The architecture diagram.
  2. The roofline analysis: where does your system sit on the roofline? What is bound by what?
  3. One slide per non-obvious decision (e.g., "why FSDP2 over DeepSpeed ZeRO Stage 3", "why AWQ over GPTQ", "why your batching policy").
  4. A live demo of the end-to-end artifact.
  5. A live demo of one production-quality concern: cost, observability, safety, or fault tolerance.
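For the roofline step, the placement question reduces to comparing a kernel's arithmetic intensity against the machine balance (ridge point). A minimal sketch, with illustrative (roughly A100-class) peak numbers that you should replace with your GPU's measured values:

```python
PEAK_FLOPS = 312e12   # FP16 tensor-core peak, FLOP/s (assumed, illustrative)
PEAK_BW = 2.0e12      # HBM bandwidth, bytes/s (assumed, illustrative)

def roofline(flops: float, bytes_moved: float) -> dict:
    """Classify a kernel as compute- or memory-bound on the roofline."""
    intensity = flops / bytes_moved          # FLOP per byte
    ridge = PEAK_FLOPS / PEAK_BW             # machine balance, ~156 FLOP/B here
    # Attainable performance is capped by whichever roof you hit first.
    attainable = min(PEAK_FLOPS, intensity * PEAK_BW)
    return {
        "intensity": intensity,
        "ridge_point": ridge,
        "bound": "compute" if intensity >= ridge else "memory",
        "attainable_flops": attainable,
    }
```

For example, a single-token decode GEMV moves every weight byte for roughly two FLOPs, so it sits far left of the ridge point and is memory-bound; a large prefill GEMM sits right of it.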

The deliverable is the defense, not the slides. If you cannot answer:

  • "What is your worst-case tail latency under 10× concurrent load?"
  • "What happens when a GPU fails mid-training?"
  • "What is your cost per million output tokens?"
  • "How would you scale this to 10× the model size?"

...you have not yet finished the curriculum.
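The cost-per-million-tokens question has a mechanical core: fleet cost divided by aggregate output throughput. A sketch, where the GPU price and throughput are placeholder assumptions, not benchmarks:

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            num_gpus: int,
                            tokens_per_second: float) -> float:
    """Serving cost in USD per 1M output tokens for a fixed fleet."""
    fleet_hourly = gpu_hourly_usd * num_gpus          # USD/hour for all GPUs
    tokens_per_hour = tokens_per_second * 3600        # aggregate output rate
    return fleet_hourly / tokens_per_hour * 1_000_000

# e.g. 8 GPUs at $2/hr serving 20k output tok/s aggregate:
# cost_per_million_tokens(2.0, 8, 20_000) -> ~$0.22 per 1M output tokens
```

The hard part of the defense is justifying the inputs: whether `tokens_per_second` is measured at your real batch sizes, and whether idle capacity is attributed to the number.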

24.4 Production Slice

  • Tag the capstone repo v1.0.0.
  • Write a CHANGELOG.
  • Write a README aimed at the next engineer who picks it up.
  • Write a blog post (publish or shelve) explaining the most interesting technical decision. That blog post is the artifact recruiters and hiring managers actually read.

Month 6 Deliverable

The chosen capstone (per CAPSTONE_PROJECTS.md), running, defensible, observable, cost-attributed.

You are done. The next steps are no longer pedagogical; they are professional.


Further Reading

  • Designing Machine Learning Systems, Chip Huyen.
  • Building Machine Learning Powered Applications, Emmanuel Ameisen, for the production framing.
  • The KServe and KubeRay docs.
  • The OpenAI Evals documentation and Anthropic's published evals.
  • Anthropic's Constitutional AI paper (durable framing for safety design).
  • The Llama Guard / ShieldGemma papers.
