Week 24 - Training, Serving, Rollout, and the Capstone Defense
24.1 Conceptual Core
You are unlikely to pretrain a foundation model. You will, repeatedly: (a) fine-tune with LoRA/QLoRA, (b) serve open-weights models, (c) roll out behind a gateway with shadow / canary / staged-percent traffic.
24.2 Mechanical Detail
- Fine-tuning stack: `transformers`, `peft` (LoRA/QLoRA), `trl` (SFT, DPO), `bitsandbytes` (4-bit), `accelerate` (multi-GPU), `unsloth` (faster LoRA). Datasets via `datasets` (Hugging Face).
- Serving stack: `vLLM` (PagedAttention; the default choice for throughput), `TGI`, `SGLang`, `llama.cpp`/`ollama` (for tiny/local models), `Triton Inference Server` (when you need the matrix). Quantization formats: GPTQ, AWQ, GGUF.
- Gateway shape: FastAPI in front of vLLM. Streaming passthrough. Per-tenant rate limits. Cost accounting per request. Model routing (route cheap queries to small models).
- Rollout: shadow (mirror traffic, compare), canary (1% → 10% → 50% → 100%), feature flags per cohort. Eval-on-rollout: keep the offline eval running against the canary.
- Continuous evaluation: a daily replay of N production samples (PII-scrubbed) against the new model. Block promotion on regression.
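The per-tenant rate limiting mentioned in the gateway shape can be sketched as a token bucket kept per tenant. This is a minimal in-process sketch with illustrative parameters; a production gateway would back this with Redis or similar so limits hold across replicas:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-tenant rate limiter: `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)   # start every tenant full
        self.last = defaultdict(time.monotonic)       # last refill timestamp

    def allow(self, tenant: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[tenant]
        self.last[tenant] = now
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens[tenant] = min(self.capacity, self.tokens[tenant] + elapsed * self.rate)
        if self.tokens[tenant] >= 1:
            self.tokens[tenant] -= 1
            return True
        return False
```

Each tenant drains its own bucket, so one noisy client cannot exhaust another tenant's quota.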
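The staged-percent canary (1% → 10% → 50% → 100%) can be sketched as a deterministic hash-based router. The stage percentages and the idea of bucketing on user ID are from the rollout bullet above; the function names are illustrative:

```python
import hashlib

CANARY_STAGES = [1, 10, 50, 100]  # percent of traffic sent to the new model

def cohort_bucket(user_id: str) -> int:
    """Deterministically map a user to a bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def route(user_id: str, stage: int) -> str:
    """Route a user to 'canary' or 'stable' for the given rollout stage."""
    return "canary" if cohort_bucket(user_id) < CANARY_STAGES[stage] else "stable"
```

Because the bucket is a stable hash rather than a coin flip, a user who lands on the canary at 1% stays on it as the percentage ramps, which keeps per-user comparisons clean.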
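The "block promotion on regression" rule can be made mechanical with a small gate over per-metric eval scores. A minimal sketch, assuming higher-is-better metrics and a tolerance band to absorb eval noise (both the function name and the default tolerance are illustrative):

```python
def promotion_gate(baseline: dict, candidate: dict,
                   tolerance: float = 0.02):
    """Compare candidate model eval scores against the baseline.

    Returns (promote, regressions): promote is False if any metric
    drops more than `tolerance` below the baseline, or is missing.
    """
    regressions = [
        name for name, base_score in baseline.items()
        if candidate.get(name, float("-inf")) < base_score - tolerance
    ]
    return (len(regressions) == 0, regressions)
```

Wired into CI, this runs after the daily replay of PII-scrubbed production samples and fails the pipeline with the list of regressed metrics.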
24.3 Capstone Defense
You picked a track from CAPSTONE_PROJECTS.md at the start of Month 6. You have been building it incrementally. Week 24 is the defense:
- Architecture review. Whiteboard the system. Defend each component choice.
- Performance review. `py-spy` flame graph, `vLLM` throughput numbers, end-to-end p50/p95/p99 latency.
- Eval review. Show the eval harness, the regressions caught, the rollout policy.
- Cost review. $/request, $/user, projected $/month at 10x scale.
- Failure-mode review. What happens on: provider outage, vector-DB down, OOM in worker, agent runaway, prompt injection, tokenizer mismatch.
Pass = you can answer every question without hand-waving.
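The cost-review arithmetic is simple enough to put in one function: separate variable cost (scales with traffic) from fixed cost (reserved GPUs, gateway nodes), because only the former multiplies at 10x. The numbers below are illustrative, not from the source:

```python
def monthly_cost(requests_per_day: float, cost_per_request: float,
                 fixed_monthly: float = 0.0, scale: float = 1.0) -> float:
    """Project monthly spend: variable cost scales with traffic,
    fixed cost does not, so model them separately."""
    return requests_per_day * scale * 30 * cost_per_request + fixed_monthly

# e.g. 50k req/day at $0.002/request plus $1,200/month of reserved capacity
today = monthly_cost(50_000, 0.002, fixed_monthly=1_200)              # ≈ $4,200
at_10x = monthly_cost(50_000, 0.002, fixed_monthly=1_200, scale=10)   # ≈ $31,200
```

Note that the 10x projection is not simply 10x the total: fixed costs dilute, while per-request costs may actually drop with better batching, so defend the assumptions behind both terms.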
Month-6 Exit Criteria - and the Senior Bar
A graduate of this curriculum, in a senior-AI-engineer interview loop, should be able to:
- Whiteboard a RAG service for 1M docs / 1k QPS in 30 minutes, with explicit pgvector vs. qdrant trade-offs, hybrid retrieval, reranking, eval methodology, and cost projection.
- Diagnose a production agent that's burning $1k/hr by reading traces, identifying the runaway loop, and shipping a fix with caps and a kill switch - same day.
- Fine-tune a 7B model with LoRA on a domain dataset, evaluate offline, serve with vLLM, and roll out behind a canary in under a week.
- Defend the choice not to use Python for a given component - model routing, GPU scheduler, streaming proxy - when Go or Rust is the right answer.
That last bullet is the actual signal of seniority: you have stopped being a Python advocate and started being an engineer.