Week 24 - Training, Serving, Rollout, and the Capstone Defense

24.1 Conceptual Core

You are unlikely to pretrain a foundation model. You will, repeatedly: (a) fine-tune with LoRA/QLoRA, (b) serve open-weights models, (c) roll out behind a gateway with shadow / canary / staged-percent traffic.

24.2 Mechanical Detail

  • Fine-tuning stack: transformers, peft (LoRA/QLoRA), trl (SFT, DPO), bitsandbytes (4-bit), accelerate (multi-GPU), unsloth (faster LoRA). Datasets via datasets (HuggingFace).
  • Serving stack: vLLM (PagedAttention, the default choice for throughput), TGI, SGLang, llama.cpp/ollama (for tiny / local), Triton Inference Server (when you need its full multi-framework backend and ensemble matrix). Quantization: GPTQ, AWQ, GGUF.
  • Gateway shape: FastAPI in front of vLLM. Streaming passthrough. Per-tenant rate limits. Cost accounting per request. Model routing (route cheap queries to small models).
  • Rollout: shadow (mirror traffic, compare), canary (1% → 10% → 50% → 100%), feature flags per cohort. Eval-on-rollout: keep the offline eval running against the canary.
  • Continuous evaluation: a daily replay of N production samples (PII-scrubbed) against the new model. Block promotion on regression.
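The model-routing idea above — send cheap queries to a small model — can be sketched as a heuristic router. This is a minimal sketch: the model names, keyword markers, and token threshold are illustrative assumptions, not a prescribed policy (a real router might use a trained classifier).

```python
def route_model(prompt: str, max_cheap_tokens: int = 64) -> str:
    """Pick a model tier from a crude complexity heuristic.

    Hypothetical model names -- substitute whatever you actually serve.
    """
    SMALL, LARGE = "llama-3-8b", "llama-3-70b"
    # Keywords that tend to signal multi-step reasoning (illustrative).
    hard_markers = ("prove", "step by step", "analyze", "compare")
    rough_tokens = len(prompt.split())  # whitespace split as a cheap token proxy
    if rough_tokens > max_cheap_tokens:
        return LARGE
    if any(m in prompt.lower() for m in hard_markers):
        return LARGE
    return SMALL
```

The point of the sketch is the shape — a pure function the gateway calls before dispatching to vLLM — not the specific thresholds.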
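The staged-percent canary (1% → 10% → 50% → 100%) is usually implemented as a deterministic hash of the user id rather than a random coin flip, so each user stays on the same side of the split as the percentage ramps. A minimal sketch, assuming a string user id:

```python
import hashlib

def rollout_cohort(user_id: str, canary_pct: float) -> str:
    """Deterministically assign a user to 'canary' or 'stable'.

    Hashing the id keeps the assignment stable across requests, and
    users admitted at 1% remain in the canary at 10%, 50%, and 100%.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_pct else "stable"
```

Shadow mode is the degenerate case: route everyone to stable, but also mirror the request to the candidate and compare responses offline.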
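"Block promotion on regression" reduces to a gate that compares the candidate's scores on the replayed samples against the baseline. The metric names and tolerance below are illustrative assumptions; scores are assumed higher-is-better.

```python
def promotion_gate(baseline: dict, candidate: dict,
                   tolerance: float = 0.01) -> tuple[bool, list[str]]:
    """Block promotion if any metric regresses by more than `tolerance`.

    Both dicts map metric name -> score (higher is better).
    Returns (allowed, names of regressed metrics).
    """
    failures = [
        name
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    ]
    return (not failures, failures)
```

Wire this into CI: the daily replay emits both dicts, and a `False` from the gate fails the promotion job.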

24.3 Capstone Defense

You picked a track from CAPSTONE_PROJECTS.md at the start of Month 6. You have been building it incrementally. Week 24 is the defense:

  1. Architecture review. Whiteboard the system. Defend each component choice.
  2. Performance review. py-spy flame graph, vLLM throughput numbers, end-to-end p50/p95/p99 latency.
  3. Eval review. Show the eval harness, the regressions caught, the rollout policy.
  4. Cost review. $/request, $/user, projected $/month at 10x scale.
  5. Failure-mode review. What happens on: provider outage, vector-DB down, OOM in worker, agent runaway, prompt injection, tokenizer mismatch.
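The cost-review arithmetic in step 4 is simple enough to script. The per-token prices in the usage example are made-up placeholders, not real rates:

```python
def cost_per_request(prompt_tokens: int, completion_tokens: int,
                     usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Blended $/request from input and output token counts."""
    return (prompt_tokens / 1000 * usd_per_1k_in
            + completion_tokens / 1000 * usd_per_1k_out)

def monthly_projection(requests_per_day: int, avg_cost_usd: float,
                       scale: float = 10.0) -> float:
    """Projected $/month at `scale`x current traffic (30-day month)."""
    return requests_per_day * 30 * avg_cost_usd * scale
```

For self-hosted serving, replace the per-token rates with amortized GPU-hour cost divided by measured throughput; the projection function is the same.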

Pass = you can answer every question without hand-waving.


Month-6 Exit Criteria - and the Senior Bar

A graduate of this curriculum, in a senior-AI-engineer interview loop, should be able to:

  1. Whiteboard a RAG service for 1M docs / 1k QPS in 30 minutes, with explicit pgvector vs. qdrant trade-offs, hybrid retrieval, reranking, eval methodology, and cost projection.
  2. Diagnose a production agent that's burning $1k/hr by reading traces, identifying the runaway loop, and shipping a fix with caps and a kill switch - same day.
  3. Fine-tune a 7B model with LoRA on a domain dataset, evaluate offline, serve with vLLM, and roll out behind a canary in under a week.
  4. Defend the choice not to use Python for a given component - model routing, GPU scheduler, streaming proxy - when Go or Rust is the right answer.
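The caps-and-kill-switch fix from criterion 2 can be sketched as a per-run budget object charged on every agent step. The thresholds here are illustrative, and the class-level flag stands in for whatever feature-flag or config store a production system would actually consult:

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run hits its caps or the kill switch is on."""

class RunBudget:
    """Hard caps for one agent run: step count and spend.

    kill_switch is a process-global stand-in; in production this would
    be a feature flag checked from a config store.
    """
    kill_switch = False

    def __init__(self, max_steps: int = 25, max_usd: float = 2.0):
        self.max_steps, self.max_usd = max_steps, max_usd
        self.steps, self.spent = 0, 0.0

    def charge(self, usd: float) -> None:
        """Record one step's cost; raise if any cap is breached."""
        if RunBudget.kill_switch:
            raise BudgetExceeded("kill switch engaged")
        self.steps += 1
        self.spent += usd
        if self.steps > self.max_steps or self.spent > self.max_usd:
            raise BudgetExceeded(
                f"caps hit: steps={self.steps}, spent=${self.spent:.2f}")
```

The design choice that matters: the cap check lives inside the charging path, so a runaway loop cannot skip it, and the exception terminates the run rather than merely logging.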

That last bullet is the actual signal of seniority: you have stopped being a Python advocate and started being an engineer.