Capstone Projects - Three Tracks¶
Pick one in Week 21 and build incrementally through Month 6. Defend in Week 24.
Every track must, by Week 24, exhibit:
- pyright --strict clean.
- ruff clean with the curriculum's full rule set.
- pytest with ≥85% coverage and a hypothesis test suite.
- Structured logs, Prometheus metrics, OpenTelemetry traces, /healthz+/readyz.
- Containerized, deployed somewhere reachable, with a load test and a postmortem doc.
- A docs/architecture.md that another senior engineer could read in 30 minutes.
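As a concrete reading of the /healthz vs /readyz requirement: liveness means the process is up and serving; readiness means its dependencies are usable. A minimal, framework-agnostic sketch — the check callables here are hypothetical stand-ins for real Postgres or vector-store probes:

```python
from collections.abc import Callable

# Hypothetical dependency probes; real ones would ping Postgres, the
# vector store, the model provider, etc. True means "usable".
Check = Callable[[], bool]


def healthz() -> tuple[int, dict[str, str]]:
    """Liveness: the process is running. Deliberately checks nothing else,
    so a flaky dependency cannot get the pod restarted."""
    return 200, {"status": "ok"}


def readyz(checks: dict[str, Check]) -> tuple[int, dict[str, str]]:
    """Readiness: 503 until every dependency check passes, so the load
    balancer withholds traffic instead of routing requests that will fail."""
    results = {name: "ok" if check() else "failing" for name, check in checks.items()}
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

Wiring these into FastAPI routes is a one-liner each; the point is the split in semantics, which the failure-mode review will probe.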
Track 1 - Production RAG Service¶
Pitch: a multi-tenant retrieval-augmented generation service over a 100k–1M-document corpus with hybrid search, reranking, streaming responses, and an honest eval harness.
Must-have:
- Ingestion pipeline: PDF/HTML/Markdown → chunks → embeddings → pgvector (or qdrant).
- Retrieval: dense + BM25 + RRF, then a cross-encoder reranker.
- Streaming SSE answers with citations linking back to source chunks.
- Per-tenant isolation (row-level filters, separate collections, or both).
- Eval harness (ragas or custom): faithfulness, answer relevance, context precision, retrieval recall@K. CI gate on regressions.
- Cost accounting per request; per-tenant rate limits; cache (prompt prefix + retrieval result).
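The dense + BM25 + RRF step is small enough to sketch in full. A minimal Reciprocal Rank Fusion, assuming each retriever hands back a best-first list of doc ids; k=60 is the damping constant commonly used in the RRF literature:

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each doc by the sum of 1/(k + rank)
    over all retrievers, so a doc ranked well by either the dense or the
    BM25 list surfaces without any score normalization between them."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

Feed the fused top-N into the cross-encoder reranker; RRF decides what the reranker sees, the reranker decides the final order.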
Stretch:
- Query rewriting (HyDE) and routing (small queries → small model).
- Multimodal: support image-bearing PDFs via VLM-extracted captions.
- Continuous learning: a feedback loop that promotes/demotes chunks based on user signal.

Track 2 - Agent Orchestration Platform¶
Pitch: a platform for running tool-using LLM agents reliably - with durable execution, observability, cost ceilings, and a permissions model.
Must-have:
- Agent definitions as Pydantic schemas: tools, system prompt, model, caps (turns, tokens, cost, wall-time).
- Durable execution: state machine in Postgres; recover after process kill.
- Tool sandbox: at minimum, a bash tool isolated in e2b or Docker, with a command allowlist.
- Permissions model: per-agent, per-tenant tool access. Audit log.
- Observability: per-step spans, full trace replay in tests.
- Kill switch: a feature flag that immediately halts execution. End-to-end test for it.
- Replay testing: saved traces become regression tests.
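The agent-definition requirement might look like the following as Pydantic v2 models. Field names and defaults here are illustrative, not prescribed — the load-bearing idea is that every cap is data the runtime enforces, not a suggestion in the prompt:

```python
from pydantic import BaseModel, Field


class AgentCaps(BaseModel):
    """Hard ceilings enforced by the orchestrator, not by the model."""
    max_turns: int = Field(default=20, gt=0)
    max_tokens: int = Field(default=100_000, gt=0)
    max_cost_usd: float = Field(default=5.0, gt=0)
    max_wall_time_s: float = Field(default=600.0, gt=0)


class AgentDef(BaseModel):
    """Declarative agent definition; execution state lives in Postgres."""
    name: str
    model: str
    system_prompt: str
    tools: list[str]  # names resolved against the per-tenant allowlist
    caps: AgentCaps = AgentCaps()
```

Because the definition is a schema rather than code, it can be versioned, diffed in review, and validated at load time — and the audit log can record exactly which definition a run executed under.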
Stretch:
- Multi-agent orchestration (orchestrator + workers).
- Evaluator-optimizer loops with automated prompt revision.
- A small UI (Streamlit or Next.js) for inspecting runs.
Track 3 - Training & Serving Pipeline¶
Pitch: fine-tune a small open-weights model with LoRA, evaluate it, serve it with vLLM behind a FastAPI gateway, with autoscaling and continuous eval.
Must-have:
- Dataset prep: HuggingFace datasets, schema validation with Pydantic, dedup, deterministic train/val/test split with hash-based assignment.
- LoRA fine-tune (peft + trl) on a 7B–8B base. Document VRAM math.
- Offline eval: at minimum, a held-out set with task-appropriate metrics; ideally lm-eval-harness on relevant subsets.
- Serve: vLLM behind FastAPI gateway. Streaming, batching, structured output.
- Routing: cheap queries → small model; complex → large; A/B harness.
- Continuous eval: daily replay of N production samples (PII-scrubbed) against the new checkpoint; block promotion on regression.
- Rollout: shadow → canary 1% → 10% → 50% → 100% via feature flag.
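The hash-based split assignment above can be sketched as: hash a stable example id, reduce it to a bucket in [0, 100), and map bucket ranges to splits. A minimal version:

```python
import hashlib


def assign_split(example_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministic train/val/test assignment. Hashing a stable id (not a
    row index) means membership never changes when the dataset grows, is
    deduped, or is reshuffled - no example silently migrates into train."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"
```

Deriving the id from canonicalized content (rather than a filename or row number) also pushes near-duplicates into the same split, which is what keeps the offline eval honest.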
Stretch:
- DPO / KTO post-training on preference data.
- Quantization (GPTQ/AWQ) and a serving comparison.
- Multi-GPU serving with tensor parallelism.
Defense (Week 24)¶
Each track faces the same five reviews:
- Architecture review - whiteboard, defend each component.
- Performance review - flame graphs, throughput, p50/p95/p99.
- Eval review - harness, regressions caught, rollout policy.
- Cost review - $/request, $/user, projected $/month at 10x.
- Failure-mode review - provider outage, vector DB down, OOM, runaway agent, prompt injection, tokenizer mismatch.
The bar: every question gets a substantive answer without hand-waving. That is the senior signal.