Capstone Projects - Three Tracks

Pick one track in Week 21, build it incrementally through Month 6, and defend it in Week 24.

Every track must, by Week 24, exhibit:

- pyright --strict clean.
- ruff clean with the curriculum's full rule set.
- pytest with ≥85% coverage and a hypothesis test suite.
- Structured logs, Prometheus metrics, OpenTelemetry traces, /healthz + /readyz endpoints (a sketch of the health endpoints follows this list).
- Containerized, deployed somewhere reachable, with a load test and a postmortem doc.
- A docs/architecture.md that another senior engineer could read in 30 minutes.
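
A minimal sketch of the liveness/readiness split, assuming FastAPI; `dependencies_ready` is a placeholder you would wire to real probes (DB ping, vector store, model endpoint):

```python
from fastapi import FastAPI, Response, status

app = FastAPI()

async def dependencies_ready() -> bool:
    """Placeholder probe - replace with real checks (DB `SELECT 1`,
    vector store ping, model endpoint reachable)."""
    return True

@app.get("/healthz")
async def healthz() -> dict[str, str]:
    # Liveness: the process is up. Never touch dependencies here, or a
    # flaky downstream will get the service restarted for no reason.
    return {"status": "ok"}

@app.get("/readyz")
async def readyz(response: Response) -> dict[str, str]:
    # Readiness: hard dependencies are reachable. Fail with 503 so the
    # load balancer drains traffic before requests start erroring.
    if not await dependencies_ready():
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
        return {"status": "unready"}
    return {"status": "ok"}
```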


Track 1 - Production RAG Service

Pitch: a multi-tenant retrieval-augmented generation service over a 100k–1M-document corpus with hybrid search, reranking, streaming responses, and an honest eval harness.

Must-have:

- Ingestion pipeline: PDF/HTML/Markdown → chunks → embeddings → pgvector (or qdrant).
- Retrieval: dense + BM25, fused with RRF, then a cross-encoder reranker (the fusion step is sketched after this list).
- Streaming SSE answers with citations linking back to source chunks.
- Per-tenant isolation (row-level filters, separate collections, or both).
- Eval harness (ragas or custom): faithfulness, answer relevance, context precision, retrieval recall@K. CI gate on regressions.
- Cost accounting per request; per-tenant rate limits; cache (prompt prefix + retrieval result).
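
A minimal sketch of the fusion step, assuming Reciprocal Rank Fusion over ranked lists of chunk IDs; which stores produce those lists doesn't matter here:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked lists of chunk IDs.

    `rankings` is e.g. [dense_results, bm25_results]. k=60 is the
    constant from the original RRF paper and rarely needs tuning.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document ranked highly by any retriever accumulates score;
            # agreement across retrievers compounds.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# Usage: fused = rrf_fuse([dense_ids, bm25_ids])[:50], then hand the
# top slice to the cross-encoder reranker.
```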

Stretch:

- Query rewriting (HyDE) and routing (small queries → small model); a HyDE sketch follows this list.
- Multimodal: support image-bearing PDFs via VLM-extracted captions.
- Continuous learning: a feedback loop that promotes/demotes chunks based on user signal.
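
A minimal sketch of HyDE, with `generate` and `embed` passed in as placeholders for your completion and embedding calls:

```python
from collections.abc import Callable

def hyde_embedding(
    question: str,
    generate: Callable[[str], str],        # your LLM completion call
    embed: Callable[[str], list[float]],   # your embedding model
) -> list[float]:
    # HyDE: ask the model for a hypothetical answer passage, then embed
    # that passage. Its vector tends to sit closer to real answer chunks
    # than the raw question's vector does, improving dense recall.
    prompt = (
        "Write a short passage that plausibly answers this question.\n"
        f"Question: {question}\nPassage:"
    )
    return embed(generate(prompt))
```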


Track 2 - Agent Orchestration Platform

Pitch: a platform for running tool-using LLM agents reliably - with durable execution, observability, cost ceilings, and a permissions model.

Must-have:

- Agent definitions as Pydantic schemas: tools, system prompt, model, caps (turns, tokens, cost, wall-time); a schema sketch follows this list.
- Durable execution: state machine in Postgres; recover after a process kill.
- Tool sandbox: at minimum, an e2b- or Docker-isolated bash tool with an allowlist.
- Permissions model: per-agent, per-tenant tool access. Audit log.
- Observability: per-step spans, full trace replay in tests.
- Kill switch: a feature flag that immediately halts execution, with an end-to-end test for it.
- Replay testing: saved traces become regression tests.
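
A minimal sketch of the definition schema, assuming Pydantic v2; field names and defaults here are illustrative, not prescribed:

```python
from pydantic import BaseModel, Field

class AgentCaps(BaseModel):
    # Hard ceilings; the runtime aborts the run when any one is hit.
    max_turns: int = Field(default=20, gt=0)
    max_tokens: int = Field(default=100_000, gt=0)
    max_cost_usd: float = Field(default=5.0, gt=0)
    max_wall_time_s: float = Field(default=300.0, gt=0)

class AgentDefinition(BaseModel):
    name: str
    model: str                                       # provider model id
    system_prompt: str
    tools: list[str] = Field(default_factory=list)   # names in your tool registry
    caps: AgentCaps = Field(default_factory=AgentCaps)
```

Loading definitions from checked-in YAML/JSON via `AgentDefinition.model_validate(...)` keeps agent configs versioned and validated like any other code.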

Stretch:

- Multi-agent orchestration (orchestrator + workers).
- Evaluator-optimizer loops with automated prompt revision (a loop sketch follows this list).
- A small UI (Streamlit or Next.js) for inspecting runs.
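
One possible shape for the evaluator-optimizer loop, with the model calls injected as plain callables so the loop itself stays testable:

```python
from collections.abc import Callable

def evaluator_optimizer(
    task: str,
    generate: Callable[[str, str], str],                 # (system_prompt, task) -> output
    evaluate: Callable[[str, str], tuple[float, str]],   # (task, output) -> (score, critique)
    revise: Callable[[str, str], str],                   # (system_prompt, critique) -> new prompt
    system_prompt: str,
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> str:
    # Generate, score, and revise the prompt from the critique until the
    # output clears the bar or the round cap is hit. The cap doubles as a
    # cost ceiling for the loop.
    output = generate(system_prompt, task)
    for _ in range(max_rounds):
        score, critique = evaluate(task, output)
        if score >= threshold:
            break
        system_prompt = revise(system_prompt, critique)
        output = generate(system_prompt, task)
    return output
```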


Track 3 - Training & Serving Pipeline

Pitch: fine-tune a small open-weights model with LoRA, evaluate it, serve it with vLLM behind a FastAPI gateway, with autoscaling and continuous eval.

Must-have:

- Dataset prep: HuggingFace datasets, schema validation with Pydantic, dedup, deterministic train/val/test split with hash-based assignment (sketched after this list).
- LoRA fine-tune (peft + trl) on a 7B–8B base. Document the VRAM math.
- Offline eval: at minimum, a held-out set with task-appropriate metrics; ideally lm-eval-harness on relevant subsets.
- Serve: vLLM behind a FastAPI gateway. Streaming, batching, structured output.
- Routing: cheap queries → small model; complex queries → large model; an A/B harness.
- Continuous eval: daily replay of N production samples (PII-scrubbed) against the new checkpoint; block promotion on regression.
- Rollout: shadow → canary 1% → 10% → 50% → 100% via feature flag.
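
A minimal sketch of hash-based assignment: split membership depends only on a stable example ID, so it survives re-runs, shuffles, and dataset growth:

```python
import hashlib

def assign_split(example_id: str, val_pct: float = 0.05, test_pct: float = 0.05) -> str:
    """Deterministic split assignment: the same ID always lands in the
    same split, across runs and machines, because assignment depends
    only on a stable hash of the ID - never on random state or order."""
    digest = hashlib.sha256(example_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

# Usage with HuggingFace datasets, assuming an `id` column:
# ds = ds.map(lambda ex: {"split": assign_split(ex["id"])})
```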

Stretch:

- DPO / KTO post-training on preference data.
- Quantization (GPTQ/AWQ) and a serving comparison.
- Multi-GPU serving with tensor parallelism.


Defense (Week 24)

Every track goes through the same five reviews:

  1. Architecture review - whiteboard, defend each component.
  2. Performance review - flame graphs, throughput, p50/p95/p99.
  3. Eval review - harness, regressions caught, rollout policy.
  4. Cost review - $/request, $/user, projected $/month at 10x.
  5. Failure-mode review - provider outage, vector DB down, OOM, runaway agent, prompt injection, tokenizer mismatch.

The bar: every question gets a substantive answer without hand-waving. That is the senior signal.
