# Capstone Projects: Three Tracks, One Choice
Pick one. The work performed here is what you describe in interviews and link from a portfolio.
## Track 1 - Inference Engine: A Mini-vLLM
Outcome: an LLM inference server you wrote, with paged KV-cache, continuous batching, and at least one of (FP8 weights, AWQ INT4, speculative decoding). Benchmarked within 2× of production vLLM on a 7B model.
### Functional spec
- HTTP API: `POST /v1/completions` and `POST /v1/chat/completions` (a subset of OpenAI's API).
- Server-sent-events (SSE) streaming output (a minimal endpoint sketch follows this list).
- Continuous batching with paged KV-cache.
- One quantization scheme (your choice: AWQ W4A16 with Marlin kernel, or FP8 weights via TransformerEngine).
- Optionally: prefix caching, speculative decoding.
- Health, readiness, metrics endpoints.
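
A minimal sketch of the streaming endpoint, assuming FastAPI; `generate_tokens` is a hypothetical hook into your scheduler, and the payload shape only loosely mirrors OpenAI's:

```python
# Minimal SSE streaming sketch for the completions endpoint (FastAPI).
# `generate_tokens` is a hypothetical hook into your scheduler/model runner.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128
    stream: bool = True

async def generate_tokens(prompt: str, max_tokens: int):
    # Placeholder: in the real server this yields tokens as the scheduler
    # produces them for this request.
    for tok in ["Hello", ",", " world"][:max_tokens]:
        yield tok

@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    async def event_stream():
        async for tok in generate_tokens(req.prompt, req.max_tokens):
            # One SSE event per token chunk, OpenAI-style payload shape.
            yield f"data: {json.dumps({'choices': [{'text': tok}]})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```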
### Non-functional spec
- Throughput within 2× of production vLLM for a 7B model on the same hardware. (vLLM is the bar; matching it is implausible in 24 weeks. Within 2× is achievable and impressive.)
- TTFT (time to first token) p99 < 1 s for 1K-token prompts under steady-state load.
- TPOT (time per output token) p99 < 30 ms after the first token (a measurement sketch follows this list).
- Memory: stable under 8-hour load; no leaks.
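
A rough client-side harness for these two latency metrics, assuming the SSE format sketched above; the request shape is illustrative, and percentiles over many runs under load give the p99 numbers:

```python
# Rough client-side TTFT/TPOT measurement against the streaming endpoint.
import time
import requests

def measure(url: str, prompt: str, max_tokens: int = 256):
    start = time.perf_counter()
    ttft, inter_token = None, []
    last = start
    with requests.post(url, json={"prompt": prompt, "max_tokens": max_tokens},
                       stream=True, timeout=120) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            now = time.perf_counter()
            if ttft is None:
                ttft = now - start               # time to first token
            else:
                inter_token.append(now - last)   # per-token latency (TPOT samples)
            last = now
    return ttft, inter_token
```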
### Architecture sketch
```
HTTP server (FastAPI/Axum)
        │
        ▼
Request queue ──► Scheduler (Python or Rust)
        │
        ▼
Block manager (page table, free list)
        │
        ▼
Model runner (Python/Triton/CUDA)
        │
        ▼
Token streamer
```
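
The block manager is the piece the unit tests below target. A minimal sketch of the paged KV-cache bookkeeping, with hypothetical names and no eviction policy:

```python
# Minimal paged KV-cache block manager sketch: fixed-size blocks, a free
# list, and a per-sequence page table. Names and sizes are illustrative.
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens per KV block
        self.free_list = list(range(num_blocks))
        self.page_tables = {}                   # seq_id -> [physical block ids]

    def can_allocate(self, num_blocks: int) -> bool:
        return len(self.free_list) >= num_blocks

    def allocate(self, seq_id: int, num_tokens: int) -> list[int]:
        needed = -(-num_tokens // self.block_size)   # ceiling division
        if not self.can_allocate(needed):
            raise MemoryError("no free KV blocks; caller should preempt or evict")
        blocks = [self.free_list.pop() for _ in range(needed)]
        self.page_tables.setdefault(seq_id, []).extend(blocks)
        return blocks

    def free(self, seq_id: int) -> None:
        # Return all blocks owned by a finished or preempted sequence.
        self.free_list.extend(self.page_tables.pop(seq_id, []))
```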
### Test rigor
- Unit tests for the block manager (allocate/free/leak detection).
- Integration: warmup load, sustained load, mixed-prompt-length load.
- Correctness: outputs match a reference Hugging Face implementation for greedy decoding (a comparison sketch follows this list).
- Stress: `kill -9` the process under load; restart; verify recovery.
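
A sketch of the greedy-decoding comparison; `my_server_generate` is a hypothetical client for your server, and exact string equality is a strict check (token-level comparison is a pragmatic fallback when numerics legitimately differ):

```python
# Greedy-decoding correctness check against a Hugging Face reference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def reference_greedy(model_name: str, prompt: str, max_new_tokens: int) -> str:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16).cuda()
    ids = tok(prompt, return_tensors="pt").input_ids.cuda()
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def test_greedy_matches_reference(my_server_generate, model_name, prompts):
    # `my_server_generate` is your server's client call (hypothetical signature).
    for p in prompts:
        expected = reference_greedy(model_name, p, max_new_tokens=64)
        actual = my_server_generate(p, max_tokens=64, temperature=0.0)
        assert actual == expected, f"divergence on prompt: {p!r}"
```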
### Hardening pass
- `pprof`-style metrics; an `nsys` profile of one full request lifecycle committed to the repo.
- An `ncu` profile of the attention kernel.
- Cost/quality matrix.
### Acceptance criteria
- Public repo with build + run + benchmark scripts.
- A README with: architecture diagram, benchmark table, profiling artifacts, "what's next" section.
- A blog post explaining one non-obvious decision (e.g., your block size choice, your eviction policy).
### Skills exercised
- All months. Heaviest on Months 2 (kernels), 3 (framework integration), 5 (serving).
## Track 2 - Training Systems: FSDP From Scratch
Outcome: a working sharded data-parallel training implementation, written from scratch on top of `torch.distributed` primitives. Trains a small transformer on 4–8 GPUs across 1–2 nodes with documented scaling efficiency.
### Functional spec
- A `MyFSDP` wrapper (sketched after this list) that:
  - Shards parameters across ranks.
  - All-gathers parameters before forward; frees them after.
  - Reduces gradients via reduce-scatter.
  - Supports activation checkpointing.
  - Mixed precision (BF16 compute, FP32 master).
  - Gradient accumulation.
  - Resumable checkpoints.
- A reference training script that uses it to train a ~500M parameter transformer on a tokenized corpus.
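
A minimal sketch of the core communication pattern, assuming an NCCL process group is already initialized. It operates on one flat tensor rather than per-module parameters, so it illustrates the moves `MyFSDP` makes, not the wrapper itself:

```python
# Toy sketch of the core MyFSDP moves on top of torch.distributed: shard a
# flat FP32 master copy, all-gather BF16 params for compute, reduce-scatter
# grads back to shards. Real FSDP does this per-module with autograd hooks.
import torch
import torch.distributed as dist

def shard(flat: torch.Tensor, rank: int, world: int) -> torch.Tensor:
    # Pad so the flat parameter splits evenly, then keep this rank's slice.
    padded = torch.nn.functional.pad(flat, (0, (-flat.numel()) % world))
    return padded.chunk(world)[rank].clone()

def gather_params(local_shard: torch.Tensor, world: int) -> torch.Tensor:
    # All-gather BF16 shards into the full parameter before forward.
    full = torch.empty(local_shard.numel() * world,
                       dtype=torch.bfloat16, device=local_shard.device)
    dist.all_gather_into_tensor(full, local_shard.to(torch.bfloat16))
    return full  # free (`del`) right after use to reclaim memory

def reduce_scatter_grads(full_grad: torch.Tensor, world: int) -> torch.Tensor:
    # Each rank keeps only its shard of the averaged gradient.
    local = torch.empty(full_grad.numel() // world,
                        dtype=full_grad.dtype, device=full_grad.device)
    dist.reduce_scatter_tensor(local, full_grad, op=dist.ReduceOp.AVG)
    return local
```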
### Non-functional spec
- Scaling efficiency ≥85% on 8 GPUs (single node) vs the single-GPU baseline (see the efficiency calculation after this list).
- Scaling efficiency ≥75% on 16 GPUs across 2 nodes.
- A resumed run produces identical loss (within 1e-4) to an uninterrupted run.
- Throughput within 30% of PyTorch's native FSDP-2 on the same workload.
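
Scaling efficiency here means measured N-GPU throughput relative to N copies of the single-GPU baseline; a toy calculation with made-up numbers:

```python
# Scaling efficiency as used in the targets above. All numbers are placeholders.
def scaling_efficiency(tokens_per_sec_n: float, tokens_per_sec_1: float, n_gpus: int) -> float:
    return tokens_per_sec_n / (n_gpus * tokens_per_sec_1)

print(scaling_efficiency(tokens_per_sec_n=68_000, tokens_per_sec_1=9_500, n_gpus=8))  # ≈ 0.89
```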
### Test rigor
- Numerical correctness: 4-rank `MyFSDP` matches a single-rank reference for one full training step (allclose at 1e-3 in BF16).
- Memory measurement: peak HBM matches `model_size / num_ranks + activation_overhead` (a measurement sketch follows this list).
- Failure injection: kill one rank mid-epoch; observe the NCCL timeout; document the recovery path.
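
A sketch of the peak-memory check, using the CUDA caching allocator's peak statistic as a proxy for HBM use; `step_fn` and the budget terms are placeholders:

```python
# Peak-memory check: reset allocator stats, run one step, compare against the
# expected parameter shard plus an activation allowance (both supplied by you).
import torch

def check_peak_memory(step_fn, expected_shard_bytes: int,
                      activation_budget_bytes: int) -> None:
    torch.cuda.reset_peak_memory_stats()
    step_fn()                                   # one forward/backward/optimizer step
    peak = torch.cuda.max_memory_allocated()    # allocator peak, a lower bound on HBM
    budget = expected_shard_bytes + activation_budget_bytes
    assert peak <= budget, (
        f"peak {peak / 2**30:.2f} GiB exceeds budget {budget / 2**30:.2f} GiB")
```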
### Hardening pass
- NCCL tuning (`NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, `NCCL_TOPO_FILE`) documented.
- An `nsys` profile showing all-gather/reduce-scatter overlap with compute.
- Cost calculation (GPU-hours × $/hr × scaling efficiency).
### Acceptance criteria
- Public repo with infra-as-code (Terraform/Ansible) for bringing up a 2-node cluster, plus the FSDP code, plus the training script.
- A `SCALING_REPORT.md` with the efficiency numbers, the optimization journey (each tuning step's effect), and one comparison against native FSDP-2.
### Skills exercised
- All months. Heaviest on Months 3 (framework), 4 (distributed).
## Track 3 - GPU Kernels: A Competitive Fused Attention
Outcome: a fused attention kernel in Triton (and optionally CUTLASS), competitive with FlashAttention-2 for at least one common shape regime, complete with profiling, autograd, and a tested PyTorch integration.
### Functional spec
- A Triton kernel implementing causal flash-attention (forward + backward).
- Configurable for: BF16 / FP16, head dim 64/128, GQA support.
- Drop-in replacement for `F.scaled_dot_product_attention` over the supported shape range (a dispatch sketch follows this list).
- Numerically equivalent to the reference (allclose at 1e-3 in BF16).
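
One way to present the drop-in surface is a dispatcher that routes supported shapes to the custom path and falls back to SDPA otherwise. In this sketch `triton_flash_attention` is a stub standing in for your `autograd.Function`-wrapped kernels, so the file runs as-is:

```python
# Drop-in dispatch sketch: custom Triton path for supported shapes/dtypes,
# PyTorch SDPA fallback otherwise. `triton_flash_attention` is hypothetical.
import torch
import torch.nn.functional as F

SUPPORTED_HEAD_DIMS = {64, 128}
SUPPORTED_DTYPES = {torch.bfloat16, torch.float16}

def triton_flash_attention(q, k, v, is_causal):
    # Placeholder for the real kernel path; stubbed with SDPA in this sketch.
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)

def attention(q, k, v, is_causal: bool = True):
    # q, k, v: (batch, heads, seq, head_dim); expand k/v heads upstream for GQA.
    use_custom = (q.is_cuda and q.dtype in SUPPORTED_DTYPES
                  and q.shape[-1] in SUPPORTED_HEAD_DIMS)
    if use_custom:
        return triton_flash_attention(q, k, v, is_causal)
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```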
### Non-functional spec
- Within 1.5× of FlashAttention-2's forward+backward time at one chosen shape (e.g., `B=4, H=32, S=4096, D=128`); a benchmark sketch follows this list.
- Validated on at least two GPU classes (e.g., A100 + H100 if both are accessible; A100 + RTX 4090 is acceptable).
- Compiles via `torch.compile` without graph breaks when used in a small transformer.
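
A rough benchmark at the spec's shape using `triton.testing.do_bench`; run the same harness against your kernel and against the reference, and compare the ratio:

```python
# Forward+backward timing at the spec's shape. Compare the same measurement
# for your kernel and for SDPA/FlashAttention-2; numbers below are illustrative.
import torch
import torch.nn.functional as F
from triton.testing import do_bench

B, H, S, D = 4, 32, 4096, 128
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16,
                       requires_grad=True) for _ in range(3))

def fwd_bwd(fn):
    out = fn(q, k, v)
    out.backward(torch.ones_like(out))
    q.grad = k.grad = v.grad = None

ms_ref = do_bench(lambda: fwd_bwd(
    lambda a, b, c: F.scaled_dot_product_attention(a, b, c, is_causal=True)))
print(f"reference forward+backward: {ms_ref:.2f} ms")
# Repeat with your kernel and check the ratio stays within the 1.5x target.
```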
### Test rigor
- Correctness: random-input testing against reference attention; gradient testing against `torch.autograd.gradcheck` (FP32 reference); a test sketch follows this list.
- Perf: `ncu` reports for forward and backward, attached to the repo.
- Edge cases: short sequences, variable-length (padding-aware) inputs, large batches.
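
A sketch of the forward and backward comparisons against the PyTorch reference, assuming the `attention` dispatcher sketched earlier; the spec's `torch.autograd.gradcheck` run on a small FP32 case complements these checks:

```python
# Random-input correctness checks: forward allclose against SDPA, and analytic
# gradients compared against the reference backward. Shapes are illustrative.
import torch
import torch.nn.functional as F

def test_forward_matches_reference():
    q, k, v = (torch.randn(2, 4, 256, 64, device="cuda", dtype=torch.bfloat16)
               for _ in range(3))
    ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = attention(q, k, v, is_causal=True)
    torch.testing.assert_close(out, ref, atol=1e-3, rtol=1e-3)

def test_backward_matches_reference():
    torch.manual_seed(0)
    shape = (2, 4, 256, 64)
    q1, k1, v1 = (torch.randn(shape, device="cuda", dtype=torch.bfloat16,
                              requires_grad=True) for _ in range(3))
    q2, k2, v2 = (t.detach().clone().requires_grad_(True) for t in (q1, k1, v1))
    F.scaled_dot_product_attention(q1, k1, v1, is_causal=True).sum().backward()
    attention(q2, k2, v2, is_causal=True).sum().backward()
    for ref, got in zip((q1.grad, k1.grad, v1.grad), (q2.grad, k2.grad, v2.grad)):
        torch.testing.assert_close(got, ref, atol=1e-2, rtol=1e-2)
```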
### Hardening pass
- Autotune configs documented with rationale.
- Performance-regression CI (lock benchmark numbers; alert on >5% regression).
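
A minimal shape for that regression gate: a committed baseline file plus a tolerance check in CI. The file name and numbers are illustrative:

```python
# Locked-benchmark regression check: fail CI if the measured time drifts more
# than 5% above the committed baseline.
import json
from triton.testing import do_bench

BASELINE_FILE = "benchmarks/baseline.json"   # e.g. {"fwd_bwd_ms": 3.20}
TOLERANCE = 1.05

def check_regression(run_benchmark):
    with open(BASELINE_FILE) as f:
        baseline_ms = json.load(f)["fwd_bwd_ms"]
    measured_ms = do_bench(run_benchmark)
    assert measured_ms <= baseline_ms * TOLERANCE, (
        f"perf regression: {measured_ms:.2f} ms vs baseline {baseline_ms:.2f} ms")
```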
### Acceptance criteria
- Public repo with the kernel, tests, benchmarks, and a clear README.
- One submitted PR (even if not merged) to a real project: vLLM (as a backend), Liger Kernel, or pytorch/ao.
- A blog post analyzing one design choice in the kernel: block sizes, software-pipelining stages, or register-pressure tradeoffs.
### Skills exercised
- All months, but most concentrated on Months 2 (GPU programming), 3 (framework integration), and the inference math from Month 5.
## Cross-Track Requirements
Regardless of track:
- The `ai-systems-baseline/` template (Appendix A) integrated.
- ADRs (architecture decision records): ≥3 for non-obvious decisions.
- Threat model: at minimum, one page covering input attacks, supply-chain risk, and infrastructure failure modes.
- Cost model: for your workload, what are the steady-state $/hour and $/output-unit? (A worked example follows this list.)
- Defense readiness: a 60-minute walkthrough you can deliver to a peer or hiring manager.
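
A toy version of the cost-model arithmetic; every number below is a placeholder to be replaced with your measured throughput and your provider's pricing:

```python
# Steady-state cost model: $/hour and $/output-unit. All numbers are placeholders.
gpu_hourly_rate = 2.50          # $/GPU-hour for your instance type
num_gpus = 1
tokens_per_second = 2_500       # sustained output throughput under load

dollars_per_hour = gpu_hourly_rate * num_gpus
dollars_per_million_tokens = dollars_per_hour / (tokens_per_second * 3600) * 1e6
print(f"${dollars_per_hour:.2f}/hour, ${dollars_per_million_tokens:.3f}/M output tokens")
```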
The track choice signals career direction:

- Track 1 (Inference) → inference-engineer roles at frontier labs, latency-sensitive serving teams, model-serving startups.
- Track 2 (Training) → training-infra roles at frontier labs, large-scale-training teams, framework engineering teams.
- Track 3 (Kernels) → GPU performance engineering, compiler/runtime teams (NVIDIA, OpenAI, Meta), specialized inference-accelerator teams.
Pick based on where you want the next interview loop, not on what looks easiest.