
Capstone Projects: Three Tracks, One Choice

Pick one. The work you produce here is what you will describe in interviews and link from your portfolio.


Track 1: Inference Engine (A Mini-vLLM)

Outcome: an LLM inference server you wrote, with paged KV-cache, continuous batching, and at least one of (FP8 weights, AWQ INT4, speculative decoding). Benchmarked within 2× of production vLLM on a 7B model.

Functional spec

  • HTTP API: POST /v1/completions and POST /v1/chat/completions (a subset of OpenAI's API).
  • Server-Sent Events (SSE) streaming output (a minimal endpoint sketch follows this list).
  • Continuous batching with paged KV-cache.
  • One quantization scheme (your choice: AWQ W4A16 with Marlin kernel, or FP8 weights via TransformerEngine).
  • Optionally: prefix caching, speculative decoding.
  • Health, readiness, metrics endpoints.
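
A minimal sketch of the streaming half of that API, assuming FastAPI; engine.generate is a placeholder for your engine's async token generator, not a real library call:

# Sketch only: SSE streaming for POST /v1/completions, assuming FastAPI.
# `engine.generate(...)` is a placeholder for your engine's async token
# generator -- it is not a real library call.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()
engine = ...  # your inference engine, constructed at startup

class CompletionRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 256
    stream: bool = False

@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    async def event_stream():
        async for token in engine.generate(req.prompt, max_tokens=req.max_tokens):
            chunk = {"choices": [{"index": 0, "text": token}]}
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")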

Non-functional spec

  • Throughput within 2× of production vLLM for a 7B model on the same hardware. (vLLM is the bar; matching it is implausible in 24 weeks. Within 2× is achievable and impressive.)
  • TTFT (time to first token) p99 < 1 s for 1K-token prompts under steady-state load.
  • TPOT (time per output token) p99 < 30 ms after the first token (a measurement sketch follows this list).
  • Memory: stable under 8-hour load; no leaks.
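
TTFT and TPOT are easy to mis-measure; a small helper (assuming your load generator records per-token timestamps) pins down the definitions used here:

# Sketch: computing TTFT / TPOT p99 from per-request token timestamps.
# `requests` is a list of (request_start_time, [token_emit_times...]) pairs
# recorded by your load generator.
import numpy as np

def latency_percentiles(requests, pct=99):
    ttft = [toks[0] - start for start, toks in requests if toks]
    tpot = [(toks[-1] - toks[0]) / (len(toks) - 1)
            for _, toks in requests if len(toks) > 1]
    return np.percentile(ttft, pct), np.percentile(tpot, pct)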

Architecture sketch

HTTP server (FastAPI/Axum)
        │
        ▼
Request queue ──► Scheduler (Python or Rust)
                    ├──► Block manager (page table, free list)
                    └──► Model runner (Python/Triton/CUDA)
                              │
                              ▼
                        Token streamer
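
One possible shape for the block manager named above: fixed-size KV-cache pages, a free list, and a per-sequence page table. This is an illustrative sketch, not a prescribed design:

# Sketch: paged KV-cache block manager -- fixed-size pages, a free list,
# and a per-sequence page table. Eviction/preemption policy is up to you.
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size              # tokens per KV-cache page
        self.free_list = list(range(num_blocks))  # physical block ids
        self.page_tables = {}                     # seq_id -> [block ids]

    def allocate(self, seq_id: int, total_tokens: int) -> None:
        table = self.page_tables.setdefault(seq_id, [])
        needed = -(-total_tokens // self.block_size) - len(table)  # ceil division
        if needed > len(self.free_list):
            raise MemoryError("out of KV-cache blocks; scheduler must preempt")
        for _ in range(max(needed, 0)):
            table.append(self.free_list.pop())

    def free(self, seq_id: int) -> None:
        # Return a finished sequence's blocks to the free list.
        self.free_list.extend(self.page_tables.pop(seq_id, []))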

Test rigor

  • Unit tests for the block manager (allocate/free/leak detection).
  • Integration: warmup load, sustained load, mixed-prompt-length load.
  • Correctness: outputs match a reference Hugging Face implementation for greedy decoding (test sketch after this list).
  • Stress: kill -9 the process under load; restart; verify recovery.
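
A sketch of the greedy-decoding correctness check, assuming Hugging Face transformers as the reference; my_server_complete is a placeholder client for your server, and the model name is only an example:

# Sketch: greedy-decoding equivalence against a Hugging Face reference.
# `my_server_complete(...)` is a placeholder client for your server; the
# model name is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_greedy_matches_hf(prompt="The capital of France is", max_new_tokens=32):
    name = "meta-llama/Llama-2-7b-hf"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).cuda()
    ids = tok(prompt, return_tensors="pt").input_ids.cuda()
    ref_ids = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    reference = tok.decode(ref_ids[0, ids.shape[1]:], skip_special_tokens=True)
    mine = my_server_complete(prompt, max_tokens=max_new_tokens, temperature=0.0)
    assert mine == reference, f"divergence:\n{mine!r}\nvs\n{reference!r}"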

Hardening pass

  • pprof-style metrics; an nsys profile of one full request lifecycle committed to the repo.
  • ncu profile of the attention kernel.
  • Cost/quality matrix.

Acceptance criteria

  • Public repo with build + run + benchmark scripts.
  • A README with: architecture diagram, benchmark table, profiling artifacts, "what's next" section.
  • A blog post explaining one non-obvious decision (e.g., your block size choice, your eviction policy).

Skills exercised

  • All months. Heaviest on Months 2 (kernels), 3 (framework integration), and 5 (serving).

Track 2: Training Systems (FSDP From Scratch)

Outcome: a working sharded data-parallel training implementation, written from scratch on top of torch.distributed primitives. Trains a small transformer on 4–8 GPUs across 1–2 nodes with documented scaling efficiency.

Functional spec

  • A MyFSDP wrapper (core collective flow sketched after this list) that:
      • Shards parameters across ranks.
      • All-gathers parameters before forward; frees them afterwards.
      • Reduces gradients via reduce-scatter.
      • Supports activation checkpointing.
      • Supports mixed precision (BF16 compute, FP32 master weights).
      • Supports gradient accumulation.
      • Writes resumable checkpoints.
  • A reference training script that uses it to train a ~500M-parameter transformer on a tokenized corpus.
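
A minimal sketch of the collective pattern behind the wrapper, built on torch.distributed primitives. It omits flat-parameter grouping, prefetching, compute/communication overlap, mixed precision, and checkpointing; names and structure are illustrative:

# Sketch: the shard / all-gather / reduce-scatter pattern for one parameter,
# on raw torch.distributed primitives.
import torch
import torch.distributed as dist

class ShardedParam:
    def __init__(self, full_param: torch.Tensor):
        world, rank = dist.get_world_size(), dist.get_rank()
        flat = full_param.detach().flatten()
        pad = (-flat.numel()) % world                   # pad so the split is even
        flat = torch.cat([flat, flat.new_zeros(pad)])
        self.numel, self.shape = full_param.numel(), full_param.shape
        self.shard = flat.chunk(world)[rank].clone()    # only this slice stays resident

    def gather(self) -> torch.Tensor:
        # All-gather the full parameter right before forward/backward needs it;
        # the caller frees the returned tensor immediately after use.
        full = torch.empty(self.shard.numel() * dist.get_world_size(),
                           dtype=self.shard.dtype, device=self.shard.device)
        dist.all_gather_into_tensor(full, self.shard)
        return full[: self.numel].view(self.shape)

    def reduce_grad(self, full_grad: torch.Tensor) -> torch.Tensor:
        # Reduce-scatter the full gradient so each rank keeps only its shard's slice.
        flat = full_grad.flatten()
        flat = torch.cat([flat, flat.new_zeros((-flat.numel()) % dist.get_world_size())])
        grad_shard = torch.empty_like(flat.chunk(dist.get_world_size())[0])
        dist.reduce_scatter_tensor(grad_shard, flat, op=dist.ReduceOp.AVG)
        return grad_shard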

Non-functional spec

  • Scaling efficiency ≥85% on 8 GPUs (single node) versus the single-GPU baseline (efficiency defined as in the sketch after this list).
  • Scaling efficiency ≥75% on 16 GPUs across 2 nodes.
  • Resumed run produces identical loss (within 1e-4) compared to a continuous run.
  • Throughput within 30% of PyTorch's native FSDP-2 on the same workload.
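
Scaling efficiency here means measured throughput relative to perfect linear scaling from the single-GPU baseline:

# Sketch: scaling efficiency = measured throughput / perfect-linear-scaling throughput.
def scaling_efficiency(tokens_per_sec_n: float, tokens_per_sec_1: float, n_gpus: int) -> float:
    return tokens_per_sec_n / (n_gpus * tokens_per_sec_1)

# Example: 8 GPUs delivering 6.9x the single-GPU throughput -> 0.8625, i.e. ~86%,
# which clears the >=85% single-node bar.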

Test rigor

  • Numerical correctness: 4-rank MyFSDP matches single-rank reference for one full training step (allclose at 1e-3 in BF16).
  • Memory measurement: peak HBM matches model_size / num_ranks + activation_overhead.
  • Failure injection: kill one rank mid-epoch; observe NCCL timeout; document the recovery path.

Hardening pass

  • NCCL tuning (NCCL_IB_HCA, NCCL_SOCKET_IFNAME, NCCL_TOPO_FILE) documented.
  • nsys profile showing allgather/reduce-scatter overlap with compute.
  • Cost calculation (GPU-hours × $/GPU-hour, plus the extra cost attributable to sub-100% scaling efficiency).
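
A worked example of that calculation, with assumed (illustrative) prices:

# Sketch: the cost arithmetic under assumed, illustrative numbers.
gpus = 16
hours = 12.0                  # measured wall-clock time for the run
price_per_gpu_hour = 2.50     # assumed cloud price, $/GPU-hour
efficiency = 0.75             # measured 2-node scaling efficiency

run_cost = gpus * hours * price_per_gpu_hour                 # $480 for this run
ideal_hours = hours * efficiency                             # 9 h if scaling were perfect
scaling_overhead = gpus * (hours - ideal_hours) * price_per_gpu_hour  # $120 lost to imperfect scaling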

Acceptance criteria

  • Public repo with infra-as-code (Terraform/Ansible) for bringing up a 2-node cluster, plus the FSDP code, plus the training script.
  • A SCALING_REPORT.md with the efficiency numbers, the optimization journey (each tuning step's effect), and one comparison against native FSDP-2.

Skills exercised

  • All months. Heaviest on Months 3 (framework) and 4 (distributed).

Track 3: GPU Kernels (A Competitive Fused Attention)

Outcome: a fused attention kernel in Triton (and optionally CUTLASS), competitive with FlashAttention-2 for at least one common shape regime, complete with profiling, autograd, and a tested PyTorch integration.

Functional spec

  • A Triton kernel implementing causal flash-attention (forward + backward).
  • Configurable for: BF16 / FP16, head dim 64/128, GQA support.
  • Drop-in replacement for F.scaled_dot_product_attention over the supported shape range (integration sketch after this list).
  • Numerically equivalent to the reference (allclose at 1e-3 BF16).
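
One way to expose the kernel as that drop-in replacement is a torch.autograd.Function wrapper; flash_fwd_triton and flash_bwd_triton below are placeholders for your kernel launchers:

# Sketch: exposing the Triton kernels as a drop-in for F.scaled_dot_product_attention.
# `flash_fwd_triton` / `flash_bwd_triton` are placeholders for your kernel launchers.
import torch

class FusedCausalAttention(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v):
        # Forward kernel returns the output plus the row-wise log-sum-exp that
        # the backward kernel needs.
        out, lse = flash_fwd_triton(q, k, v, causal=True)
        ctx.save_for_backward(q, k, v, out, lse)
        return out

    @staticmethod
    def backward(ctx, dout):
        q, k, v, out, lse = ctx.saved_tensors
        dq, dk, dv = flash_bwd_triton(q, k, v, out, lse, dout, causal=True)
        return dq, dk, dv

def fused_sdpa(q, k, v):
    # Same (B, H, S, D) layout as F.scaled_dot_product_attention(..., is_causal=True).
    return FusedCausalAttention.apply(q, k, v)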

Non-functional spec

  • Within 1.5× of FlashAttention-2 forward+backward time at one chosen shape (e.g., B=4, H=32, S=4096, D=128).
  • Validated on at least two GPU classes (e.g., A100 + H100 if both accessible; A100 + RTX 4090 acceptable).
  • Compiles via torch.compile without graph breaks when used in a small transformer.
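
One simple way to enforce that requirement in CI: compile with fullgraph=True, which raises on any graph break instead of silently falling back:

# Sketch: fullgraph=True makes torch.compile raise on any graph break, so a
# plain forward pass doubles as the "no graph breaks" check in CI.
import torch

def check_no_graph_breaks(model: torch.nn.Module, example_input: torch.Tensor) -> None:
    compiled = torch.compile(model, fullgraph=True)
    with torch.no_grad():
        compiled(example_input)   # raises if Dynamo would have to break the graph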

Test rigor

  • Correctness: random-input testing against a reference attention implementation; gradient testing via torch.autograd.gradcheck on an FP32 reference path (test sketch after this list).
  • Perf: ncu reports for forward and backward; attached to the repo.
  • Edge cases: short sequences, variable-length (padding-aware), large batch.
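
A sketch of the correctness tests, using F.scaled_dot_product_attention as the reference and the fused_sdpa wrapper from the functional-spec sketch:

# Sketch: correctness tests against F.scaled_dot_product_attention, using the
# `fused_sdpa` wrapper from the earlier sketch.
import torch
import torch.nn.functional as F

def _rand_qkv(B=2, H=8, S=512, D=64, grad=False):
    torch.manual_seed(0)
    return [torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16,
                        requires_grad=grad) for _ in range(3)]

def test_forward_matches_reference():
    q, k, v = _rand_qkv()
    ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.testing.assert_close(fused_sdpa(q, k, v), ref, atol=1e-3, rtol=1e-3)

def test_backward_matches_reference():
    q, k, v = _rand_qkv(grad=True)
    F.scaled_dot_product_attention(q, k, v, is_causal=True).sum().backward()
    ref_grads = [t.grad.clone() for t in (q, k, v)]
    for t in (q, k, v):
        t.grad = None
    fused_sdpa(q, k, v).sum().backward()
    for got, want in zip((q.grad, k.grad, v.grad), ref_grads):
        torch.testing.assert_close(got, want, atol=1e-2, rtol=1e-2)
    # For a stricter check, run torch.autograd.gradcheck on an FP32/FP64 path
    # of the kernel with tiny shapes.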

Hardening pass

  • Autotune configs documented with rationale.
  • Performance-regression CI (lock benchmark numbers; alert on >5% regression).

Acceptance criteria

  • Public repo with the kernel, tests, benchmarks, and a clear README.
  • One submitted PR (even if not merged) to a real project: vLLM (as a backend), Liger Kernel, or pytorch/ao.
  • A blog post analyzing one design choice in the kernel: block sizes, software-pipelining stages, or register-pressure tradeoffs.

Skills exercised

  • All months, but most concentrated on Months 2 (GPU programming), 3 (framework integration), and the inference math from Month 5.

Cross-Track Requirements

Regardless of track:

  • ai-systems-baseline/ template (Appendix A) integrated.
  • ADRs: ≥3 for non-obvious decisions.
  • Threat model: at minimum, one page covering input attacks, supply-chain risk, and infrastructure failure modes.
  • Cost model: for your workload, what is the steady-state $/hour and $/output-unit?
  • Defense readiness: a 60-minute walkthrough you can deliver to a peer or hiring manager.

The track choice signals career direction:

  • Track 1 (Inference) → inference-engineer roles at frontier labs, latency-sensitive serving teams, model-serving startups.
  • Track 2 (Training) → training-infra roles at frontier labs, large-scale-training teams, framework engineering teams.
  • Track 3 (Kernels) → GPU performance engineering, compiler/runtime teams (NVIDIA, OpenAI, Meta), specialized inference-accelerator teams.

Pick based on where you want the next interview loop, not on what looks easiest.
