# Capstone Projects: Three Tracks, One Choice
Pick one. The work performed here is what you describe in interviews and link from a portfolio.
## Track 1 - Inference Engine: A Mini-vLLM
Outcome: an LLM inference server you wrote, with paged KV-cache, continuous batching, and at least one of (FP8 weights, AWQ INT4, speculative decoding). Benchmarked within 2× of production vLLM on a 7B model.
### Functional spec
- HTTP API: `POST /v1/completions` and `POST /v1/chat/completions` (a subset of OpenAI's API).
- Server-sent-events (SSE) streaming output (a minimal endpoint sketch follows this list).
- Continuous batching with paged KV-cache.
- One quantization scheme (your choice: AWQ W4A16 with Marlin kernel, or FP8 weights via TransformerEngine).
- Optionally: prefix caching, speculative decoding.
- Health, readiness, metrics endpoints.
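
A minimal sketch of the streaming endpoint, assuming FastAPI; `generate_tokens` is a hypothetical hook into your scheduler, and the payload shape only loosely mirrors OpenAI's:

```python
# Minimal SSE streaming sketch for the completions endpoint (FastAPI).
# `generate_tokens` is a hypothetical hook into your scheduler/model runner.
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128
    stream: bool = True

async def generate_tokens(prompt: str, max_tokens: int):
    # Placeholder: in the real server this yields tokens as the scheduler
    # produces them for this request.
    for tok in ["Hello", ",", " world"][:max_tokens]:
        yield tok

@app.post("/v1/completions")
async def completions(req: CompletionRequest):
    async def event_stream():
        async for tok in generate_tokens(req.prompt, req.max_tokens):
            # One SSE event per token chunk, OpenAI-style payload shape.
            yield f"data: {json.dumps({'choices': [{'text': tok}]})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```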
### Non-functional spec
- Throughput within 2× of production vLLM for a 7B model on the same hardware. (vLLM is the bar; matching it is implausible in 24 weeks. Within 2× is achievable and impressive.)
- TTFT (time to first token) p99 < 1 s for 1K-token prompts under steady-state load.
- TPOT (time per output token) p99 < 30 ms after the first token (a measurement sketch follows this list).
- Memory: stable under 8-hour load; no leaks.
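
A rough client-side harness for these two latency metrics, assuming the SSE format sketched above; the request shape is illustrative, and percentiles over many runs under load give the p99 numbers:

```python
# Rough client-side TTFT/TPOT measurement against the streaming endpoint.
import time
import requests

def measure(url: str, prompt: str, max_tokens: int = 256):
    start = time.perf_counter()
    ttft, inter_token = None, []
    last = start
    with requests.post(url, json={"prompt": prompt, "max_tokens": max_tokens},
                       stream=True, timeout=120) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            now = time.perf_counter()
            if ttft is None:
                ttft = now - start               # time to first token
            else:
                inter_token.append(now - last)   # per-token latency (TPOT samples)
            last = now
    return ttft, inter_token
```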
### Architecture sketch
```
HTTP server (FastAPI/Axum)
        │
        ▼
Request queue ──► Scheduler (Python or Rust)
        │
        ▼
Block manager (page table, free list)
        │
        ▼
Model runner (Python/Triton/CUDA)
        │
        ▼
Token streamer
```
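
The block manager is the piece the unit tests below target. A minimal sketch of the paged KV-cache bookkeeping, with hypothetical names and no eviction policy:

```python
# Minimal paged KV-cache block manager sketch: fixed-size blocks, a free
# list, and a per-sequence page table. Names and sizes are illustrative.
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens per KV block
        self.free_list = list(range(num_blocks))
        self.page_tables = {}                   # seq_id -> [physical block ids]

    def can_allocate(self, num_blocks: int) -> bool:
        return len(self.free_list) >= num_blocks

    def allocate(self, seq_id: int, num_tokens: int) -> list[int]:
        needed = -(-num_tokens // self.block_size)   # ceiling division
        if not self.can_allocate(needed):
            raise MemoryError("no free KV blocks; caller should preempt or evict")
        blocks = [self.free_list.pop() for _ in range(needed)]
        self.page_tables.setdefault(seq_id, []).extend(blocks)
        return blocks

    def free(self, seq_id: int) -> None:
        # Return all blocks owned by a finished or preempted sequence.
        self.free_list.extend(self.page_tables.pop(seq_id, []))
```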
### Test rigor
- Unit tests for the block manager (allocate/free/leak detection).
- Integration: warmup load, sustained load, mixed-prompt-length load.
- Correctness: outputs match a reference Hugging Face implementation for greedy decoding (a comparison sketch follows this list).
- Stress: `kill -9` the process under load; restart; verify recovery.
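
A sketch of the greedy-decoding comparison; `my_server_generate` is a hypothetical client for your server, and exact string equality is a strict check (token-level comparison is a pragmatic fallback when numerics legitimately differ):

```python
# Greedy-decoding correctness check against a Hugging Face reference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def reference_greedy(model_name: str, prompt: str, max_new_tokens: int) -> str:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16).cuda()
    ids = tok(prompt, return_tensors="pt").input_ids.cuda()
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def test_greedy_matches_reference(my_server_generate, model_name, prompts):
    # `my_server_generate` is your server's client call (hypothetical signature).
    for p in prompts:
        expected = reference_greedy(model_name, p, max_new_tokens=64)
        actual = my_server_generate(p, max_tokens=64, temperature=0.0)
        assert actual == expected, f"divergence on prompt: {p!r}"
```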
### Hardening pass
- `pprof`-style metrics; an `nsys` profile of one full request lifecycle committed to the repo.
- An `ncu` profile of the attention kernel.
- Cost/quality matrix.
### Acceptance criteria
- Public repo with build + run + benchmark scripts.
- A README with: architecture diagram, benchmark table, profiling artifacts, "what's next" section.
- A blog post explaining one non-obvious decision (e.g., your block size choice, your eviction policy).
### Skills exercised
- All months. Heaviest on Months 2 (kernels), 3 (framework integration), 5 (serving).
## Track 2 - Training Systems: FSDP From Scratch
Outcome: a working sharded data-parallel training implementation, written from scratch on top of `torch.distributed` primitives. Trains a small transformer on 4–8 GPUs across 1–2 nodes with documented scaling efficiency.
### Functional spec
- A `MyFSDP` wrapper (sketched after this list) that:
  - Shards parameters across ranks.
  - All-gathers parameters before forward; frees them after.
  - Reduces gradients via reduce-scatter.
  - Supports activation checkpointing.
  - Mixed precision (BF16 compute, FP32 master).
  - Gradient accumulation.
  - Resumable checkpoints.
- A reference training script that uses it to train a ~500M parameter transformer on a tokenized corpus.
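
A minimal sketch of the core communication pattern, assuming an NCCL process group is already initialized. It operates on one flat tensor rather than per-module parameters, so it illustrates the moves `MyFSDP` makes, not the wrapper itself:

```python
# Toy sketch of the core MyFSDP moves on top of torch.distributed: shard a
# flat FP32 master copy, all-gather BF16 params for compute, reduce-scatter
# grads back to shards. Real FSDP does this per-module with autograd hooks.
import torch
import torch.distributed as dist

def shard(flat: torch.Tensor, rank: int, world: int) -> torch.Tensor:
    # Pad so the flat parameter splits evenly, then keep this rank's slice.
    padded = torch.nn.functional.pad(flat, (0, (-flat.numel()) % world))
    return padded.chunk(world)[rank].clone()

def gather_params(local_shard: torch.Tensor, world: int) -> torch.Tensor:
    # All-gather BF16 shards into the full parameter before forward.
    full = torch.empty(local_shard.numel() * world,
                       dtype=torch.bfloat16, device=local_shard.device)
    dist.all_gather_into_tensor(full, local_shard.to(torch.bfloat16))
    return full  # free (`del`) right after use to reclaim memory

def reduce_scatter_grads(full_grad: torch.Tensor, world: int) -> torch.Tensor:
    # Each rank keeps only its shard of the averaged gradient.
    local = torch.empty(full_grad.numel() // world,
                        dtype=full_grad.dtype, device=full_grad.device)
    dist.reduce_scatter_tensor(local, full_grad, op=dist.ReduceOp.AVG)
    return local
```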
### Non-functional spec
- Scaling efficiency ≥85% on 8 GPUs (single node) vs the single-GPU baseline (see the efficiency calculation after this list).
- Scaling efficiency ≥75% on 16 GPUs across 2 nodes.
- A resumed run produces identical loss (within 1e-4) to an uninterrupted run.
- Throughput within 30% of PyTorch's native FSDP-2 on the same workload.
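
Scaling efficiency here means measured N-GPU throughput relative to N copies of the single-GPU baseline; a toy calculation with made-up numbers:

```python
# Scaling efficiency as used in the targets above. All numbers are placeholders.
def scaling_efficiency(tokens_per_sec_n: float, tokens_per_sec_1: float, n_gpus: int) -> float:
    return tokens_per_sec_n / (n_gpus * tokens_per_sec_1)

print(scaling_efficiency(tokens_per_sec_n=68_000, tokens_per_sec_1=9_500, n_gpus=8))  # ≈ 0.89
```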
### Test rigor
- Numerical correctness: 4-rank `MyFSDP` matches a single-rank reference for one full training step (allclose at 1e-3 in BF16).
- Memory measurement: peak HBM matches `model_size / num_ranks + activation_overhead` (a measurement sketch follows this list).
- Failure injection: kill one rank mid-epoch; observe the NCCL timeout; document the recovery path.
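
A sketch of the peak-memory check, using the CUDA caching allocator's peak statistic as a proxy for HBM use; `step_fn` and the budget terms are placeholders:

```python
# Peak-memory check: reset allocator stats, run one step, compare against the
# expected parameter shard plus an activation allowance (both supplied by you).
import torch

def check_peak_memory(step_fn, expected_shard_bytes: int,
                      activation_budget_bytes: int) -> None:
    torch.cuda.reset_peak_memory_stats()
    step_fn()                                   # one forward/backward/optimizer step
    peak = torch.cuda.max_memory_allocated()    # allocator peak, a lower bound on HBM
    budget = expected_shard_bytes + activation_budget_bytes
    assert peak <= budget, (
        f"peak {peak / 2**30:.2f} GiB exceeds budget {budget / 2**30:.2f} GiB")
```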
### Hardening pass
- NCCL tuning (`NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, `NCCL_TOPO_FILE`) documented.
- An `nsys` profile showing all-gather/reduce-scatter overlap with compute.
- Cost calculation (GPU-hours × $/hr × scaling efficiency).
### Acceptance criteria
- Public repo with infra-as-code (Terraform/Ansible) for bringing up a 2-node cluster, plus the FSDP code, plus the training script.
- A `SCALING_REPORT.md` with the efficiency numbers, the optimization journey (each tuning step's effect), and one comparison against native FSDP-2.
### Skills exercised
- All months. Heaviest on Months 3 (framework), 4 (distributed).
## Track 3 - GPU Kernels: A Competitive Fused Attention
Outcome: a fused attention kernel in Triton (and optionally CUTLASS), competitive with FlashAttention-2 for at least one common shape regime, complete with profiling, autograd, and a tested PyTorch integration.
### Functional spec
- A Triton kernel implementing causal flash-attention (forward + backward).
- Configurable for: BF16 / FP16, head dim 64/128, GQA support.
- Drop-in replacement for `F.scaled_dot_product_attention` over the supported shape range (a dispatch sketch follows this list).
- Numerically equivalent to the reference (allclose at 1e-3 in BF16).
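
One way to present the drop-in surface is a dispatcher that routes supported shapes to the custom path and falls back to SDPA otherwise. In this sketch `triton_flash_attention` is a stub standing in for your `autograd.Function`-wrapped kernels, so the file runs as-is:

```python
# Drop-in dispatch sketch: custom Triton path for supported shapes/dtypes,
# PyTorch SDPA fallback otherwise. `triton_flash_attention` is hypothetical.
import torch
import torch.nn.functional as F

SUPPORTED_HEAD_DIMS = {64, 128}
SUPPORTED_DTYPES = {torch.bfloat16, torch.float16}

def triton_flash_attention(q, k, v, is_causal):
    # Placeholder for the real kernel path; stubbed with SDPA in this sketch.
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)

def attention(q, k, v, is_causal: bool = True):
    # q, k, v: (batch, heads, seq, head_dim); expand k/v heads upstream for GQA.
    use_custom = (q.is_cuda and q.dtype in SUPPORTED_DTYPES
                  and q.shape[-1] in SUPPORTED_HEAD_DIMS)
    if use_custom:
        return triton_flash_attention(q, k, v, is_causal)
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```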
### Non-functional spec
- Within 1.5× of FlashAttention-2's forward+backward time at one chosen shape (e.g., `B=4, H=32, S=4096, D=128`); a benchmark sketch follows this list.
- Validated on at least two GPU classes (e.g., A100 + H100 if both are accessible; A100 + RTX 4090 is acceptable).
- Compiles via `torch.compile` without graph breaks when used in a small transformer.
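
A rough benchmark at the spec's shape using `triton.testing.do_bench`; run the same harness against your kernel and against the reference, and compare the ratio:

```python
# Forward+backward timing at the spec's shape. Compare the same measurement
# for your kernel and for SDPA/FlashAttention-2; numbers below are illustrative.
import torch
import torch.nn.functional as F
from triton.testing import do_bench

B, H, S, D = 4, 32, 4096, 128
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16,
                       requires_grad=True) for _ in range(3))

def fwd_bwd(fn):
    out = fn(q, k, v)
    out.backward(torch.ones_like(out))
    q.grad = k.grad = v.grad = None

ms_ref = do_bench(lambda: fwd_bwd(
    lambda a, b, c: F.scaled_dot_product_attention(a, b, c, is_causal=True)))
print(f"reference forward+backward: {ms_ref:.2f} ms")
# Repeat with your kernel and check the ratio stays within the 1.5x target.
```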
### Test rigor
- Correctness: random-input testing against reference attention; gradient testing against `torch.autograd.gradcheck` (FP32 reference); a test sketch follows this list.
- Perf: `ncu` reports for forward and backward, attached to the repo.
- Edge cases: short sequences, variable-length (padding-aware) inputs, large batches.
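
A sketch of the forward and backward comparisons against the PyTorch reference, assuming the `attention` dispatcher sketched earlier; the spec's `torch.autograd.gradcheck` run on a small FP32 case complements these checks:

```python
# Random-input correctness checks: forward allclose against SDPA, and analytic
# gradients compared against the reference backward. Shapes are illustrative.
import torch
import torch.nn.functional as F

def test_forward_matches_reference():
    q, k, v = (torch.randn(2, 4, 256, 64, device="cuda", dtype=torch.bfloat16)
               for _ in range(3))
    ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = attention(q, k, v, is_causal=True)
    torch.testing.assert_close(out, ref, atol=1e-3, rtol=1e-3)

def test_backward_matches_reference():
    torch.manual_seed(0)
    shape = (2, 4, 256, 64)
    q1, k1, v1 = (torch.randn(shape, device="cuda", dtype=torch.bfloat16,
                              requires_grad=True) for _ in range(3))
    q2, k2, v2 = (t.detach().clone().requires_grad_(True) for t in (q1, k1, v1))
    F.scaled_dot_product_attention(q1, k1, v1, is_causal=True).sum().backward()
    attention(q2, k2, v2, is_causal=True).sum().backward()
    for ref, got in zip((q1.grad, k1.grad, v1.grad), (q2.grad, k2.grad, v2.grad)):
        torch.testing.assert_close(got, ref, atol=1e-2, rtol=1e-2)
```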
### Hardening pass
- Autotune configs documented with rationale.
- Performance-regression CI (lock benchmark numbers; alert on >5% regression).
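
A minimal shape for that regression gate: a committed baseline file plus a tolerance check in CI. The file name and numbers are illustrative:

```python
# Locked-benchmark regression check: fail CI if the measured time drifts more
# than 5% above the committed baseline.
import json
from triton.testing import do_bench

BASELINE_FILE = "benchmarks/baseline.json"   # e.g. {"fwd_bwd_ms": 3.20}
TOLERANCE = 1.05

def check_regression(run_benchmark):
    with open(BASELINE_FILE) as f:
        baseline_ms = json.load(f)["fwd_bwd_ms"]
    measured_ms = do_bench(run_benchmark)
    assert measured_ms <= baseline_ms * TOLERANCE, (
        f"perf regression: {measured_ms:.2f} ms vs baseline {baseline_ms:.2f} ms")
```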
### Acceptance criteria
- Public repo with the kernel, tests, benchmarks, and a clear README.
- One submitted PR (even if not merged) to a real project: vLLM (as a backend), Liger Kernel, or pytorch/ao.
- A blog post analyzing one design choice in the kernel: block sizes, software-pipelining stages, or register-pressure tradeoffs.
### Skills exercised
- All months, but most concentrated on Months 2 (GPU programming), 3 (framework integration), and the inference math from Month 5.
## Cross-Track Requirements
Regardless of track:
- The `ai-systems-baseline/` template (Appendix A) integrated.
- ADRs (architecture decision records): ≥3 for non-obvious decisions.
- Threat model: at minimum, one page covering input attacks, supply-chain risk, and infrastructure failure modes.
- Cost model: for your workload, what are the steady-state $/hour and $/output-unit? (A worked example follows this list.)
- Defense readiness: a 60-minute walkthrough you can deliver to a peer or hiring manager.
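
A toy version of the cost-model arithmetic; every number below is a placeholder to be replaced with your measured throughput and your provider's pricing:

```python
# Steady-state cost model: $/hour and $/output-unit. All numbers are placeholders.
gpu_hourly_rate = 2.50          # $/GPU-hour for your instance type
num_gpus = 1
tokens_per_second = 2_500       # sustained output throughput under load

dollars_per_hour = gpu_hourly_rate * num_gpus
dollars_per_million_tokens = dollars_per_hour / (tokens_per_second * 3600) * 1e6
print(f"${dollars_per_hour:.2f}/hour, ${dollars_per_million_tokens:.3f}/M output tokens")
```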
The track choice signals career direction:

- Track 1 (Inference) → inference-engineer roles at frontier labs, latency-sensitive serving teams, model-serving startups.
- Track 2 (Training) → training-infra roles at frontier labs, large-scale-training teams, framework engineering teams.
- Track 3 (Kernels) → GPU performance engineering, compiler/runtime teams (NVIDIA, OpenAI, Meta), specialized inference-accelerator teams.
Pick based on where you want the next interview loop, not on what looks easiest.