AI Systems Engineering: A 24-Week Beginner-to-Advanced Mastery Roadmap

Authoring lens: Principal AI Systems Engineer / ML Performance Architect. Target outcome: A graduate of this curriculum is capable of (a) writing competitive GPU kernels in CUDA or Triton, (b) implementing distributed training (FSDP, tensor parallelism) on multi-node clusters, (c) building production-grade inference servers with paged KV-cache and continuous batching, and (d) operating GPU fleets at scale with cost, observability, and safety controls in place.

Crucially: this curriculum sits underneath ML research. It is not "learn ML in 24 weeks." It teaches the systems infrastructure that makes modern ML possible: the GPUs, kernels, frameworks, schedulers, and serving stacks. A graduate cannot necessarily train a frontier model from scratch, but they can make one run 3× faster and serve a million users without melting.


Why This Curriculum Exists

The companion curricula in this repository (RUST_TUTORIAL_PLAN, GO_LEARNING_PLAN, LINUX, CONTAINER_INTERNALS_PLAN, KUBERNETES_PLAN) build the systems-engineering foundation. They are necessary but not sufficient for an AI-engineer career.

The gap they leave is the AI-systems-specific layer: GPU programming, accelerator runtimes, distributed training internals, inference serving, ML-on-Kubernetes patterns. This curriculum closes that gap, and is intentionally readable by a working backend/SRE engineer who has never written a CUDA kernel.

It is also designed to age gracefully. Specific tools (vLLM v0.x, PyTorch 2.x APIs) will shift in 2–4 years. The concepts (memory hierarchy on GPUs, attention math, parallelism patterns, scheduling theory for inference) are durable. Each module marks which is which, so refreshes target the ephemeral and not the spine.


Repository Layout

| File | Purpose |
| --- | --- |
| `00_PRELUDE_AND_PHILOSOPHY.md` | The shape of "AI systems" as a discipline; cost model; reading list; what's durable vs ephemeral. |
| `01_MONTH_FOUNDATIONS.md` | Weeks 1–4. Compute hierarchy, linear algebra, tensors, autograd, training loops. Beginner ramp. |
| `02_MONTH_GPU_PROGRAMMING.md` | Weeks 5–8. GPU architecture, CUDA, memory optimization, Triton. |
| `03_MONTH_FRAMEWORK_INTERNALS.md` | Weeks 9–12. PyTorch dispatcher, torch.compile, JAX/XLA, custom ops. |
| `04_MONTH_DISTRIBUTED_TRAINING.md` | Weeks 13–16. NCCL, DDP/FSDP, tensor + pipeline parallelism, FP8. |
| `05_MONTH_INFERENCE_SYSTEMS.md` | Weeks 17–20. KV-cache, paged attention, continuous batching, quantization, speculative decoding. |
| `06_MONTH_INFRASTRUCTURE_CAPSTONE.md` | Weeks 21–24. ML-on-K8s, observability, safety/eval infra, capstone defense. |
| `APPENDIX_A_HARDENING_AND_OBSERVABILITY.md` | GPU profiling, cost dashboards, fleet ops, model monitoring. |
| `APPENDIX_B_BUILD_FROM_SCRATCH.md` | Reference implementations: attention, layer-norm, optimizer, dataloader, paged-cache. |
| `APPENDIX_C_CONTRIBUTING.md` | Contribution paths to PyTorch, JAX, Triton, vLLM, Hugging Face. |
| `CAPSTONE_PROJECTS.md` | Three tracks: mini-vLLM, FSDP-from-scratch, fused attention kernel. |
| `DEEP_DIVES/` | Eleven self-contained reference chapters (~96K words total), authored to let the reader master each topic without the underlying papers. See `DEEP_DIVES/README.md` for the index. |

The DEEP_DIVES/ Companion

The monthly modules are survey + lab; the deep dives are reference text. Eleven chapters totaling ~96,000 words, each authored to be a self-contained mastery resource for one major topic (derive everything, no external paper required):

  1. `01_GPU_ARCHITECTURE.md` - pair with Month 2 §5.
  2. `02_CUDA_PROGRAMMING.md` - pair with Month 2 §6–7.
  3. `03_TRITON.md` - pair with Month 2 §8.
  4. `04_PYTORCH_INTERNALS.md` - pair with Month 3 §9–10.
  5. `05_JAX_XLA.md` - pair with Month 3 §11.
  6. `06_DISTRIBUTED_TRAINING.md` - pair with Month 4 (all weeks).
  7. `07_ATTENTION_TRANSFORMER.md` - pair with Month 5 §17.
  8. `08_INFERENCE_SERVING.md` - pair with Month 5 §18.
  9. `09_QUANTIZATION.md` - pair with Month 5 §19.
  10. `10_SPECULATIVE_DISAGGREGATION.md` - pair with Month 5 §20.
  11. `11_NUMERICS_AND_MIXED_PRECISION.md` - orthogonal; reference everywhere, anchor in Month 4 §16.

Each chapter is layered: intuition → mechanism → math → numbers → diagrams → code → pitfalls → six worked exercises.
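
To make the layering concrete, the "code" layer of the attention chapter starts from a from-scratch baseline like the sketch below. This is an illustrative minimal version, not the chapter's actual listing (Appendix B holds the reference implementations):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Minimal attention baseline: q, k, v are (batch, heads, seq, head_dim)."""
    # Similarity scores, scaled by sqrt(head_dim) to keep softmax inputs well-ranged.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Optional mask (e.g., causal): blocked positions get -inf before softmax.
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Softmax over keys, then weighted sum of values.
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 8, 128, 64)   # batch=1, 8 heads, seq=128, dim=64
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 8, 128, 64])
```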


How Each Week Is Structured

  1. Conceptual Core: the why, with a mental model.
  2. Mechanical Detail: the how, down to source files in PyTorch/JAX/vLLM/Triton/CUTLASS where relevant.
  3. Lab: a hands-on exercise that cannot be completed without internalizing the concept.
  4. Idiomatic & Diagnostic Drill: nsys, ncu, torch.profiler, dcgm, plus one shape-of-good-code review (see the profiling sketch after this list).
  5. Production Slice: a cost, observability, or reliability micro-task that compounds into a publishable template.
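
A typical diagnostic drill begins from a torch.profiler trace like the one below and asks you to account for every top entry. A minimal sketch; the model and shapes here are stand-in placeholders for the week's real workload:

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(4096, 4096)
x = torch.randn(64, 4096)

# Profile CUDA kernels too when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        loss = model(x).sum()
        loss.backward()

# Top operators by self time; the drill is to explain each row.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```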

Each week is sized for ~12–16 focused hours.


Progression Strategy

Foundations (beginner) ──► GPU Programming ──► Framework Internals
        │                        │                       │
        └────────────┬───────────┴───────────────────────┘
                     ▼
            Distributed Training
                     ▼
             Inference Systems
                     ▼
       Infrastructure & Capstone

The first month is the beginner ramp. From week 5 onward the difficulty climbs steeply. By week 12 you should be reading framework source comfortably; by week 16 distributed-training papers; by week 20 OSDI/SOSP-tier inference papers.


Prerequisites

Hard prerequisites (without these, the curriculum will not stick):

  • Python fluency: classes, decorators, context managers, async basics.
  • Linux fluency: command line, basic systemd, nvidia-smi, cgroup awareness (Linux curriculum weeks 1–10 minimum).
  • C familiarity: pointers, memory layout, make/cmake. You don't need to be an expert; you need to read it.
  • Linear algebra: matrix multiplication, dot product, gradient. The first week refreshes these to the level the curriculum needs; the self-check sketch below shows that level.
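
If you can predict each printed value in this snippet before running it, you meet the linear-algebra and autograd baseline. A quick self-check, assuming PyTorch is installed (used here only because the curriculum standardizes on it):

```python
import torch

# Dot product: 1*4 + 2*5 + 3*6 = 32
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(torch.dot(a, b))                # tensor(32.)

# Matrix multiplication: (2,3) @ (3,4) -> (2,4)
A = torch.randn(2, 3)
B = torch.randn(3, 4)
print((A @ B).shape)                  # torch.Size([2, 4])

# Gradient via autograd: y = x^2, so dy/dx = 2x, and grad at x=2 is 4
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
y.backward()
print(x.grad)                         # tensor(4.)
```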

Soft prerequisites (helpful, not required):

  • Container fluency (CONTAINER_INTERNALS_PLAN weeks 1–3).
  • Kubernetes basics (KUBERNETES_PLAN Month 1).
  • Working knowledge of one ML framework (PyTorch preferred). If you have none, plan to spend 2–3 extra weeks before week 1 doing fast.ai's Practical Deep Learning Part 1.

Hardware:

  • Weeks 1–4: any laptop.
  • Weeks 5–8: access to at least one NVIDIA GPU (RTX 30/40 series fine; cloud T4/L4/A10G fine). RunPod, Lambda Labs, Vast.ai are budget-friendly.
  • Weeks 13–16: access to at least 2 GPUs on one node, and ideally 4–8 across two nodes for one week of distributed-training labs. Cloud-rented; ~$200–500 budget.
  • Weeks 17–20: a single A100 or H100 (cloud-rented, ~$2/hour) for two of the four labs. Smaller GPUs work for the others.
  • Weeks 21–24: depends on capstone choice.

If hardware budget is tight: do everything you can on Colab + a single rented L4 ($0.50–0.80/hour). The curriculum's designs are sized to fit.


Capstone Tracks (pick one in Month 6)

  1. Inference Engine Track: build a mini-vLLM in Python+CUDA with paged KV-cache, continuous batching, and FP8 weight quantization. Benchmark within 2× of production vLLM on a 7B model. (The paged-cache idea is sketched below.)
  2. Training Systems Track: implement FSDP-style sharded data parallelism from scratch using NCCL collectives. Train a small transformer on 4–8 GPUs across 2 nodes; demonstrate scaling efficiency.
  3. GPU Kernel Track: author a fused attention kernel in Triton (and optionally CUTLASS) competitive with FlashAttention-2 for one shape regime. Profile, document, contribute upstream.

Details in CAPSTONE_PROJECTS.md.
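
For orientation on Track 1: the heart of a paged KV-cache is a block table mapping each sequence to fixed-size physical cache blocks allocated on demand, so short sequences never reserve max-length memory. The sketch below shows only that bookkeeping; all names are illustrative (not vLLM's API), and a real engine stores the K/V tensors themselves in GPU memory:

```python
from collections import defaultdict

BLOCK_SIZE = 16     # tokens per cache block (illustrative choice)
NUM_BLOCKS = 1024   # physical blocks in the cache pool

class PagedKVCache:
    """Bookkeeping sketch: block table + free list, no actual tensors."""

    def __init__(self):
        self.free_blocks = list(range(NUM_BLOCKS))  # pool of physical blocks
        self.block_tables = defaultdict(list)       # seq_id -> [block ids]
        self.lengths = defaultdict(int)             # seq_id -> tokens stored

    def append_token(self, seq_id):
        # A new physical block is needed only when the last one is full.
        if self.lengths[seq_id] % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("pool exhausted: preempt or swap a sequence")
            self.block_tables[seq_id].append(self.free_blocks.pop())
        self.lengths[seq_id] += 1

    def free(self, seq_id):
        # A finished sequence returns all its blocks to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache()
for _ in range(40):                       # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token("seq-0")
print(len(cache.block_tables["seq-0"]))   # 3
```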


What This Curriculum Does Not Cover

To set expectations honestly:

  • Foundational ML theory: backprop derivation, optimization theory, generalization bounds. Use a separate course (Bishop, Goodfellow, fast.ai).
  • Model architecture research: designing new attention variants, investigating scaling laws. That is the research-scientist track; this curriculum is the systems-engineer track.
  • Computer vision / RL / classical ML pipelines: the focus is the transformer-LLM stack, because that is where the systems-engineering pressure sits in 2026 and likely through 2030. CV and RL infra share many primitives; the deltas are well-trodden.
  • Prompt engineering, agents, RAG architecture: application-layer concerns. Important, and well covered elsewhere.
  • AI ethics, governance, policy: flagged in week 23 but not deep-dived.

A Note on the Field's Velocity

This is the fastest-moving area in software in 2026. The curriculum copes by:

  1. Anchoring each week to at least one peer-reviewed paper that is unlikely to be invalidated.
  2. Distinguishing algorithmic content (durable: attention math, ZeRO, paged attention) from API content (ephemeral: vLLM v0.x flags, PyTorch 2.x compile modes).
  3. Pointing at source repositories that are likely to remain canonical (PyTorch core, JAX, Triton, vLLM, FlashAttention).

When the curriculum says "in 2026 this is true," it is dated for a reason. Re-evaluate yearly.
