
Prelude: The Shape of the Discipline

Sit with this document for an evening before week 1. It is the only place in the curriculum where we step back from mechanics and define what "AI systems engineering" actually is.


1. Two Disciplines, One Field

Modern AI is built by two cooperating disciplines that often share a name:

|  | ML Research | AI Systems Engineering |
| --- | --- | --- |
| Question | What should we compute? | How do we compute it efficiently? |
| Output | New architectures, training recipes, benchmark wins. | Faster kernels, larger models possible, lower-cost inference, higher reliability. |
| Optimizes | Loss curves. | Tokens/sec/dollar. |
| Reads | NeurIPS, ICML, ICLR. | OSDI, SOSP, MLSys, ASPLOS, ISCA. |
| Writes in | Python + PyTorch at a high level. | CUDA, Triton, C++, Rust, the framework's internals. |

This curriculum trains the second discipline. You finish able to take any model the research team hands you and make it train on your hardware, make it run in production, make it cheap, make it observable.

The economic pressure in 2026 is overwhelmingly on the second discipline. Every frontier-lab paper costs millions in compute; every percent of efficiency saves real money; every inference architecture decision compounds across millions of users. The half-life of a research idea is a year; the half-life of the systems infrastructure that serves it is a decade.


2. The Five-Axis Cost Model

A working AI systems engineer reasons along five axes simultaneously (a worked example of the first three follows the table):

| Axis | Question to ask |
| --- | --- |
| FLOPs | How many floating-point ops does this op cost? Is it compute-bound? |
| Bytes | How much data moves between memory tiers (HBM ↔ SRAM, host ↔ device, node ↔ node)? Is it memory-bound? |
| Arithmetic intensity | FLOPs / Bytes. The ratio that determines whether tensor cores or HBM bandwidth is the limit. |
| Parallelism | Across what axis are we splitting work (batch, sequence, head, layer)? What synchronization cost? |
| Failure | What happens on OOM, NaN, NCCL timeout, preemption, datacenter outage? |
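
A worked example of the first three axes, for a single BF16 GEMM C = A @ B. The shapes are illustrative (roughly an MLP projection during prefill), not prescriptive:

```python
# Worked example: FLOPs, bytes, and arithmetic intensity of one BF16 GEMM,
# C[M, N] = A[M, K] @ B[K, N]. Shapes are illustrative, not prescriptive.
M, K, N = 8192, 4096, 16384      # M = tokens in the batch (prefill); K, N = layer dims
BYTES_PER_ELEM = 2               # BF16

flops = 2 * M * K * N                                   # one multiply + one add per (m, k, n)
bytes_moved = BYTES_PER_ELEM * (M * K + K * N + M * N)  # read A, read B, write C (ideal reuse)
intensity = flops / bytes_moved

print(f"FLOPs               : {flops:.2e}")
print(f"Bytes moved (ideal) : {bytes_moved:.2e}")
print(f"Arithmetic intensity: {intensity:.0f} FLOP/byte")
# Rerun with M = 1 (single-token decode): the weights dominate traffic and the
# intensity collapses to ~1 FLOP/byte, i.e. hopelessly memory-bound.
```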

Beginner ML courses teach axis 1 only ("this model is N billion parameters"). The cost model in production is dominated by axes 2 and 3, which is exactly why FlashAttention exists, why paged attention exists, why mixed precision exists.

The single most important number in modern AI systems engineering is the arithmetic intensity at which a hardware platform crosses from memory-bound to compute-bound. For an H100, that crossover is roughly 295 FLOP/byte (BF16). Below it, you're starving the tensor cores; above it, HBM doesn't matter. Memorize this. Most performance work is moving operations across that crossover.
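
The crossover itself is just the ratio of two datasheet numbers. A quick check, using the commonly quoted H100 SXM figures (assumed here; substitute your own hardware's):

```python
# Ridge point = peak compute / peak memory bandwidth.
# Assumed H100 SXM datasheet figures; swap in your own hardware's numbers.
PEAK_FLOPS = 989e12    # dense BF16 tensor-core peak, FLOP/s
PEAK_BW    = 3.35e12   # HBM3 bandwidth, bytes/s

ridge = PEAK_FLOPS / PEAK_BW
print(f"Crossover: {ridge:.0f} FLOP/byte")   # ~295

def regime(arithmetic_intensity: float) -> str:
    """Below the ridge an op is memory-bound; above it, compute-bound."""
    return "memory-bound" if arithmetic_intensity < ridge else "compute-bound"

print(regime(1.0))      # decode-time GEMV    -> memory-bound
print(regime(2300.0))   # large prefill GEMM  -> compute-bound
```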


3. The Roofline Model

Every performance discussion in this curriculum will use the roofline model (Williams, Waterman, Patterson, 2009; the most useful single paper in computer architecture):

Performance (FLOP/s) = min(
    Peak compute,
    Peak bandwidth × Arithmetic intensity
)

Plot this on log-log axes: a horizontal line at peak compute (the "roof") and a slanted line at peak bandwidth (the "ramp"). Every kernel is a point. If you're under the ramp, you're bandwidth-limited: buy bandwidth (or recompute to reduce bytes moved). If you're under the roof but past the ramp's knee, you're compute-limited: get faster math (tensor cores, lower precision).
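
As a reference point, here is a minimal matplotlib sketch of that picture, reusing the assumed H100 numbers from above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed H100 SXM figures (as above); substitute your hardware's datasheet values.
PEAK_FLOPS = 989e12    # FLOP/s, dense BF16 tensor cores
PEAK_BW    = 3.35e12   # bytes/s, HBM3

ai = np.logspace(-1, 4, 200)                        # arithmetic intensity sweep, FLOP/byte
roofline = np.minimum(PEAK_FLOPS, PEAK_BW * ai)     # min(peak compute, bandwidth * intensity)

plt.loglog(ai, roofline)                            # the ramp and the roof
plt.axvline(PEAK_FLOPS / PEAK_BW, linestyle="--")   # the knee, ~295 FLOP/byte
plt.xlabel("Arithmetic intensity (FLOP/byte)")
plt.ylabel("Attainable performance (FLOP/s)")
plt.title("Roofline sketch")
plt.show()
```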

By week 5, you will sketch this from memory.


4. The Reading List

Foundational papers (read in order, ideally during weeks 1–4):

  1. Williams et al., Roofline: An Insightful Visual Performance Model (2009). The lens.
  2. Vaswani et al., Attention Is All You Need (2017). The architecture that defines this era.
  3. Dao et al., FlashAttention (2022) and FlashAttention-2 (2023). The most important systems paper of the LLM era.
  4. Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). The vLLM paper.
  5. Rajbhandari et al., ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC 2020).
  6. Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2019).
  7. Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022). Continuous batching's origin.
  8. Leviathan, Kalman, Matias, Fast Inference from Transformers via Speculative Decoding (ICML 2023).
  9. Frantar, Ashkboos, Hoefler, Alistarh, GPTQ: Accurate Post-Training Quantization (ICLR 2023). And Lin et al., AWQ (MLSys 2024).

Books:

  • Programming Massively Parallel Processors (Hwu, Kirk, Hajj, 4th ed.). The CUDA textbook.
  • Computer Architecture: A Quantitative Approach (Hennessy & Patterson, 6th ed.). Chapters 4–5 on data-level parallelism. Required if you want to be more than a recipe-follower.
  • Deep Learning (Goodfellow, Bengio, Courville). Chapters 6–8 for the math vocabulary.
  • Designing Machine Learning Systems (Chip Huyen). For the production framing.

Source repositories to bookmark:

  • pytorch/pytorch - the framework. Particularly aten/, torch/csrc/, torch/_inductor/.
  • openai/triton - the GPU DSL.
  • NVIDIA/cutlass - high-performance GEMM templates.
  • vllm-project/vllm - the canonical inference server.
  • Dao-AILab/flash-attention - the kernel.
  • pytorch/FBGEMM, NVIDIA/TransformerEngine - quantization and FP8.
  • google/jax, openxla/xla - the JAX/XLA stack.
  • microsoft/DeepSpeed, NVIDIA/Megatron-LM - training stacks.

Adjacent canon (you must know):

  • The roofline paper, mentioned above.
  • The Linux + Container + Kubernetes curricula for the substrate this all runs on.


5. What's Durable, What's Ephemeral

A 2030 reader will still need most of the conceptual content here but will have to refresh much of the API-level content. The curriculum flags this on a per-week basis. The general pattern:

| Durable (10+ year half-life) | Ephemeral (2–4 year half-life) |
| --- | --- |
| Roofline model | Specific GPU's peak FLOPS |
| Memory hierarchy on GPUs | Ada/Hopper/Blackwell-specific instructions |
| Attention math | The nth FlashAttention variant |
| Parallelism patterns (DP/TP/PP) | Specific FSDP / DeepSpeed APIs |
| Continuous batching theory | vLLM's specific scheduler |
| Quantization theory (INT8, FP8 representability) | AWQ vs GPTQ vs SmoothQuant winner |
| Roofline-aware kernel design | Triton vs CUTLASS vs Mojo trajectories |
| The dispatcher pattern | PyTorch 2.x's exact dispatcher API |
Bias your study toward the left column. The right column is where your code comments will carry "as of 2026" timestamps.


6. Curriculum Philosophy

  1. Paper first, framework docs second. When the curriculum says "study paged attention," it means open the SOSP 2023 paper. Then read the vLLM source. The framework docs are a tertiary source.
  2. Profile before you optimize. Every performance lab requires an nsys or ncu capture before any change (a minimal torch.profiler stand-in is sketched after this list). The most common beginner failure mode is "optimizing" code that wasn't actually slow.
  3. One artifact per phase. End of each month produces a benchmarked, profile-attached, reviewable artifact. The capstone is not a surprise; it's the natural assembly of the monthly artifacts.
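
The torch.profiler stand-in mentioned in rule 2: it is not a substitute for an nsys or ncu capture, but it is a minimal, pure-Python way to see where time actually goes before you change anything. A sketch, assuming a CUDA-capable PyTorch install; the model and shapes are placeholders:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Measure first. This toy model and the shapes stand in for whatever
# you are about to "optimize".
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda().bfloat16()
x = torch.randn(64, 4096, device="cuda", dtype=torch.bfloat16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        model(x)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```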

7. What AI Systems Are Not For

A graduate of this curriculum should be able to argue these points without sounding evangelical:

  • Small datasets, simple problems. Linear/tree models on tabular data still win much of the time. Don't reach for a transformer because it's the new tool.
  • Hard real-time inference. Modern LLMs have unpredictable latency tails. For sub-millisecond hard deadlines, use a small distilled model (or no LLM).
  • Privacy-critical workloads on shared GPUs. Multi-tenancy on a single GPU has unsolved isolation problems (timing channels, memory residue). Use dedicated hardware or a confidential-computing GPU.
  • Greenfield projects with no production traffic. If you're not yet load-bound, the ML systems infrastructure is overkill. Use a managed inference API (Anthropic, OpenAI, Bedrock, Vertex) until you can justify the lift.

The signal that AI systems engineering is the right tool: you have a cost, latency, sovereignty, or scale constraint that ranks above iteration speed.


8. AI-Assisted Workflows (in an AI Systems curriculum)

The recursive irony is unavoidable: you will use LLMs to learn how to build systems for LLMs. Three rules:

  1. Never accept generated CUDA without ncu profiling. Models hallucinate index math. The kernel may compile and produce mostly-correct outputs while having a 10× perf cliff or a subtle race (a minimal reference-check sketch follows this list).
  2. Never accept generated NCCL / distributed code without a race-detector-equivalent check, i.e., running it at the target GPU count. PyTorch's DDP has its own timing assumptions; NCCL has its own deadlock modes. AI-generated all-reduce patterns are the single most common source of "works on 2 GPUs, hangs on 8."
  3. Always read the underlying paper. When the model summarizes paged attention or speculative decoding, the summary is plausibly wrong in ways that matter for implementation. The paper is short. Read it.
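
The reference check mentioned in rule 1, as a sketch. my_generated_softmax is a hypothetical stand-in for whatever kernel the model produced; the reference is plain PyTorch:

```python
import torch

def check_against_reference(candidate, reference, shapes,
                            dtype=torch.bfloat16, trials=20):
    """Compare a (possibly AI-generated) kernel against a trusted PyTorch reference
    on random inputs. This catches wrong index math; it does NOT catch races or
    perf cliffs; those still need compute-sanitizer and an ncu capture."""
    for _ in range(trials):
        for shape in shapes:
            x = torch.randn(*shape, device="cuda", dtype=dtype)
            torch.testing.assert_close(candidate(x), reference(x), rtol=2e-2, atol=2e-2)

# Usage sketch (my_generated_softmax is hypothetical):
# check_against_reference(
#     candidate=my_generated_softmax,
#     reference=lambda x: torch.softmax(x, dim=-1),
#     shapes=[(8, 128, 128), (1, 4096, 4096), (3, 17, 1031)],  # include awkward sizes
# )
```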

You are now ready for Week 1. Open 01_MONTH_FOUNDATIONS.md.
