Deep Dives: Self-Contained Reference Chapters
Twelve chapters that take the AI Systems curriculum from "moderate-depth survey + external paper assignments" to self-contained mastery resources. Each chapter was authored to let a reader master the topic from the document alone, without needing the underlying papers, vendor whitepapers, or framework docs as primary sources.
Total: ~104,000 words / ~14,500 lines across 12 files. Each chapter is 7,000–11,000 words, layered (intuition → mechanism → math → numbers → diagrams), and ends with worked exercises.
Reading Order and Curriculum Mapping
The deep dives are designed to be read in tandem with the monthly modules. Recommended pairing:
| When | Read this deep dive | After the monthly module |
|---|---|---|
| Week 5–6 | `01_GPU_ARCHITECTURE.md` | Month 2 §5 |
| Week 6–7 | `02_CUDA_PROGRAMMING.md` | Month 2 §6–7 |
| Week 8 | `03_TRITON.md` | Month 2 §8 |
| Week 9–10 | `04_PYTORCH_INTERNALS.md` | Month 3 §9–10 |
| Week 11 | `05_JAX_XLA.md` | Month 3 §11 |
| Week 13–16 | `06_DISTRIBUTED_TRAINING.md` | Month 4 (all) |
| Week 17 | `07_ATTENTION_TRANSFORMER.md` | Month 5 §17 |
| Week 18 | `08_INFERENCE_SERVING.md` | Month 5 §18 |
| Week 19 | `09_QUANTIZATION.md` | Month 5 §19 |
| Week 20 | `10_SPECULATIVE_DISAGGREGATION.md` | Month 5 §20 |
| Week 16 + always | `11_NUMERICS_AND_MIXED_PRECISION.md` | Month 4 §16 (referenced everywhere) |
| Week 8 + always | `12_KERNEL_FUSION.md` | Month 2 §8 (referenced from Months 3 and 5 too) |
You can also read the deep dives standalone as a reference text. Topical order:
- Hardware foundation: 01 → 02 → 03
- Framework foundation: 04 → 05
- Numerical foundation: 11 (orthogonal to all others; reference often)
- Training: 06 (which assumes 11)
- Inference architecture: 07 → 08 → 09 → 10
Chapter Index
`01_GPU_ARCHITECTURE.md` - NVIDIA GPU Architecture and Memory Hierarchy
~9,100 words. Throughput vs latency machines; SIMT/SIMD/MIMD; the streaming multiprocessor (warps, schedulers, registers, divergence, ITS); the full memory hierarchy with H100 numbers; tensor cores (WMMA, mma.sync, fragments, all precisions including FP8); 2:4 sparsity; cp.async + TMA + thread-block clusters; occupancy theory derived from first principles with three worked numerical examples; NVLink/NVSwitch/NVL72; Ada and Blackwell deltas (with explicit uncertainty); AMD MI300X contrast; 5 worked exercises.
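To preview the occupancy arithmetic, here is a minimal Python sketch of the resource-limit calculation the chapter derives. The per-SM limits are the published H100 (CC 9.0) values; the kernel's resource usage is hypothetical.

```python
# Hypothetical kernel resources (illustrative, not from any real kernel).
THREADS_PER_BLOCK = 128
REGS_PER_THREAD = 48
SMEM_PER_BLOCK = 16 * 1024            # bytes

# H100 per-SM limits (compute capability 9.0).
MAX_WARPS_PER_SM = 64                 # 2048 resident threads
MAX_REGS_PER_SM = 65_536              # 32-bit registers
MAX_SMEM_PER_SM = 228 * 1024          # bytes, carve-out dependent
MAX_BLOCKS_PER_SM = 32

warps_per_block = THREADS_PER_BLOCK // 32
blocks_by_regs = MAX_REGS_PER_SM // (REGS_PER_THREAD * THREADS_PER_BLOCK)
blocks_by_smem = MAX_SMEM_PER_SM // SMEM_PER_BLOCK
blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
blocks = min(blocks_by_regs, blocks_by_smem, blocks_by_warps, MAX_BLOCKS_PER_SM)

occupancy = blocks * warps_per_block / MAX_WARPS_PER_SM
print(f"{blocks} blocks/SM, occupancy {occupancy:.0%}")   # registers bind: ~62%
```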
`02_CUDA_PROGRAMMING.md` - CUDA From First Kernel to Optimized GEMM
~8,800 words. Programming model, qualifiers, launch syntax; indexing and grid-stride loops; memory transfer (sync/async, pinned, UVM, zero-copy); streams and events with overlap pipelines; error-handling discipline; coalescing rules with worked numerical examples; shared memory bank conflicts and the `[32][33]` padding fix, derived; reductions (four variants, perf evolution); six-stage tiled GEMM walkthrough with code (naive → coalesced → SMEM tiled → register tiled → tensor-core wmma → cp.async double-buffered); nvcuda::wmma and mma.sync PTX; cooperative groups + Hopper TBC; profiling discipline; complete buildable BF16 GEMM at 2048×2048; 6 exercises.
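The flavor of the coalescing arithmetic can be previewed with a toy Python model (illustrative only): count the distinct 32-byte L2 sectors one warp touches when lane i reads `a[i * stride]` in float32.

```python
# Toy coalescing model: fraction of fetched bytes a warp actually uses.
WARP, SECTOR, ELEM = 32, 32, 4        # lanes, bytes/sector, bytes/float32

def sectors_touched(stride: int) -> int:
    # Distinct 32-byte sectors covered by the warp's 32 load addresses.
    return len({lane * stride * ELEM // SECTOR for lane in range(WARP)})

for stride in (1, 2, 32):
    s = sectors_touched(stride)
    useful = WARP * ELEM / (s * SECTOR)
    print(f"stride {stride:2d}: {s:2d} sectors, {useful:.0%} of traffic useful")
# stride 1 -> 4 sectors (fully coalesced); stride 32 -> 32 sectors (12.5% useful).
```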
`03_TRITON.md` - The Triton GPU DSL
~7,000 words. Why Triton (block-level programming model); @triton.jit, program instances, tl.load/store/dot; mask semantics; tl.constexpr specialization; @triton.autotune configs and caching; online softmax derivation (single-element and block forms), Welford, log-sum-exp; six full annotated kernels: vector add, naive matmul, autotuned tiled matmul with L2-friendly swizzle, fused softmax, RMSNorm forward+backward, simplified causal flash-attention; compilation pipeline (Python → MLIR → PTX); torch integration via torch.library.custom_op; nine concrete pitfalls; Triton vs CUTLASS vs hand-CUDA decision table; 6 exercises.
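For a taste of the block-level programming model, a minimal vector-add in the shape the chapter's first kernel takes (a sketch: requires a CUDA device, and BLOCK would normally come from the autotuning section).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)              # one program instance per block
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n                          # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn_like(x)
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```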
`04_PYTORCH_INTERNALS.md` - PyTorch From Tensor to Inductor
~8,400 words. Layered architecture with ASCII trace of a + b; torch.Tensor → at::Tensor → c10::TensorImpl → c10::Storage; strides and views; the dispatcher (DispatchKey priority, key sets, TORCH_LIBRARY_IMPL); native_functions.yaml codegen; autograd engine (dynamic tape, Function/Node, next_edges, custom autograd.Function, version counters); requires_grad vs no_grad vs inference_mode; autocast as a dispatcher layer; torch.compile end-to-end (TorchDynamo, AOTAutograd, Inductor, guards, modes, TORCH_LOGS); modern custom op path (@torch.library.custom_op, register_fake, register_autograd) with Triton example; C++ extension skeleton; CUDA caching allocator with free-list algorithm and stream-aware reuse; profiler internals; 6 exercises.
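The strides-and-views story is easy to see interactively; a minimal demonstration of the chapter's tensor/storage split:

```python
import torch

t = torch.arange(12, dtype=torch.float32).reshape(3, 4)
print(t.stride())            # (4, 1): elements to skip per step in each dim
v = t.t()                    # transpose is a view, not a copy
print(v.stride())            # (1, 4): same storage, swapped strides
print(v.is_contiguous())     # False: strides no longer match row-major layout
print(v.data_ptr() == t.data_ptr())   # True: one c10::Storage behind both
```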
`05_JAX_XLA.md` - JAX Transformations and the XLA Compiler
~8,200 words. Why JAX (functional, composable, XLA-default, TPU-first); pure functions as the unit of compilation; PyTrees and tree_util; stateless PRNGs (PRNGKey, split, the three rules); tracing and jaxprs with worked annotated example; jit cache keys, recompilation costs, AOT lowering; grad/vjp/jvp/higher-order/jacrev/custom VJPs; vmap semantics; legacy pmap vs unified jit + Mesh/PartitionSpec; shard_map; structured loops (scan, while_loop); XLA HLO with op table, full pipeline (jaxpr → StableHLO → HLO → device), fusion, layout, GSPMD with worked Megatron-MLP propagation; TPU vs GPU; Equinox/Flax/Optax; pallas; 6 exercises.
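A minimal sketch of the core loop the chapter builds on: stateless PRNG keys, a pure function, the jaxpr it traces to, and composed transformations.

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    # Pure function: no hidden state, so it is a valid unit of compilation.
    return jnp.sum((w * x) ** 2)

key = jax.random.PRNGKey(0)
key, sub = jax.random.split(key)     # rule: split before use, never reuse a key
x = jax.random.normal(sub, (4,))
w = jnp.ones(4)

print(jax.make_jaxpr(loss)(w, x))    # the traced IR the chapter annotates
step = jax.jit(jax.grad(loss))       # transformations compose; XLA compiles once
print(step(w, x))                    # gradient: 2 * w * x**2
```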
`06_DISTRIBUTED_TRAINING.md` - Communication, Parallelism, and Schedule Math
~9,400 words. Memory decomposition (16Φ accounting); collective primitive definitions; derivation of all 5 all-reduce algorithms including ring's bandwidth-optimality proof; NCCL algorithm selection; DDP with bucketing/overlap; full ZeRO-1/2/3 memory math table; FSDP (wrapping, prefetch, mixed precision, checkpointing, CPU offload, FSDP2); Megatron column-/row-parallel derivations and the column→row chain trick; attention/MLP TP layouts; pipeline schedules with ASCII diagrams (naive, GPipe, 1F1B, interleaved 1F1B, Zero Bubble) with bubble formulas; 3D parallelism decision matrix with worked examples for 8B/70B/405B; sequence parallelism; FP16 loss scaling derivation; FP8 (E4M3/E5M2 + per-tensor); profiling for overlap; fault tolerance; cluster topology; 6 exercises.
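The ZeRO memory table reduces to a few lines of arithmetic; a sketch under the chapter's 16Φ accounting (the 8B-on-64-GPUs configuration is chosen here for illustration):

```python
# Per-GPU bytes for model states under 16*Phi accounting:
# FP16 params (2) + FP16 grads (2) + FP32 master weights + Adam m, v (12).
def model_state_bytes_per_gpu(phi: float, n_gpus: int, zero_stage: int) -> float:
    p, g, o = 2 * phi, 2 * phi, 12 * phi
    if zero_stage == 0: return p + g + o            # plain DDP: full replica
    if zero_stage == 1: return p + g + o / n_gpus   # shard optimizer states
    if zero_stage == 2: return p + (g + o) / n_gpus # ...and gradients
    if zero_stage == 3: return (p + g + o) / n_gpus # ...and parameters
    raise ValueError(zero_stage)

GB = 1024 ** 3
for stage in range(4):
    b = model_state_bytes_per_gpu(8e9, 64, stage)   # 8B params on 64 GPUs
    print(f"ZeRO-{stage}: {b / GB:6.1f} GB/GPU")
# ~119 -> ~31 -> ~17 -> ~1.9 GB; activations and fragmentation come on top.
```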
`07_ATTENTION_TRANSFORMER.md` - Transformer Math and FlashAttention
~8,600 words. Autoregressive setup; scaled dot-product attention with the full √dₖ derivation from the variance argument; multi-head; causal masking; MQA/GQA with worked Llama-3 group sizes; RoPE complex-number derivation showing ⟨q′_m, k′_n⟩ depends only on m − n; ALiBi/sliding window/YaRN/NTK/PI; pre-norm vs post-norm; RMSNorm vs LayerNorm; SwiGLU and the 8/3 d_ff ratio; KV-cache math with a Llama-3-70B worked example (~2.5 GB at 8K with GQA); O(S²) cost analysis; full FlashAttention derivation of online softmax with an inductive proof of equivalence to all-at-once softmax; tiled algorithm pseudocode; FA-2 and FA-3 deltas; flash_attn_with_kvcache; 6 exercises.
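The heart of the FlashAttention derivation is the online-softmax recurrence; a scalar sketch of it (the block form in the chapter is the same update applied tile-wise):

```python
import math

def online_softmax_stats(scores):
    # One pass, carrying a running max m and a denominator l that is
    # rescaled whenever m changes: the FlashAttention recurrence.
    m, l = float("-inf"), 0.0
    for s in scores:
        m_new = max(m, s)
        l = l * math.exp(m - m_new) + math.exp(s - m_new)
        m = m_new
    return m, l

scores = [0.5, 2.0, -1.0, 3.0]
m, l = online_softmax_stats(scores)
ref = sum(math.exp(s - max(scores)) for s in scores)
assert math.isclose(l, ref)   # identical to the all-at-once denominator
print(m, l)
```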
`08_INFERENCE_SERVING.md` - Paged Attention, Continuous Batching, vLLM
~8,200 words. Cost-model derivation T_step ≈ (W + b·KV·S) / B_HBM; H100 arithmetic-intensity crossover (~295 FLOP/byte); decode-batch-1 sits at ~1 FLOP/byte; PagedAttention with block pool sizing, page tables, block manager pseudocode, ASCII block-table diagram, fragmentation analysis; Orca-style continuous batching with full scheduler pseudocode; vLLM architecture and engine main loop; chunked prefill (Sarathi-Serve); eviction strategies (swap vs recompute); prefix caching via cumulative content hashing; speculative decoding preview; DistServe/Mooncake/Splitwise disaggregation with KV-transfer engineering; TTFT/TPOT/throughput/goodput SLOs; tuning levers with rules of thumb; 6 exercises.
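The cost model is worth playing with numerically; a sketch using Llama-3-70B shapes (BF16 weights ≈ 140 GB; per-token KV computed from 80 layers × 8 KV heads × head dim 128 × K and V × 2 bytes; ~3 TB/s is the chapter's H100 number):

```python
# Decode-step latency from T_step ~ (W + b * KV * S) / B_HBM.
W = 140e9                  # bytes of weights read per step (70B params, BF16)
KV_PER_TOKEN = 327_680     # 80 * 8 * 128 * 2 * 2 bytes (~320 KB/token, GQA)
B_HBM = 3.0e12             # bytes/s

def t_step(batch: int, seqlen: int) -> float:
    return (W + batch * KV_PER_TOKEN * seqlen) / B_HBM

for b in (1, 16, 64):
    print(f"batch {b:3d} @ 8K ctx: ~{t_step(b, 8192) * 1e3:.0f} ms/step")
# Weight reads dominate at small batch, so batching is nearly free
# until aggregate KV-cache reads catch up with the weight traffic.
```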
`09_QUANTIZATION.md` - Number Formats, AWQ, GPTQ, SmoothQuant, FP8
~10,600 words. Why quantize (arithmetic-intensity argument); number formats (FP32/FP16/BF16/FP8 E4M3/E5M2/INT8/INT4) with bias, range, precision derivations; affine quantization with a full derivation of scale/zero_point; granularity (per-tensor/per-channel/per-group) and the 4.13–4.25 effective-bits computation; outlier theory with Var(e_y) = σ_w² · ‖x‖²; AWQ full derivation of the W·x = (W·diag(s))·(diag(s)⁻¹·x) identity with a worked numerical example; GPTQ derived from Optimal Brain Surgeon with H = 2·X·Xᵀ, Cholesky efficiency, lazy-batch blocks, full pseudocode; SmoothQuant with the α derivation; activation quantization (static/dynamic/per-token); FP8 inference (H100 hardware, TransformerEngine, delayed scaling); Marlin kernel; mixed precision (LLM.int8()); calibration; evaluation discipline; 6 exercises.
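The affine scale/zero_point derivation fits in a few lines; a sketch of the quantize/dequantize round trip (the sample tensor is made up):

```python
import numpy as np

def affine_params(x_min: float, x_max: float, n_bits: int = 8):
    # Map [x_min, x_max] onto the signed grid [-2^(b-1), 2^(b-1) - 1].
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point, qmin, qmax

x = np.array([-1.8, -0.3, 0.0, 0.7, 2.4])
s, z, qmin, qmax = affine_params(x.min(), x.max())
q = np.clip(np.round(x / s) + z, qmin, qmax)   # quantize
x_hat = s * (q - z)                            # dequantize
print(f"scale={s:.5f} zero_point={z} max_err={np.abs(x - x_hat).max():.4f}")
# Round-trip error is bounded by scale/2 for values inside the clip range.
```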
`10_SPECULATIVE_DISAGGREGATION.md` - Speculative Decoding & Disaggregated Inference
~8,900 words. Latency framing (TTFT vs TPOT); speculative decoding with rejection-sampling correctness proof; speedup formula S = α / (1 + K · T_draft / T_target) derived; geometric model α = (1 − β^{K+1}) / (1 − β) with worked numbers; variants (vanilla, self-speculative, Medusa, EAGLE/EAGLE-2, lookahead); tree speculation with attention-mask construction; speculation-batching tension showing speculation hurts at saturated batch; engineering (dual KV-caches, rollback, pipelining); DistServe architecture; Splitwise heterogeneous-hardware angle; Mooncake KVCache-centric design; KV-transfer sizing and overlap; full production stack composition with attribution; frontier directions explicitly marked research-stage; 6 exercises.
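Plugging numbers into the two formulas shows the diminishing returns the chapter derives; a sketch with an assumed per-token acceptance rate β = 0.8 and a draft model 20× faster than the target:

```python
# alpha = (1 - beta^(K+1)) / (1 - beta): expected tokens per round under
# the geometric acceptance model, for draft length K.
def expected_tokens(beta: float, K: int) -> float:
    return (1 - beta ** (K + 1)) / (1 - beta)

# S = alpha / (1 + K * T_draft / T_target): end-to-end speedup.
def speedup(beta: float, K: int, draft_over_target: float) -> float:
    return expected_tokens(beta, K) / (1 + K * draft_over_target)

for K in (2, 4, 8):
    print(f"K={K}: alpha={expected_tokens(0.8, K):.2f}, "
          f"S={speedup(0.8, K, 0.05):.2f}x")
# Longer drafts buy less and less: alpha saturates while draft cost grows.
```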
`11_NUMERICS_AND_MIXED_PRECISION.md` - Floating Point and Training Stability
~8,600 words. IEEE-754 derivation including subnormals, RNE, FMA, fl(a op b) = (a op b)(1+δ) model; full bit layouts and side-by-side range/precision table for FP64/FP32/TF32/FP16/BF16/FP8 E4M3/E5M2/FP4; per-operation precision (matmul accumulator rule, reductions, softmax overflow); standard mixed-precision recipe with full pseudocode; loss scaling derivation including dynamic GradScaler algorithm; BF16 advantages; FP8 in detail with delayed scaling, amax history, full pseudocode, worked numerical example; TF32; Adam + low precision pitfall with stochastic rounding; catastrophic cancellation (naive vs pairwise vs Kahan); transformer stability tricks (stable softmax, online softmax recurrence, √dₖ derivation, logit soft-cap, z-loss); NaN handling; determinism; 6 exercises.
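Compensated summation is short enough to show whole; a sketch of Kahan's algorithm on a sum where every naive addition rounds the small terms away:

```python
import math

def kahan_sum(xs):
    total, comp = 0.0, 0.0           # comp carries the bits each add rounds away
    for x in xs:
        y = x - comp
        t = total + y
        comp = (t - total) - y        # (t - total) is what actually got added
        total = t
    return total

xs = [1.0] + [1e-16] * 1000           # each tiny term alone vanishes against 1.0
print(sum(xs))                        # naive: exactly 1.0, all 1000 terms lost
print(kahan_sum(xs))                  # compensated: ~1.0000000000001
print(math.fsum(xs))                  # exactly rounded reference
```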
`12_KERNEL_FUSION.md` - Kernel Fusion: Theory, Practice, and the Compilers That Do It For You
~8,000 words. The HBM round-trip cost model with worked Llama-3-70B numbers; fusion taxonomy (vertical/horizontal, five patterns); vertical fusion derived with RMSNorm worked example; horizontal fusion with QKV-projection and SwiGLU gate_up fusion math; GEMM epilogue fusion including the "fuse the residual into the matmul" trick; streaming-reduction fusion as the FlashAttention pattern; compiler-driven fusion in XLA, TorchInductor, Triton with production stack composition; three full Triton kernels (fused linear-GELU-residual, single-pass RMSNorm, causal masked online softmax); precision discipline under fusion (cheat-sheet table); register-pressure / SMEM / launch-amortization / tile-mismatch limits with H100 numbers; Nsight Compute metrics for verifying fusion worked; when NOT to fuse (training activations, debug cost, autotune-time inflation); 6 exercises.
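The round-trip cost model is simple byte counting; a sketch for an elementwise chain `y = relu(x + b)` over two equal-sized fp16 tensors (the shapes are illustrative, and ~3 TB/s is the chapter's H100 number):

```python
# Elementwise chains are pure bandwidth, so time ~ bytes / B_HBM.
n = 8192 * 8192
bytes_per = 2                        # fp16

# Unfused: add reads x and b, writes tmp (3n); relu reads tmp, writes y (2n).
unfused = 5 * n * bytes_per
# Fused: read x and b once, write y once (3n); tmp lives in registers.
fused = 3 * n * bytes_per

B_HBM = 3.0e12                       # bytes/s
print(f"unfused ~{unfused / B_HBM * 1e6:.0f} us, "
      f"fused ~{fused / B_HBM * 1e6:.0f} us "
      f"({unfused / fused:.2f}x fewer bytes)")
```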
Anti-Fabrication Discipline
Each chapter was authored under explicit anti-fabrication rules. Numbers are cited according to their provenance:
- Hardware constants (warp = 32 threads, H100 = 132 SMs, 80 GB HBM3, ~3 TB/s HBM3 BW, ~989 TFLOPS BF16 dense): unhedged.
- Algorithm complexities (ring all-reduce bandwidth, FlashAttention HBM access, GPipe bubble): derived in the text.
- Approximate / illustrative numbers (per-layer perf factors, hit rates, speedups in real systems): explicitly hedged with "~" or "approximate" or "verify with vendor docs."
- Research-stage techniques (multi-token prediction, diffusion LMs, FP4 production deployments): flagged as such.
Layered Pedagogy
Every chapter follows the same shape:
- Why the topic exists: what problem it solves.
- Mental model: the right way to think about it.
- Mechanism: how it actually works, step by step.
- Math: derivations, not assertions.
- Numbers: concrete worked examples with specific hardware/model/shape choices.
- Diagrams: ASCII where they clarify.
- Code: for engineering chapters, runnable kernels and pseudocode.
- Pitfalls: the things you'll get wrong on first attempt.
- Exercises: six per chapter, with worked answers.
This is the mastery layout: anyone who reads a chapter end-to-end and completes the exercises has internalized the topic at a working-engineer level.
How to Use This Resource
- As a curriculum companion: read the monthly module, then the matching deep dive, then return to the lab with both as references.
- As a reference text: keep it open in a tab during work and jump to the relevant section by topic.
- As interview prep: each chapter's exercise section is approximately the depth of a senior-level systems-engineering interview. If you can solve the exercises cold, you can answer the interview questions.
- As a teaching resource: each chapter is a self-contained lecture's worth of material. Use it to onboard a new engineer to a sub-topic in a single afternoon.