
Month 5 - Inference Systems: KV-Cache, Paged Attention, Continuous Batching, Quantization, Speculative Decoding

Goal: by the end of week 20 you can (a) explain why LLM inference is memory-bandwidth-bound and how the KV-cache trades recomputation for memory that grows with sequence length, (b) implement a paged KV-cache from scratch, (c) reason about quantization (INT8/INT4/FP8) tradeoffs and pick a scheme for a workload, and (d) implement a basic speculative-decoding loop.
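As a first taste of goal (a), here is a minimal back-of-the-envelope sketch of why single-stream decoding tends to be bandwidth-bound. It assumes every weight is read from memory once per decode step and that a forward pass costs roughly 2 FLOPs per parameter per token; it ignores KV-cache reads and activations. The function name `decode_step_bounds` and the hardware figures (2 TB/s, 1 PFLOP/s) are illustrative round numbers, not the specification of any particular GPU.

```python
def decode_step_bounds(n_params, bytes_per_param, mem_bw_bytes_s,
                       peak_flops, batch_size=1):
    """Return (memory_time_s, compute_time_s) lower bounds for one decode step.

    Simplifying assumptions: weights are streamed from memory once per step,
    and the step costs about 2 FLOPs per parameter per sequence in the batch.
    """
    weight_bytes = n_params * bytes_per_param
    memory_time = weight_bytes / mem_bw_bytes_s   # time to stream the weights
    flops = 2 * n_params * batch_size             # multiply + add per weight
    compute_time = flops / peak_flops
    return memory_time, compute_time


# Hypothetical 70B-parameter model in FP16 on a hypothetical accelerator.
mem_t, cmp_t = decode_step_bounds(
    n_params=70e9, bytes_per_param=2,
    mem_bw_bytes_s=2e12, peak_flops=1e15,
)
print(f"memory-bound floor:  {mem_t * 1e3:.1f} ms per step")
print(f"compute-bound floor: {cmp_t * 1e3:.2f} ms per step")
print(f"batch size where the two meet: ~{mem_t / cmp_t:.0f}")
```

Under these assumptions the memory floor (about 70 ms) is hundreds of times larger than the compute floor (about 0.14 ms) at batch size 1, which is the gap that batching, quantization, and speculative decoding each attack from a different side.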

This month is the commercial heart of AI systems engineering. Frontier-lab inference economics are dominated by these techniques.

Deep-dive companions (read in tandem):

- Week 17 → the Attention & Transformer deep dive: attention math from first principles, RoPE complex-number derivation, KV-cache memory math with a worked Llama-3-70B example, full FlashAttention online-softmax derivation with inductive proof.
- Week 18 → the Inference Serving deep dive: cost-model derivation, PagedAttention block-pool algorithm, Orca scheduler pseudocode, vLLM architecture, chunked prefill, prefix caching, disaggregation.
- Week 19 → the Quantization deep dive: number-format derivations, AWQ identity proof with numerical example, GPTQ derived from Optimal Brain Surgeon with Cholesky efficiency, SmoothQuant α derivation, FP8 with delayed scaling, Marlin kernel.
- Week 20 → the Speculative Decoding & Disaggregation deep dive: speculative decoding rejection-sampling proof, speedup formula, geometric acceptance model, tree speculation, DistServe/Mooncake/Splitwise architectures, full production-stack composition.
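The KV-cache memory math mentioned for Week 17 can be previewed with a short sketch. It assumes the publicly documented Llama-3-70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache; the helper name `kv_cache_bytes_per_token` and the 8192-token context are choices made here for illustration, and the deep dive's own worked example may use different settings.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV-cache one token occupies across all layers.

    The leading factor 2 accounts for storing both a key and a value vector
    per KV head per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed Llama-3-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
context_len = 8192  # illustrative context length
per_sequence = per_token * context_len

print(f"{per_token / 2**10:.0f} KiB per token")
print(f"{per_sequence / 2**30:.1f} GiB per {context_len}-token sequence")
```

With these numbers each token costs 320 KiB and a single 8192-token sequence costs 2.5 GiB, so a handful of concurrent long requests can rival the memory left over after loading the weights. That pressure is what motivates block-level allocation of the cache.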

Worked investigation (hands-on, real GPU): Diagnose a GPU out-of-memory error with nvidia-smi: identify the four memory consumers, catch zombie processes, tell a leak from a plateau from a spike, and match the fix to the cause. See the Worked Examples section.
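As a starting point for that investigation, the sketch below lists which processes hold GPU memory by calling `nvidia-smi` with its CSV query flags and parsing the result. The helper names `parse_compute_apps` and `gpu_processes` are made up for this example, and the sample rows are hypothetical rather than captured from a real machine. It is one possible first step, not the worked example's actual procedure.

```python
import subprocess

# Per-process query; with "nounits" the memory column is a bare MiB figure.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse CSV rows into (pid, name, used_mib) tuples, largest user first."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        pid = int(parts[0])
        name = ",".join(parts[1:-1])  # tolerate commas inside the name
        try:
            used_mib = int(parts[-1])
        except ValueError:
            used_mib = None  # some platforms report "[N/A]"
        rows.append((pid, name, used_mib))
    return sorted(rows, key=lambda r: r[2] or 0, reverse=True)


def gpu_processes():
    """Run nvidia-smi and return the parsed per-process memory table."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(out.stdout)


# Hypothetical sample output, used here so the parser runs without a GPU.
sample = "4311, python, 1024\n4242, python, 30512\n"
for pid, name, used_mib in parse_compute_apps(sample):
    print(pid, name, used_mib)
```

On a machine with the NVIDIA driver installed, calling `gpu_processes()` returns the live table. A PID that still holds memory after its job was supposedly finished is a candidate zombie worth checking before changing any model or batch settings.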


Weeks
