
Month 5 - Inference Systems: KV-Cache, Paged Attention, Continuous Batching, Quantization, Speculative Decoding

Goal: by the end of Week 20 you can (a) explain why LLM inference is bandwidth-bound and how the KV-cache reshapes its memory and cost profile, (b) implement a paged KV-cache from scratch, (c) reason about quantization (INT8/INT4/FP8) tradeoffs and pick a scheme for a given workload, and (d) implement a basic speculative-decoding loop.
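To make goal (a) concrete, here is a back-of-envelope sketch of the KV-cache memory math and decode arithmetic intensity. The Llama-3-70B figures (80 layers, 8 KV heads under grouped-query attention, head dimension 128) are its published configuration, and the H100 peak numbers are approximate spec-sheet values; the full worked derivation is in the Week 17 deep dive.

```python
# Back-of-envelope KV-cache sizing, assuming Llama-3-70B's published config:
# 80 layers, 8 KV heads (grouped-query attention), head_dim 128, FP16 cache.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2

# K and V are each cached per layer, per KV head, per head dimension.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
print(f"KV cache per token: {kv_bytes_per_token / 2**10:.0f} KiB")  # ~320 KiB

context_len = 8192
per_request_gib = kv_bytes_per_token * context_len / 2**30
print(f"KV cache per 8k-token request: {per_request_gib:.1f} GiB")  # ~2.5 GiB

# Why decode is bandwidth-bound: each generated token re-reads every weight
# (~140 GB at FP16 for a 70B model) plus the KV cache, while doing roughly
# 2 FLOPs per parameter read -- about 1 FLOP per byte moved, far below the
# ~300 FLOP/byte an H100 can sustain (~989 TFLOPS FP16 / ~3.35 TB/s HBM).
weight_bytes = 70e9 * BYTES
flops_per_token = 2 * 70e9
print(f"Decode arithmetic intensity (batch=1): ~{flops_per_token / weight_bytes:.1f} FLOP/byte")
```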

This month is the commercial heart of AI systems engineering. Frontier-lab inference economics are dominated by these techniques.

Deep-dive companions (read in tandem):

- Week 17 → DEEP_DIVES/07_ATTENTION_TRANSFORMER.md - attention math from first principles, RoPE complex-number derivation, KV-cache memory math with a worked Llama-3-70B example, full FlashAttention online-softmax derivation with inductive proof.
- Week 18 → DEEP_DIVES/08_INFERENCE_SERVING.md - cost-model derivation, PagedAttention block-pool algorithm, Orca scheduler pseudocode, vLLM architecture, chunked prefill, prefix caching, disaggregation.
- Week 19 → DEEP_DIVES/09_QUANTIZATION.md - number-format derivations, AWQ identity proof with a numerical example, GPTQ derived from Optimal Brain Surgeon with Cholesky efficiency, SmoothQuant α derivation, FP8 with delayed scaling, the Marlin kernel.
- Week 20 → DEEP_DIVES/10_SPECULATIVE_DISAGGREGATION.md - speculative decoding rejection-sampling proof, speedup formula, geometric acceptance model, tree speculation, DistServe/Mooncake/Splitwise architectures, full production-stack composition (a minimal verification-loop sketch follows this list).
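As a preview of goal (d) and the Week 20 material, here is a minimal sketch of the verify-and-accept loop at the core of speculative decoding. The function name and array layout are illustrative, not taken from the deep dive: the caller is assumed to supply per-position draft and target distributions over a shared vocabulary. The rejection-sampling rule (accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the renormalized residual max(0, p - q)) is the rule whose correctness the Week 20 deep dive proves.

```python
import numpy as np

def speculative_verify(target_probs, draft_probs, draft_tokens, rng):
    """Accept or reject a block of draft tokens (illustrative toy interface).

    draft_tokens: k tokens proposed by the small draft model.
    draft_probs:  k rows, the draft distribution at each draft position.
    target_probs: k+1 rows, the target distribution at each draft position
                  plus one extra row for the position after the last draft.
    Returns the accepted prefix plus one corrected or bonus token, so every
    call emits at least one token sampled from the target distribution.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            out.append(tok)                      # accept with prob min(1, p/q)
        else:
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()           # renormalize max(0, p - q)
            out.append(int(rng.choice(len(residual), p=residual)))
            return out                           # stop at the first rejection
    # Every draft accepted: take a free "bonus" token from the target's
    # distribution at the next position.
    out.append(int(rng.choice(len(target_probs[-1]), p=target_probs[-1])))
    return out
```

Averaged over many calls, the emitted tokens are distributed exactly as the target model's, no matter how poor the draft is; only the expected number of accepted tokens per call changes, which is what the speedup formula and geometric acceptance model in the deep dive quantify.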


Weeks
