Week 20 - Speculative Decoding, Disaggregation, Inference Frontiers¶
20.1 Conceptual Core¶
- Speculative decoding (Leviathan et al., ICML 2023): use a small "draft" model to generate K candidate tokens, verify them with the large "target" model in a single parallel pass, and keep the longest prefix the target accepts. Gains ~2-3× tokens/sec when the draft model agrees often.
- Why it works: target-model verification is one prefill-like pass over K tokens, not K decode steps. Prefill is compute-bound and uses tensor cores efficiently; decode steps are memory-bound, so eliminating them is gold.
- Variants:
- Vanilla speculative: separate draft model.
- Self-speculative / Medusa: multiple decoding heads on the same model.
- EAGLE / EAGLE-2: train a lightweight autoregressive draft head on the target model's hidden features; EAGLE-2 adds dynamic draft trees.
- Lookahead decoding: no draft model; uses n-gram patterns from the model's own generation.
- Prefill/decode disaggregation (covered in week 18): scale out prefill workers and decode workers separately, with KV-cache transfer between them over fast networking (RDMA).
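The draft-then-verify loop above can be sketched end to end. This is a toy, greedy-only sketch: `draft_model` and `target_model` are hypothetical deterministic next-token functions standing in for real LLMs, and a real target model would score all K candidates in one forward pass (and, when sampling, accept probabilistically rather than by exact match):

```python
# Toy sketch of vanilla (greedy) speculative decoding. The "models" here are
# deterministic next-token functions, not real LLMs.

def speculative_step(draft, target, tokens, k):
    """One draft-then-verify iteration; returns (new_tokens, num_accepted)."""
    # 1) Draft k candidate tokens autoregressively with the small model.
    candidates = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft(ctx)
        candidates.append(t)
        ctx.append(t)
    # 2) Verify the candidates against the target's predictions. Simulated
    #    here as k greedy predictions over growing prefixes; a real target
    #    model scores all positions in a single forward pass.
    ctx = list(tokens)
    accepted = 0
    for cand in candidates:
        if target(ctx) != cand:
            break
        ctx.append(cand)
        accepted += 1
    # 3) The target's own prediction at the first mismatch (or after a full
    #    accept) yields one extra "free" token per iteration.
    correction = target(ctx)
    return ctx + [correction], accepted

# Toy models: target emits a fixed cycle; draft agrees except at one position.
target_model = lambda ctx: len(ctx) % 5
draft_model = lambda ctx: len(ctx) % 5 if len(ctx) % 5 != 3 else 9

tokens, accepted = speculative_step(draft_model, target_model, [0], k=4)
```

Here the toy draft disagrees with the target at every fifth position, so a single iteration accepts two drafted tokens plus the target's correction token.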
20.2 Mechanical Detail¶
- Speculative loop: draft K tokens, verify them with one target forward pass, accept the agreeing prefix, take one corrected token from the target, repeat.
- Acceptance rate depends on draft-target agreement. Llama-3-8B drafting Llama-3-70B: ~70-80% agreement on typical prompts. Expected gain grows with acceptance rate × draft length, with diminishing returns at large K.
- Disaggregation requires KV-cache transfer. Approaches: RDMA (Mooncake), shared-memory pools (ZSpread), distributed object stores. State-of-the-art papers from 2024-2025: Mooncake, DistServe, Splitwise.
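The acceptance-rate × draft-length intuition has a closed form in Leviathan et al.: if each drafted token is accepted i.i.d. with probability a, one draft-then-verify iteration of length K yields (1 − a^(K+1)) / (1 − a) expected tokens per target forward pass. A quick sketch, with a = 0.75 as an illustrative assumption rather than a measurement:

```python
# Expected tokens per target forward pass, assuming each drafted token is
# accepted i.i.d. with probability a (Leviathan et al., ICML 2023).
def expected_tokens(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

# Illustrative sweep at an assumed 75% per-token acceptance rate.
gains = {k: round(expected_tokens(0.75, k), 2) for k in (2, 4, 8)}
```

Note the diminishing returns: going from K=4 to K=8 adds less than going from K=2 to K=4, which is why the lab sweeps K rather than maximizing it.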
20.3 Lab: Speculative Decoding¶
- Pair a small draft model (~1B) with a larger target model (7-13B).
- Implement vanilla speculative decoding: draft-then-verify.
- Measure acceptance rate and tokens/sec gain vs. baseline single-model decoding.
- Sweep K (the draft length) and identify the sweet spot for your workload.
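For the K sweep, the analytical model from Leviathan et al. is a useful sanity check before measuring: speedup ≈ (1 − a^(K+1)) / ((1 − a)(Kc + 1)), where a is the per-token acceptance rate and c is the cost of one draft step relative to one target step. The a and c below are assumed placeholders for what your harness would measure:

```python
# Analytical K sweep (Leviathan et al.). a and c are assumptions standing in
# for measured values: 80% per-token acceptance, draft step = 5% of a target step.
def speedup(a: float, c: float, k: int) -> float:
    return (1 - a ** (k + 1)) / ((1 - a) * (k * c + 1))

a, c = 0.8, 0.05
best_k = max(range(1, 17), key=lambda k: speedup(a, c, k))
```

With these assumed numbers the curve peaks in the high single digits and falls off as wasted draft work dominates; the measured sweet spot depends on the real a and c for your model pair and workload.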
20.4 Idiomatic & Diagnostic Drill¶
- Speculative decoding's wins shrink under high concurrency: the target model's batch is already large enough to fill the tensor cores, so verification is no longer nearly free. Profile under varying concurrency; document the regime where speculation helps.
20.5 Production Slice¶
- The current 2026 inference frontier is disaggregation + speculation + paged caching + prefix sharing + quantization, all simultaneously. Document a hypothetical architecture combining all five, with the quantitative contribution of each. This is the design exercise for any senior inference role.
Month 5 Capstone Deliverable¶
An inference-systems/ directory:
1. decode-from-scratch/ (week 17): KV cache + FlashAttention.
2. mini-vllm/ (week 18): your mini paged scheduler vs real vLLM.
3. quantization-bench/ (week 19): AWQ / FP8 / BF16 tradeoff matrix.
4. speculative/ (week 20): speculative decoding harness with K sweep.
A LLM_SERVING.md documenting: the scheduling model, the cost-per-token calculation, and the roadmap to incorporate disaggregation.
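For the cost-per-token section of LLM_SERVING.md, the arithmetic is simple enough to pin down in one function. The GPU price and throughput below are assumed round numbers to show the calculation, not benchmarks:

```python
# Cost per million output tokens from GPU price and aggregate throughput.
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

# Assumed: a $2.50/hr GPU serving 2,000 tokens/sec across the whole batch.
price = cost_per_million_tokens(2.50, 2000)
```

Every optimization in the month-5 stack moves this number through tokens_per_sec; quantization and disaggregation can also move gpu_dollars_per_hour by changing how many GPUs the same traffic needs.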
Recommended Reading Done This Month¶
- PagedAttention / vLLM (SOSP 2023). Read twice.
- Orca (OSDI 2022).
- FlashAttention v1 → v2 → v3 papers. v3 (Hopper FP8) is the 2024 paper.
- AWQ (MLSys 2024) and GPTQ (ICLR 2023).
- Speculative Decoding (Leviathan et al., ICML 2023).
- Mooncake (FAST 2025), DistServe (OSDI 2024) for disaggregation state-of-the-art.