
Week 20 - Speculative Decoding, Disaggregation, Inference Frontiers

20.1 Conceptual Core

  • Speculative decoding (Leviathan et al., ICML 2023): use a small "draft" model to generate K candidate tokens; verify them with the large "target" model in parallel; accept the longest prefix the target agrees with. Gains ~2-3× tokens/sec when the draft model agrees often.
  • Why it works: verifying K draft tokens is a single prefill-style forward pass, not K decode steps. Prefill is compute-bound and uses the tensor cores efficiently; decode steps are memory-bound, so every one saved is gold (back-of-envelope sketch after this list).
  • Variants:
    • Vanilla speculative: a separate draft model.
    • Self-speculative: draft with a subset of the model's own layers; Medusa: multiple decoding heads on the same model.
    • EAGLE / EAGLE-2: train a lightweight autoregressive draft head on the target model's hidden features.
    • Lookahead decoding: no draft model; uses n-gram patterns from the model's own generation.
  • Prefill/decode disaggregation (covered in week 18): scale prefill workers and decode workers out independently, transferring the KV cache between them over fast networking (RDMA).
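  • Back-of-envelope sketch of the memory-bound decode ceiling (the GPU and model numbers below are illustrative assumptions: an H100-class part and a 70B BF16 model):

    # Each decode step must stream every weight from HBM once, so at batch 1
    # the ceiling is bandwidth / weight_bytes, regardless of available FLOPs.
    weight_bytes = 70e9 * 2        # 70B parameters in BF16
    hbm_bw = 3.35e12               # ~3.35 TB/s (H100 SXM HBM3, vendor spec)
    print(hbm_bw / weight_bytes)   # ~24 decode steps/sec upper bound
    # A verify pass over K draft tokens reads the weights once for all K
    # positions: roughly one decode step's cost, which is the entire win.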

20.2 Mechanical Detail

  • Speculative loop:
    while not done:
        drafts = draft_model.generate(K)                # K candidate tokens
        logits = target_model.forward(prefix + drafts)  # one parallel verify pass
        accepted = longest_prefix(argmax(logits) == drafts)
        emit(accepted)
        emit(target_token_after(accepted))  # "bonus" token from the verify
                                            # pass, valid even if none agree
    
  • Acceptance depends on draft-target agreement; Llama-3-8B drafting Llama-3-70B reaches ~70-80% on typical prompts. With per-token acceptance rate a and draft length K, the expected tokens per verify pass is (1 - a^(K+1)) / (1 - a) (Leviathan et al.), not simply a × K; see the worked example after this list.
  • Disaggregation requires KV-cache transfer. Approaches: RDMA (Mooncake), shared-memory pools (ZSpread), distributed object stores. State-of-the-art papers from 2024-2025: Mooncake, DistServe, Splitwise.
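  • A worked example of the expected-gain formula (the acceptance rate 0.75 is an illustrative assumption, not a measurement):

    # E[tokens per verify pass] = (1 - a**(K + 1)) / (1 - a), where a is the
    # per-token acceptance probability and K the draft length (Leviathan et al.)
    def expected_tokens(a: float, k: int) -> float:
        return (1 - a ** (k + 1)) / (1 - a)

    for k in (2, 4, 8):
        print(k, round(expected_tokens(0.75, k), 2))
    # K=2 -> 2.31, K=4 -> 3.05, K=8 -> 3.7: returns diminish quickly, which
    # is why the lab sweeps K instead of assuming bigger is better.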

20.3 Lab: Speculative Decoding

  1. Pair a small draft model (~1B) with a larger target model (7-13B).
  2. Implement vanilla speculative decoding: draft, then verify (a minimal sketch follows this list).
  3. Measure acceptance rate and tokens/sec gain against baseline single-model decoding.
  4. Tune K (the draft length): sweep it and identify the sweet spot for your workload.
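A minimal greedy draft-then-verify sketch for steps 1-2, assuming torch and HuggingFace transformers are installed; the GPT-2 pair below is a stand-in for the 1B / 7-13B pairing (any draft/target pair sharing a tokenizer works):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
    target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()

    @torch.no_grad()
    def speculative_greedy(prompt: str, K: int = 4, max_new: int = 64) -> str:
        ids = tok(prompt, return_tensors="pt").input_ids
        produced = 0
        while produced < max_new:
            T = ids.shape[1]
            # 1) Draft K candidate tokens autoregressively (cheap model).
            draft_ids = ids
            for _ in range(K):
                nxt = draft(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
                draft_ids = torch.cat([draft_ids, nxt], dim=-1)
            cand = draft_ids[:, T:]                    # (1, K) candidates
            # 2) One target pass over prefix+candidates verifies all K.
            tgt = target(draft_ids).logits.argmax(-1)  # target's greedy picks
            preds = tgt[:, T - 1 : T - 1 + K]          # aligned with cand
            # 3) Accept the longest agreeing prefix, plus the target's own
            #    next token ("bonus"), which is valid even if nothing agrees.
            agree = (preds == cand).long().cumprod(-1)
            n_acc = int(agree.sum())
            bonus = tgt[:, T - 1 + n_acc : T + n_acc]
            ids = torch.cat([ids, cand[:, :n_acc], bonus], dim=-1)
            produced += n_acc + 1
        return tok.decode(ids[0])

    print(speculative_greedy("The key idea of speculative decoding is"))

For clarity this recomputes the full prefix every iteration; a real harness would reuse the KV cache and also log acceptance rate and tokens/sec for steps 3-4.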

20.4 Idiomatic & Diagnostic Drill

  • Speculative decoding's wins disappear under high concurrency: the target model's batch is already large and saturates the tensor cores, leaving no idle compute for verification. Profile under varying concurrency and document the regime where speculation helps (sweep sketch below).
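  • A minimal concurrency-sweep sketch against an OpenAI-compatible endpoint such as a local vLLM server (the URL and model name are placeholder assumptions; point it at your own deployment):

    import asyncio, time
    from openai import AsyncOpenAI

    async def one_request(client: AsyncOpenAI) -> int:
        r = await client.completions.create(
            model="my-model",                 # placeholder: your served model
            prompt="Explain KV caching in one paragraph.",
            max_tokens=128)
        return r.usage.completion_tokens

    async def main() -> None:
        client = AsyncOpenAI(base_url="http://localhost:8000/v1",
                             api_key="unused")
        for c in (1, 4, 16, 64):
            t0 = time.perf_counter()
            toks = await asyncio.gather(*[one_request(client) for _ in range(c)])
            print(c, sum(toks) / (time.perf_counter() - t0))  # agg tokens/sec

    asyncio.run(main())
    # Run once with speculation enabled and once without; the crossover
    # concurrency where the gap closes is the number to document.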

20.5 Production Slice

  • The 2026 inference frontier is disaggregation + speculation + paged caching + prefix sharing + quantization, all running simultaneously. Document a hypothetical architecture combining all five, with the quantitative contribution of each (a composition sketch follows). This is the design exercise for any senior inference role.
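  • One way to structure the quantitative section: compose per-technique factors and show the cumulative product. Every factor below is a hypothetical placeholder to be replaced by your own measurements, not a claimed result:

    # All speedup factors are hypothetical placeholders, not measurements.
    factors = {
        "quantization (FP8 weights)": 1.6,   # placeholder
        "paged KV cache":             1.3,   # placeholder
        "prefix sharing":             1.4,   # placeholder, workload dependent
        "speculative decoding":       1.8,   # placeholder, low concurrency
        "P/D disaggregation":         1.2,   # placeholder, SLO dependent
    }
    total = 1.0
    for name, f in factors.items():
        total *= f
        print(f"{name:30s} x{f:.1f}  cumulative x{total:.2f}")
    # The gains are not independent (e.g., speculation and large batches
    # compete for the same tensor cores), so measure each factor with the
    # other four enabled, not in isolation.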

Month 5 Capstone Deliverable

An inference-systems/ directory:

  1. decode-from-scratch/ (week 17): KV cache + FlashAttention.
  2. mini-vllm/ (week 18): your mini paged scheduler vs real vLLM.
  3. quantization-bench/ (week 19): AWQ / FP8 / BF16 tradeoff matrix.
  4. speculative/ (week 20): speculative decoding harness with the K sweep.

An LLM_SERVING.md documenting: the scheduling model, the cost-per-token calculation (sketch below), and the roadmap to incorporate disaggregation.
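A sketch of the cost-per-token calculation the document should pin down; the dollar figure and throughput below are hypothetical placeholders:

    # Cost per token = GPU $/sec divided by measured aggregate tokens/sec.
    gpu_cost_per_hr = 4.00   # placeholder: one on-demand H100, $/hr
    tokens_per_sec = 2500    # placeholder: your measured aggregate decode rate
    cost_per_m_tok = gpu_cost_per_hr / (tokens_per_sec * 3600) * 1e6
    print(f"${cost_per_m_tok:.3f} per million output tokens")  # ~$0.444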


Month 5 Reading List

  • PagedAttention / vLLM (SOSP 2023). Read twice.
  • Orca (OSDI 2022).
  • FlashAttention v1 → v2 → v3 papers. v3 (Hopper FP8) is the 2024 paper.
  • AWQ (MLSys 2024) and GPTQ (ICLR 2023).
  • Speculative Decoding (Leviathan et al., ICML 2023).
  • Mooncake (FAST 2025), DistServe (OSDI 2024) for disaggregation state-of-the-art.
