Week 20 - Speculative Decoding, Disaggregation, Inference Frontiers¶
20.1 Conceptual Core¶
- Speculative decoding (Leviathan et al., ICML 2023): use a small "draft" model to generate K candidate tokens, verify them with the large "target" model in a single parallel pass, and keep the longest prefix the target accepts. Gains ~2-3× tokens/sec when the draft model agrees often.
- Why it works: target-model verification is one prefill-like pass over K tokens, not K decode steps. Prefill is compute-bound and uses tensor cores efficiently; decode steps are memory-bound, so eliminating them is gold.
- Variants:
- Vanilla speculative: separate draft model.
- Self-speculative / Medusa: multiple decoding heads on the same model.
- EAGLE / EAGLE-2: train a lightweight autoregressive draft head on the target model's hidden features; EAGLE-2 adds dynamic draft trees.
- Lookahead decoding: no draft model; uses n-gram patterns from the model's own generation.
- Prefill/decode disaggregation (covered in week 18): scale out prefill workers and decode workers separately, with KV-cache transfer between them over fast networking (RDMA).
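The draft-then-verify loop above can be sketched end to end. This is a toy, greedy-only sketch: `draft_model` and `target_model` are hypothetical deterministic next-token functions standing in for real LLMs, and a real target model would score all K candidates in one forward pass (and, when sampling, accept probabilistically rather than by exact match):

```python
# Toy sketch of vanilla (greedy) speculative decoding. The "models" here are
# deterministic next-token functions, not real LLMs.

def speculative_step(draft, target, tokens, k):
    """One draft-then-verify iteration; returns (new_tokens, num_accepted)."""
    # 1) Draft k candidate tokens autoregressively with the small model.
    candidates = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft(ctx)
        candidates.append(t)
        ctx.append(t)
    # 2) Verify the candidates against the target's predictions. Simulated
    #    here as k greedy predictions over growing prefixes; a real target
    #    model scores all positions in a single forward pass.
    ctx = list(tokens)
    accepted = 0
    for cand in candidates:
        if target(ctx) != cand:
            break
        ctx.append(cand)
        accepted += 1
    # 3) The target's own prediction at the first mismatch (or after a full
    #    accept) yields one extra "free" token per iteration.
    correction = target(ctx)
    return ctx + [correction], accepted

# Toy models: target emits a fixed cycle; draft agrees except at one position.
target_model = lambda ctx: len(ctx) % 5
draft_model = lambda ctx: len(ctx) % 5 if len(ctx) % 5 != 3 else 9

tokens, accepted = speculative_step(draft_model, target_model, [0], k=4)
```

Here the toy draft disagrees with the target at every fifth position, so a single iteration accepts two drafted tokens plus the target's correction token.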
20.2 Mechanical Detail¶
- Speculative loop: draft K tokens, verify them with one target forward pass, accept the agreeing prefix, take one corrected token from the target, repeat.
- Acceptance rate depends on draft-target agreement. Llama-3-8B drafting Llama-3-70B: ~70-80% agreement on typical prompts. Expected gain grows with acceptance rate × draft length, with diminishing returns at large K.
- Disaggregation requires KV-cache transfer. Approaches: RDMA (Mooncake), shared-memory pools (ZSpread), distributed object stores. State-of-the-art papers from 2024-2025: Mooncake, DistServe, Splitwise.
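The acceptance-rate × draft-length intuition has a closed form in Leviathan et al.: if each drafted token is accepted i.i.d. with probability a, one draft-then-verify iteration of length K yields (1 − a^(K+1)) / (1 − a) expected tokens per target forward pass. A quick sketch, with a = 0.75 as an illustrative assumption rather than a measurement:

```python
# Expected tokens per target forward pass, assuming each drafted token is
# accepted i.i.d. with probability a (Leviathan et al., ICML 2023).
def expected_tokens(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

# Illustrative sweep at an assumed 75% per-token acceptance rate.
gains = {k: round(expected_tokens(0.75, k), 2) for k in (2, 4, 8)}
```

Note the diminishing returns: going from K=4 to K=8 adds less than going from K=2 to K=4, which is why the lab sweeps K rather than maximizing it.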
20.3 Lab: Speculative Decoding¶
- Pair a small draft model (~1B) with a larger target model (7-13B).
- Implement vanilla speculative decoding: draft-then-verify.
- Measure acceptance rate and tokens/sec gain vs. baseline single-model decoding.
- Sweep K (the draft length) and identify the sweet spot for your workload.
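For the K sweep, the analytical model from Leviathan et al. is a useful sanity check before measuring: speedup ≈ (1 − a^(K+1)) / ((1 − a)(Kc + 1)), where a is the per-token acceptance rate and c is the cost of one draft step relative to one target step. The a and c below are assumed placeholders for what your harness would measure:

```python
# Analytical K sweep (Leviathan et al.). a and c are assumptions standing in
# for measured values: 80% per-token acceptance, draft step = 5% of a target step.
def speedup(a: float, c: float, k: int) -> float:
    return (1 - a ** (k + 1)) / ((1 - a) * (k * c + 1))

a, c = 0.8, 0.05
best_k = max(range(1, 17), key=lambda k: speedup(a, c, k))
```

With these assumed numbers the curve peaks in the high single digits and falls off as wasted draft work dominates; the measured sweet spot depends on the real a and c for your model pair and workload.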
20.4 Idiomatic & Diagnostic Drill¶
- Speculative decoding's wins shrink under high concurrency: the target model's batch is already large enough to fill the tensor cores, so verification is no longer nearly free. Profile under varying concurrency; document the regime where speculation helps.
20.5 Production Slice¶
- The current 2026 inference frontier is disaggregation + speculation + paged caching + prefix sharing + quantization, all simultaneously. Document a hypothetical architecture combining all five, with the quantitative contribution of each. This is the design exercise for any senior inference role.
Month 5 Capstone Deliverable¶
An inference-systems/ directory:
1. decode-from-scratch/ (week 17): KV cache + FlashAttention.
2. mini-vllm/ (week 18): your mini paged scheduler vs real vLLM.
3. quantization-bench/ (week 19): AWQ / FP8 / BF16 tradeoff matrix.
4. speculative/ (week 20): speculative decoding harness with K sweep.
A LLM_SERVING.md documenting: the scheduling model, the cost-per-token calculation, and the roadmap to incorporate disaggregation.
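For the cost-per-token section of LLM_SERVING.md, the arithmetic is simple enough to pin down in one function. The GPU price and throughput below are assumed round numbers to show the calculation, not benchmarks:

```python
# Cost per million output tokens from GPU price and aggregate throughput.
def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

# Assumed: a $2.50/hr GPU serving 2,000 tokens/sec across the whole batch.
price = cost_per_million_tokens(2.50, 2000)
```

Every optimization in the month-5 stack moves this number through tokens_per_sec; quantization and disaggregation can also move gpu_dollars_per_hour by changing how many GPUs the same traffic needs.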
Recommended Reading Done This Month¶
- PagedAttention / vLLM (SOSP 2023). Read twice.
- Orca (OSDI 2022).
- FlashAttention v1 → v2 → v3 papers. v3 (Hopper FP8) is the 2024 paper.
- AWQ (MLSys 2024) and GPTQ (ICLR 2023).
- Speculative Decoding (Leviathan et al., ICML 2023).
- Mooncake (FAST 2025), DistServe (OSDI 2024) for disaggregation state-of-the-art.