
Deep Dive 10-Speculative Decoding and Prefill/Decode Disaggregation

"Decode is sequential. Prefill is parallel. Treating them as one workload was always a compromise-the inference frontier of 2024–2026 is a stack of techniques that finally separates them."

This chapter is a self-contained reference on the two most consequential serving-side ideas of the last two years: speculative decoding, which trades a small amount of extra compute for a large reduction in serial steps, and prefill/decode disaggregation, which physically separates the two phases of LLM inference onto different worker pools. By the end you should be able to derive every formula on a whiteboard, sketch the architecture diagrams from memory, write pseudocode for the speculative loop and the disaggregated request flow, and reason about when these techniques help, when they hurt, and how they compose with the rest of a production inference stack.

The chapter assumes the reader is comfortable with transformer inference at the level of Deep Dives 1–9 of this curriculum: KV-cache, paged attention, continuous batching, chunked prefill, prefix caching, weight quantization. We build directly on those primitives.


1. The Latency Problem

Before we can argue for any new technique, we need to be precise about what is slow and why.

1.1 Two latencies that matter

For any chat-style LLM application, two user-visible latencies dominate the experience:

  • TTFT-Time To First Token. The wall-clock time from the user pressing Enter to the first streamed token appearing. This is dominated by prefill: the model must read all input tokens, populate the KV-cache, and emit the first output token.
  • TPOT-Time Per Output Token. Sometimes called inter-token latency or ITL. The wall-clock interval between consecutive streamed tokens after the first. This is dominated by decode: each output token requires one forward pass.

Total response latency for an output of length N is approximately:

total_latency ≈ TTFT + (N − 1) × TPOT

For a 500-token reply with TTFT = 300 ms and TPOT = 30 ms, total latency is roughly 300 + 499 × 30 ≈ 15.3 s. The decode phase contributes nearly 15 of those 15.3 seconds. Decode dominates total latency for any reply longer than a handful of tokens.

1.2 Why decode is sequential

Each decode step depends on the previous token: token t+1 is sampled from p(· | x_{1..t}), and computing that distribution requires the hidden states from token t. There is no way, at the level of a vanilla autoregressive model, to compute token t+2 before token t+1 is sampled. So decode is intrinsically a serial loop:

for i in 1..N:
    h_i = forward(model, token_{i-1}, kv_cache)
    token_i = sample(h_i)
    kv_cache.append(h_i)

Total decode time for N tokens at batch size 1 is N × T_decode_step.

1.3 Why decode is memory-bound

Each decode step performs a forward pass over the full model with input length 1. The arithmetic is:

  • For each linear layer of weight W ∈ R^{d_out × d_in}, the operation is y = W · x for x ∈ R^{d_in}, a single matrix-vector multiply.
  • FLOPs: 2 · d_in · d_out.
  • Memory traffic: read W (≈ d_in · d_out · bytes_per_param), read x, write y. Dominant term is reading W.
  • Arithmetic intensity ≈ 2 · d_in · d_out / (d_in · d_out · bytes_per_param) = 2 / bytes_per_param.

For FP16 weights (bytes_per_param = 2), arithmetic intensity is ≈ 1 FLOP/byte. An H100 has roughly 3 TB/s of HBM bandwidth and ~1000 TFLOPs of FP16 tensor-core throughput, giving a balance point (the "ridge" of the roofline) at ~330 FLOPs/byte. Decode at batch=1 is two orders of magnitude below the roofline ridge-it is severely memory-bound. The tensor cores are starved; we are reading weights faster than we can use them.

Now consider the same model running at batch size B. The matrix-vector multiply becomes a matrix-matrix multiply of shape (d_out × d_in) · (d_in × B). Weights are still read once; arithmetic scales with B. Arithmetic intensity becomes 2B / bytes_per_param. At B ≈ 256, FP16 decode finally crosses into compute-bound territory. This is the entire reason continuous batching exists.
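A minimal Python sketch of this roofline arithmetic, using the approximate peak numbers quoted above (not measured values); it accounts only for weight traffic, ignoring activations and KV-cache reads:

# Roofline back-of-envelope for decode, with the approximate H100 figures above.
PEAK_FLOPS = 1000e12          # ~1000 TFLOP/s FP16 tensor-core peak (approximate)
PEAK_BW = 3e12                # ~3 TB/s HBM bandwidth (approximate)
RIDGE = PEAK_FLOPS / PEAK_BW  # ~330 FLOPs/byte

def decode_intensity(batch: int, bytes_per_param: float = 2.0) -> float:
    """Arithmetic intensity (FLOPs per byte of weight traffic) of the decode GEMM."""
    return 2 * batch / bytes_per_param

for b in (1, 8, 64, 512):
    ai = decode_intensity(b)
    regime = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"batch={b:3d}: intensity ≈ {ai:5.1f} FLOP/byte -> {regime}")

# For FP16 the crossover batch is roughly the ridge itself (a few hundred);
# the B ≈ 256 quoted above is the same order of magnitude.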

1.4 Why prefill is compute-bound

Prefill processes a prompt of length L in a single forward pass. The same matmul becomes (d_out × d_in) · (d_in × L). Arithmetic intensity scales with L. For typical chat prompts (L ≈ 200 − 4000), prefill is firmly compute-bound; the GPU is well-utilized; tensor cores are saturated.

1.5 The optimization tension

Continuous batching solves decode throughput but does not help individual latency: with B = 64, the single-step decode time T_decode_step is roughly the same as at B = 1 (slightly higher as the batch pushes toward the compute-bound regime), and a request still pays N × T_decode_step for its N tokens. Per-request latency is bounded below by N × T_decode_step, a quantity batching does not touch.

That is the lever speculative decoding pulls on: it reduces the number of sequential target-model forward passes per output token, without changing the model. Disaggregation, complementarily, removes the second-order penalty that co-located batching imposes on TTFT and TPOT by tuning prefill and decode hardware independently.


2. Speculative Decoding-Setup and Core Algorithm

The technique was published concurrently in two 2023 papers:

  • Leviathan, Kalman, Matias-Fast Inference from Transformers via Speculative Decoding, ICML 2023.
  • Chen, Borgeaud, Irving, et al. (DeepMind)-Accelerating Large Language Model Decoding with Speculative Sampling, 2023.

Both arrive at the same algorithm with essentially the same correctness proof. We follow the Leviathan et al. notation.

2.1 The two-model setup

We have two language models over the same vocabulary V:

  • Target model M_q, the large model whose distribution q(· | context) we want to sample from. Call its single-step latency T_target.
  • Draft model M_p, a small model with distribution p(· | context). Single-step latency T_draft, with T_draft ≪ T_target. Typically T_draft / T_target ∈ [0.05, 0.2]; a 7B drafting a 70B sits around 0.1.

The two models share the tokenizer (this matters; cross-tokenizer speculation is a research topic but is messier).

2.2 The speculative step

One speculative step produces between 1 and K + 1 accepted tokens by the following procedure. Let x_{1..t} be the current generated context.

SPECULATIVE_STEP(x_{1..t}):

    # 1. Draft K tokens autoregressively with M_p.
    for i in 1..K:
        p_i = M_p(x_{1..t+i-1})              # distribution over V
        ~x_{t+i} ~ p_i                       # sample draft token
    # Now we have draft tokens ~x_{t+1}, ..., ~x_{t+K} and their probs p_1, ..., p_K.

    # 2. Verify all K positions in ONE forward pass of M_q.
    #    Feed the sequence x_{1..t}, ~x_{t+1}, ..., ~x_{t+K} as if it were prefill.
    #    Get back K+1 distributions q_1, ..., q_{K+1}.
    q_1, ..., q_{K+1} = M_q(x_{1..t}, ~x_{t+1..t+K})

    # 3. Accept-reject loop using rejection sampling.
    n = 0
    for i in 1..K:
        r ~ Uniform(0, 1)
        if r < min(1, q_i(~x_{t+i}) / p_i(~x_{t+i})):
            n += 1                           # accept ~x_{t+i}
        else:
            break                            # reject; stop here

    # 4. Sample one extra "free" token at position t+n+1.
    if n < K:
        # Rejection happened. Sample from corrected distribution.
        q_corrected = normalize(max(0, q_{n+1} − p_{n+1}))
        x_{t+n+1} ~ q_corrected
    else:
        # All K accepted. Sample a bonus token from q_{K+1} for free.
        x_{t+K+1} ~ q_{K+1}

    return n + 1 accepted tokens

Each speculative step costs one target forward pass plus K draft forward passes, and it produces a random number of accepted tokens between 1 and K + 1 (the +1 is the bonus token from q_{K+1} when all draft tokens were accepted, or the corrected sample when one is rejected).

2.3 Correctness-why accepted tokens are exactly distributed as target-only sampling

This is the keystone of the technique. Without this, speculative decoding would change the model's output distribution, which is unacceptable.

Claim. Each accepted token at position t+i has the marginal distribution q_i.

Proof sketch. Consider position t+i. Two cases:

  1. The draft proposes ~x ~ p_i, and we accept with probability min(1, q_i(~x) / p_i(~x)).
  2. If rejected, we sample from q_corrected = normalize(max(0, q_i − p_i)).

The total probability that we end up emitting any specific token y at this position is:

P(emit y) = P(draft proposed y) · P(accept y | drafted y)
          + P(reject)             · P(corrected sample = y)

Compute each piece:

  • P(drafted y) = p_i(y).
  • P(accept y | drafted y) = min(1, q_i(y) / p_i(y)), so the joint p_i(y) · min(1, q_i(y) / p_i(y)) = min(p_i(y), q_i(y)).
  • P(reject) = 1 − Σ_z min(p_i(z), q_i(z)). Using min(a, b) = a − max(0, a−b):
    Σ_z min(p_i(z), q_i(z)) = Σ_z p_i(z) − Σ_z max(0, p_i(z) − q_i(z)) = 1 − Σ_z max(0, p_i(z) − q_i(z))
    
    Equivalently P(reject) = Σ_z max(0, p_i(z) − q_i(z)) = Σ_z max(0, q_i(z) − p_i(z)) (the two are equal because Σ p = Σ q = 1, so Σ (p − q)_+ = Σ (q − p)_+).
  • P(corrected sample = y) = max(0, q_i(y) − p_i(y)) / Σ_z max(0, q_i(z) − p_i(z)).

Substituting:

P(emit y) = min(p_i(y), q_i(y))
          + [Σ_z max(0, q_i(z) − p_i(z))] · [max(0, q_i(y) − p_i(y)) / Σ_z max(0, q_i(z) − p_i(z))]
          = min(p_i(y), q_i(y)) + max(0, q_i(y) − p_i(y))
          = q_i(y)

The last equality uses min(a, b) + max(0, b − a) = b for a, b ≥ 0. So at every position the emission distribution is exactly q_i. The samples at different positions are not independent-but the marginal at every position matches the target-and the joint distribution over the accepted prefix can be shown to match the target's joint by a similar inductive argument. ∎

Why this is non-obvious. The cheap thing-just sampling from p and accepting with high probability when q ≈ p-would *not* be unbiased; it would shift the distribution toward p. The rejection-sampling correction (q − p clamped at zero and renormalized) is what makes the algorithm exact. The genius of the 2023 papers is in noticing that the correction is computable from the same target forward pass that you needed anyway.
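The identity min(p, q) + max(0, q − p) = q is also easy to check numerically. The following is a small Monte Carlo sketch (toy distributions, NumPy) that simulates the accept/reject rule at a single position and confirms the emitted marginal matches q; the distributions are illustrative, not taken from any model:

import numpy as np

rng = np.random.default_rng(0)

def speculative_emit(p, q):
    """Accept/reject rule at one position; returns the index of the emitted token."""
    x = rng.choice(len(p), p=p)                    # draft proposes x ~ p
    if rng.random() < min(1.0, q[x] / p[x]):       # accept with prob min(1, q/p)
        return x
    resid = np.maximum(q - p, 0.0)                 # corrected distribution (q - p)_+
    return rng.choice(len(q), p=resid / resid.sum())

p = np.array([0.70, 0.20, 0.05, 0.05])             # toy draft distribution
q = np.array([0.40, 0.30, 0.20, 0.10])             # toy target distribution

samples = [speculative_emit(p, q) for _ in range(200_000)]
freq = np.bincount(samples, minlength=len(q)) / len(samples)
print("empirical marginal:", np.round(freq, 3))    # ≈ [0.4, 0.3, 0.2, 0.1]
print("target q          :", q)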

2.4 Bonus token

When all K draft tokens are accepted, we already have q_{K+1} from the verification forward pass-the target's distribution at position t + K + 1 conditioned on the verified prefix. We sample one extra token from it for free. This is why the maximum yield per step is K + 1.


3. The Speedup Formula

Now we derive the wall-clock speedup, which is the whole point.

3.1 Setup

Let:

  • T_target = wall-clock time for one target forward pass at length 1 (a single decode step).
  • T_target,K = wall-clock time for one target forward pass on K tokens of new input. For small K (say K ≤ 16), T_target,K ≈ T_target because the target was already memory-bound at batch=1; processing K tokens uses idle compute and adds only marginal time. We approximate T_target,K ≈ T_target in the basic model and refine later.
  • T_draft = single-step draft forward time.
  • K = draft length.
  • α = expected number of accepted tokens per speculative step, where α ∈ [1, K+1]. (Convention: includes the +1 bonus on full acceptance.)

3.2 Tokens per wallclock time

Cost of one speculative step:

T_step = K · T_draft + T_target,K  ≈  K · T_draft + T_target

Tokens emitted per step: α (in expectation).

Tokens per second:

throughput_spec = α / (K · T_draft + T_target)

Baseline (no speculation, single decode step per token):

throughput_base = 1 / T_target

Speedup:

S = throughput_spec / throughput_base
  = α · T_target / (T_target + K · T_draft)
  = α / (1 + K · (T_draft / T_target))

This is the central formula. Memorize it.

3.3 Sanity checks

  • If T_draft → 0 (free draft): S → α. The draft costs nothing; we get exactly α tokens per target call. Best possible.
  • If T_draft → T_target (draft as expensive as target): S → α / (1 + K). We are paying K + 1 target-equivalent calls per α tokens. Almost always worse than baseline (since α ≤ K + 1).
  • If α → 1 (no draft tokens accepted): S → 1 / (1 + K · T_draft / T_target) < 1. Bad draft hurts you.

3.4 Choosing K

α depends on K (more attempts have diminishing returns), and so does T_target,K. Treating α(K) as concave in K and T_target,K ≈ T_target for small K, the optimum balances the increasing numerator against the linear cost in the denominator. In practice, sweep K ∈ {2, 4, 6, 8, 12, 16} for your model pair on representative workloads and pick the empirical maximum. Common production sweet spots are K = 4 to K = 8.

3.5 Worked example

Suppose α = 3.5 (typical for a well-matched draft/target pair), K = 8, T_draft = T_target / 10. Then:

S = 3.5 / (1 + 8 · 0.1)
  = 3.5 / 1.8
  ≈ 1.94×

Roughly 2×. This matches the 2–3× range cited as typical for speculative decoding.

If the draft is faster (T_draft = T_target / 20) and acceptance is similar:

S = 3.5 / (1 + 8 · 0.05) = 3.5 / 1.4 ≈ 2.5×

If we have a great draft (α = 5.0, e.g., from EAGLE-style feature speculation) and T_draft / T_target = 0.1:

S = 5.0 / 1.8 ≈ 2.78×

The published claims of 2–3× across many papers are not coincidence; they fall out of the formula given realistic parameter values.
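A few lines of Python package the formula and reproduce the worked examples above; this is a sketch of the low-batch model only, where T_target,K ≈ T_target:

def speculative_speedup(alpha: float, K: int, draft_ratio: float) -> float:
    """S = alpha / (1 + K * T_draft / T_target); assumes T_target,K ≈ T_target."""
    return alpha / (1 + K * draft_ratio)

print(speculative_speedup(3.5, 8, 0.10))   # ≈ 1.94
print(speculative_speedup(3.5, 8, 0.05))   # ≈ 2.50
print(speculative_speedup(5.0, 8, 0.10))   # ≈ 2.78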


4. Why Speculation Works in Compute Terms

The throughput formula tells us that it works; the roofline tells us why.

At batch=1 decode, the GPU is memory-bound: tensor cores are running at perhaps 1–3% of peak. The weights are streaming through HBM, doing one matvec per pass, and the FLOP units sit idle.

When we feed the target model K + 1 candidate tokens (the prompt context plus the K draft tokens) in a single forward pass, the matmul shape becomes (d_out × d_in) · (d_in × (K+1)). Arithmetic intensity rises by a factor of K + 1. For K = 8, we are doing 9× the FLOPs while reading the weights only once. Up to roughly the roofline ridge-at FP16, somewhere around K ≈ 64–128 for a single-request decode on H100-this extra FLOP cost is essentially free. It happens in the slack time between memory reads.

This is why the claim T_target,K ≈ T_target for small K is a good approximation. The forward pass still pays the same memory bill (read all weights), and the marginal compute cost is small until K exceeds the arithmetic-intensity ridge.

The fundamental trade made by speculative decoding: we use the GPU's idle FLOPs (which we were paying for anyway) to convert sequential target steps into parallel verification of speculative branches. The currency is GPU compute we weren't using; the payoff is reduced wall-clock latency.


5. Acceptance Rate α-Where It Comes From

α is the empirical quantity that determines whether speculation pays off. It depends on three things:

  1. Per-token agreement probability between draft and target. For each drafted token, the rejection-sampling acceptance probability is min(1, q(y) / p(y)) averaged over draft samples, which equals Σ_z min(p(z), q(z)). If p ≈ q this is close to 1; if p is wildly different this is close to 0.
  2. Length K: longer drafts give more chances but the geometric drop-off from rejections eventually dominates.
  3. Workload dependence: easy text (boilerplate code, formulaic responses) accepts more readily than hard text (novel reasoning, surprising vocabulary).

5.1 Geometric model

If we assume each token is accepted independently with probability β, then the number of accepted draft tokens is a truncated geometric:

P(n accepted) = β^n · (1 − β)        for 0 ≤ n < K
P(K accepted) = β^K

Expected accepted draft tokens:

E[n] = β · (1 − β^K) / (1 − β)

Plus the bonus token (+1) on every step (corrected sample on rejection, or q_{K+1} sample on full acceptance):

α = E[n] + 1 = (1 − β^{K+1}) / (1 − β)

5.2 Worked numbers

The independence assumption is optimistic but reasonable for back-of-envelope work.

  • β = 0.7, K = 4: α = (1 − 0.7^5) / 0.3 = (1 − 0.168) / 0.3 ≈ 2.77
  • β = 0.7, K = 8: α = (1 − 0.7^9) / 0.3 ≈ (1 − 0.040) / 0.3 ≈ 3.20
  • β = 0.8, K = 8: α = (1 − 0.8^9) / 0.2 ≈ (1 − 0.134) / 0.2 ≈ 4.33
  • β = 0.6, K = 8: α = (1 − 0.6^9) / 0.4 ≈ (1 − 0.010) / 0.4 ≈ 2.47
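A short helper reproduces these values; it assumes the same independence model, which, as noted, is optimistic:

def expected_accepted(beta: float, K: int) -> float:
    """alpha = (1 - beta^(K+1)) / (1 - beta) under independent per-token acceptance."""
    return K + 1 if beta == 1.0 else (1 - beta ** (K + 1)) / (1 - beta)

for beta, K in [(0.7, 4), (0.7, 8), (0.8, 8), (0.6, 8)]:
    print(f"beta={beta}, K={K}: alpha ≈ {expected_accepted(beta, K):.2f}")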

Published work on Llama-3-8B drafting Llama-3-70B reports per-token acceptance in the 0.6–0.8 range under typical chat workloads, giving accepted lengths of roughly 3–5 with K = 8. These are approximate ranges from public benchmarks; exact numbers vary by workload and are not promises.

5.3 What kills α

  • Tokenizer mismatch. Even small differences (added special tokens, different BPE merges) catastrophically reduce agreement. Always use the same tokenizer for draft and target.
  • Sampling temperature. At T = 0 (greedy), agreement is brittle; one disagreement and the prefix diverges. At higher temperatures both distributions are smoother and min(p, q) mass increases.
  • Out-of-distribution context. If the draft was distilled on a narrow domain and the target is asked something else, the draft's predictions drift.

6. Variants

6.1 Vanilla speculative

Separate draft model. Simplest. Examples in production: Llama-3-8B drafting Llama-3-70B, or a custom 1B distilled draft drafting a 70B+ target.

Trade-offs. Two model copies live in GPU memory. The draft must be served on the same hardware (or close to it) to avoid network latency in the inner loop. Pipeline complexity rises: two sets of weights, two KV-caches, two CUDA streams.

6.2 Self-speculative

The draft is a part of the target model rather than a separate model.

  • Layer-skip self-speculative. Run only the early layers of the target as the draft (e.g., first 8 layers of a 32-layer model). The draft "head" is the target's own LM head. Cheap because no extra weights, but α is usually lower because the early-layer representation lacks the depth to predict tokens accurately.
  • Distilled head. Train an additional lightweight head on the target's hidden states to predict the next token. Slightly more weights, often higher α.

6.3 Medusa (Cai et al., 2024)

Add M extra "Medusa heads" on top of the target's last hidden state. Each head is a small MLP that predicts the token at offset +1, +2, ..., +M from the current position, in parallel.

h_t = target_last_hidden(x_{1..t})
prediction_at_offset_j = MedusaHead_j(h_t)        for j = 1..M

So one forward pass of the target produces M candidate tokens at positions t+1, ..., t+M. To verify, the target processes the M-token candidate continuation as if it were prefill (the same trick as vanilla speculation), with tree-attention to handle multiple branches per position (see Section 7).

Why it's clever. No separate draft model. The Medusa heads add modest parameter count (a few percent of the target). Training fine-tunes only the heads while freezing the backbone, or fine-tunes both with a multi-objective loss.

Limitations. Each head predicts in isolation given h_t; later positions are increasingly hard to predict from h_t alone (no chain of conditioning), so per-offset acceptance falls off rapidly. Mitigated by sampling multiple candidates per offset and using tree-attention.
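For concreteness, a minimal PyTorch sketch of what a Medusa-style head can look like; the residual-MLP shape, the dimensions, and the class name are illustrative assumptions, not the reference implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MedusaStyleHead(nn.Module):
    """One extra head: a residual MLP over the target's last hidden state, plus its
    own vocabulary projection, predicting the token at a fixed offset ahead."""
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        return self.lm_head(h_t + F.silu(self.proj(h_t)))   # logits for this offset

# M = 3 heads predicting offsets +1, +2, +3 from the current position (toy sizes).
heads = nn.ModuleList([MedusaStyleHead(hidden=4096, vocab=32000) for _ in range(3)])
h_t = torch.randn(1, 4096)                                   # target's last hidden state
candidates = [head(h_t).topk(2, dim=-1).indices for head in heads]  # 2 candidates per offset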

6.4 EAGLE / EAGLE-2 (Li et al., 2024)

Train a small autoregressive model that operates on the target's hidden states, not on tokens. The auxiliary model takes the target's last-layer hidden states h_{1..t} as input and predicts the next hidden state ~h_{t+1}, then the next, etc. Each predicted hidden state is decoded to a token via the target's LM head.

The intuition: predicting the next hidden state gives the auxiliary model access to a much richer signal than just the next token, dramatically improving α. The auxiliary model is small (typically a single transformer block plus a regression head).

EAGLE-2 adds dynamic tree expansion (branch where uncertain, prune where confident), pushing acceptance lengths higher.

As of 2026, EAGLE/EAGLE-2 is the de facto state of the art for self-speculative decoding on open-weight models. Reported accepted lengths sit in the 4–6 range with K = 8 on standard chat benchmarks, but specific numbers vary by setup and should be verified.

6.5 Lookahead decoding (Fu et al., 2024)

No draft model at all. Instead, exploit n-gram patterns from the target's own previous generations (or training data) to propose continuation candidates, then verify with a single target forward pass using a Jacobi-iteration-style parallel proposal.

Use case. When you can't ship a draft model (memory, deployment, latency constraints) but want some of the speedup. Gains are smaller (typically 1.3–1.8×, range approximate) but free in deployment terms.


7. Tree-Based Speculation

So far we've described linear speculation: a chain ~x_{t+1}, ~x_{t+2}, ..., ~x_{t+K}. One rejection terminates the prefix. With per-token acceptance β = 0.7 and K = 8, we get α ≈ 3.2.

Tree speculation generalizes to a tree of candidate continuations rooted at position t:

                t
               /|\
            a   b   c        # candidates at t+1
           /|   |\   \
         a1 a2 b1 b2 c1      # candidates at t+2 conditioned on parent

Each branch is a possible continuation. The target verifies the entire tree in one forward pass, using a custom tree attention mask so each node attends only to its ancestors.

TREE_VERIFY(tree):
    flatten tree to a sequence of nodes
    construct attention mask:
        node i attends to node j  iff  j is an ancestor of i in the tree
    one target forward pass on the flattened sequence with this mask
    for each root-to-leaf path:
        run the rejection-sampling accept/reject loop
    accept the longest accepted prefix across all paths
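A minimal sketch of the ancestor-only mask construction, assuming the flattened tree is described by a parent-pointer array (a common but not universal representation); attention to the committed prefix x_{1..t} is handled by the ordinary causal mask and is not shown:

import numpy as np

def tree_attention_mask(parent: list[int]) -> np.ndarray:
    """parent[i] = index of node i's parent in the flattened draft tree, or -1 if
    node i hangs directly off the committed position t. mask[i, j] is True iff
    node i may attend to node j, i.e. j == i or j is an ancestor of i."""
    n = len(parent)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parent[j]
    return mask

# The tree drawn above: a, b, c at depth 1; a1, a2, b1, b2, c1 at depth 2.
#        index:   0   1   2   3  4  5  6  7
parents =       [-1, -1, -1,  0, 0, 1, 1, 2]
print(tree_attention_mask(parents).astype(int))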

7.1 Why it helps

Multiple candidates at each depth give the target multiple chances to accept something. The expected accepted depth is higher than for a linear draft of the same total node count, because divergence in one branch doesn't kill the others.

7.2 Cost

Tree attention costs a forward pass on |tree| tokens with a custom mask. Modern attention kernels (FlashAttention with a tree mask, or specialized kernels) handle this efficiently; the overhead vs. linear verify is modest if |tree| is comparable to K.

7.3 Branching policy

How wide should the tree be? Branching factor b at each depth gives b^d leaves at depth d. Naively this explodes. Practical implementations use dynamic tree expansion: branch wider where the draft is uncertain (high entropy), prune where confident. EAGLE-2 popularized this.

A typical production tree has 25–60 nodes total, with deeper nodes thinner than shallower ones-e.g., (3, 3, 2, 2, 1, 1) branching across depths.


8. The Speculative-Batching Tension

Speculative decoding wins big at batch=1. Continuous batching wins big at batch≫1. They fight each other.

8.1 The trade

At batch=1: decode is memory-bound. Verifying K + 1 candidates uses idle compute, no extra wall-clock cost. Net win: ~2×.

At batch=64 (continuous batching at scale): decode is at or near compute-bound. The tensor cores are busy emitting one token per request per step. Now verifying K + 1 candidates per request multiplies the compute load by roughly K, and the per-step time grows roughly linearly in K-yet most of those extra-FLOP tokens are rejected. We pay roughly (K + 1)× the per-request compute but emit only α tokens per step instead of one.

The throughput formula at high batch (where T_target,K ≈ K · T_target because we are now compute-bound) becomes:

S_high_batch ≈ α / (K + K · T_draft / T_target) = α / (K · (1 + T_draft / T_target))

For α = 3.5, K = 8, T_draft / T_target = 0.1:

S_high_batch ≈ 3.5 / (8 · 1.1) ≈ 0.40

Speculation actively hurts throughput at high batch. This is not a small effect; it's a 2–3× slowdown vs. plain continuous batching.

8.2 The production pattern

The reconciliation: enable speculation adaptively.

ADAPTIVE_SPECULATION_POLICY(batch_state):
    if current_batch_size <= LOW_BATCH_THRESHOLD:
        use_speculation = True
    elif current_batch_size >= HIGH_BATCH_THRESHOLD:
        use_speculation = False
    else:
        use_speculation = (priority == HIGH)   # for low-latency requests only

    return use_speculation

Typical thresholds: speculation on at batch ≤ 8, off at batch ≥ 32, with a high-priority override in the middle band.

This is the kind of policy decision that gets made at the scheduler level, not at the model level. It's also why speculative decoding is sometimes called "a single-user technique"-in pure-throughput regimes (training, batch inference at scale) it doesn't pay.

8.3 An important nuance

Speculation can still win at moderate batch if the target has spare compute headroom-for example, if the GPU is bandwidth-limited by quantized weights (W4 weights at batch ≤ 16 are still memory-bound on H100). The crossover batch size is workload- and hardware-specific. The right answer is to measure and program the scheduler accordingly.


9. Engineering Speculative Decoding

The algorithm is short. The implementation has corners.

9.1 Two KV-caches

Both draft and target maintain their own KV-cache. They must stay in lockstep with the accepted prefix, not the proposed prefix.

        accepted prefix    last accepted token    proposed (not yet accepted)
  target KV: [............................ T ]
  draft  KV: [............................ T ][~x_{t+1} ~x_{t+2} ... ~x_{t+K}]

When verification finishes:

  • If n tokens accepted (n ≤ K), the draft's KV-cache for positions t+n+1 .. t+K must be rolled back (dropped). The draft re-drafts from position t+n+1 next step.
  • The target's KV-cache must be extended with positions t+1 .. t+n+1 (the accepted prefix, including the bonus or corrected token).

For paged-attention KV-caches, "rolling back" the draft is a matter of returning pages to the free pool (cheap). For contiguous KV-caches, rolling back means truncating the cache pointers.
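A toy sketch of that page bookkeeping, with a fixed page size and a shared free pool; the class and method names are illustrative, not any particular framework's API:

PAGE_SIZE = 16  # tokens per KV page (illustrative)

class PagedKVCache:
    """Tracks which pages back one sequence; rollback just returns pages to the pool."""
    def __init__(self, free_pool: list[int]):
        self.free_pool = free_pool
        self.pages: list[int] = []
        self.length = 0                                   # tokens currently held

    def append(self, n_tokens: int) -> None:
        self.length += n_tokens
        while len(self.pages) * PAGE_SIZE < self.length:
            self.pages.append(self.free_pool.pop())       # grab pages as needed

    def rollback_to(self, new_length: int) -> None:
        self.length = new_length
        needed = -(-new_length // PAGE_SIZE)              # ceil division
        while len(self.pages) > needed:
            self.free_pool.append(self.pages.pop())       # cheap: return pages

pool = list(range(100))
draft_kv = PagedKVCache(pool)
draft_kv.append(40)        # prompt + K drafted tokens
draft_kv.rollback_to(30)   # rejected draft positions dropped; one page returns to the pool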

9.2 The verification forward pass

The target verifies K candidates by running a forward pass on K new tokens given its existing cache. This is identical to a chunked-prefill of length K. Existing prefill kernels handle it directly; no new attention kernel needed (unless using tree speculation, which needs a custom mask).

9.3 Synchronization

Naive implementation runs draft and target serially:

1. draft K steps
2. target verify
3. accept/reject
4. update both caches
5. goto 1

At step 1, the target is idle. At step 2, the draft is idle. On a single GPU, idle silicon is wasted money.

Optimization. Pipeline the draft for step t+1 during the target's verification of step t. The draft speculates from the not-yet-verified tokens; if the target ends up rejecting some, the draft has to roll back and redo that work, but in expectation the overlap reduces idle time.

This is the same kind of speculative-on-speculation idea as branch prediction in CPUs. Implementations vary; reference open-source implementations include the speculative path in vLLM, TensorRT-LLM, and SGLang.

9.4 Batching speculation

Within a batch, different requests will have different acceptance lengths per step. After a step, request A has 4 new tokens, request B has 1, request C has 5. The next iteration's batch is jagged. The scheduler must handle this. Continuous batching frameworks generally do; the bookkeeping is per-request KV-cache offsets and per-request next-token positions.

9.5 Pseudocode for the speculative loop

SPECULATIVE_GENERATE(prompt, max_tokens, K):
    target_kv = prefill(M_target, prompt)
    draft_kv  = prefill(M_draft, prompt)
    output = []

    while len(output) < max_tokens:
        # 1. Draft K tokens
        draft_tokens = []
        draft_probs  = []
        ctx = output[-1] if output else last_prompt_token
        for i in 1..K:
            p = M_draft.step(ctx, draft_kv)
            t = sample(p)
            draft_tokens.append(t)
            draft_probs.append(p)
            ctx = t

        # 2. Target verify in one pass. The forward covers the last committed
        #    token plus the K draft tokens, so it returns K+1 distributions
        #    q_1, ..., q_{K+1} (q_1 verifies the first draft position).
        last_committed = output[-1] if output else last_prompt_token
        q_dists = M_target.forward([last_committed] + draft_tokens, target_kv)

        # 3. Accept-reject
        n_accepted = 0
        for i in 1..K:
            r = uniform(0, 1)
            if r < min(1, q_dists[i][draft_tokens[i]] / draft_probs[i][draft_tokens[i]]):
                n_accepted += 1
                output.append(draft_tokens[i])
            else:
                # Corrected sample
                q_corr = normalize(max(0, q_dists[i] - draft_probs[i]))
                output.append(sample(q_corr))
                break

        if n_accepted == K:
            # Bonus token
            output.append(sample(q_dists[K+1]))

        # 4. KV-cache hygiene
        target_kv.commit(n_accepted + 1)              # accepted + corrected/bonus
        draft_kv.rollback_to(target_kv.length)        # discard rejected draft positions
        draft_kv.append_token(output[-1])             # so next draft step starts from accepted token

    return output

10. Disaggregated Inference-The Motivation

The second frontier idea: stop running prefill and decode on the same workers.

10.1 The problem with co-location

A co-located worker handles a stream of mixed requests. At any moment its scheduler chooses which requests to execute and in which phase:

  • Some requests are in prefill (compute-bound).
  • Some are in decode (memory-bound).
  • The scheduler wants to batch.

The optimal batching policy for prefill differs from the optimal policy for decode:

  • Prefill is compute-bound for any non-trivial prompt. Larger batches do not help much (we are already at the roofline ridge from the long sequence dimension), and they hurt latency by stretching T_prefill. Prefill prefers small batches-sometimes batch = 1-for low TTFT.
  • Decode is memory-bound at batch=1 and compute-bound around batch ≈ 256. Decode wants the largest batch the GPU memory allows, for throughput.

A co-located worker must serve both. Common compromises:

  1. Prefill-then-decode flushing. At each scheduler tick, run pending prefill, then a decode tick. Decode requests wait for prefill to finish; prefill requests wait for decode to flush. Latency for both phases suffers.
  2. Chunked prefill. Slice prefill into chunks of C tokens and interleave with decode steps in the same forward pass. Smooths the latency, but a chunk of prefill in the same forward pass as decode costs the decode requests time (because the forward pass runs at the longer sequence length).
  3. SLO violations under load. As load rises, the queue mixes more aggressively; both TTFT and TPOT degrade simultaneously. There is no separate knob to tune them independently.

10.2 The disaggregation insight

If we physically separate prefill workers and decode workers, each pool can be:

  • Sized independently (e.g., 1 prefill worker per 4 decode workers, depending on workload mix).
  • Tuned independently (different batch sizes, different scheduler policies).
  • Even on different hardware (prefill on H100s for compute, decode on cheaper GPUs with high HBM bandwidth).

The cost: a request must move between workers, which means transferring its KV-cache from the prefill worker to the decode worker.


11. DistServe (Zhong et al., OSDI 2024)

DistServe is the canonical reference design for disaggregated LLM serving.

11.1 Architecture

                  ┌───────────────┐
   request ─────► │  Global       │
                  │  Scheduler    │
                  └───┬───────┬───┘
                      │       │
                      ▼       ▼
            ┌────────────┐  ┌────────────┐
            │  Prefill   │  │  Prefill   │  ...   prefill pool
            │  Worker 1  │  │  Worker 2  │
            └─────┬──────┘  └─────┬──────┘
                  │ KV-cache      │
                  │ over RDMA     │
                  ▼               ▼
            ┌────────────┐  ┌────────────┐
            │  Decode    │  │  Decode    │  ...   decode pool
            │  Worker 1  │  │  Worker 2  │
            └─────┬──────┘  └─────┬──────┘
                  │               │
                  └──────► token stream to user

The request flow:

1. Request arrives at scheduler.
2. Scheduler picks a prefill worker (load balancing).
3. Prefill worker: prefill, populate KV-cache.
4. Scheduler picks a decode worker (load balancing on KV memory pressure).
5. KV-cache transferred from prefill worker to decode worker (RDMA).
6. Decode worker: continuous-batch decode until completion.
7. Tokens streamed to client during decode.

11.2 Why each pool can be tuned independently

  • Prefill pool. Optimize for low TTFT under SLO. Use small batches (often 1), maybe with chunked prefill for very long prompts. Configure for fast tensor cores. Prefill workers do not need huge KV memory.
  • Decode pool. Optimize for high decode throughput. Use the largest continuous batch the GPU memory allows. Configure for HBM bandwidth. Decode workers do need huge KV memory and benefit from W4-quantized weights.

11.3 KV-cache transfer

This is the new cost. Per-request KV-cache size (FP16, no quantization) for a model with L layers, H heads, head dim d, and S sequence length:

KV_size_bytes = 2 (K and V) · L · H · d · S · 2 bytes (FP16)
              = 4 · L · H · d · S

For Llama-3-70B (L=80, H_kv=8 after GQA, d=128) at S=8192:

KV_size = 4 · 80 · 8 · 128 · 8192 ≈ 2.7 GB    (FP16, with GQA)

For multi-head attention without GQA the same model would be ~22 GB. Most modern large models use GQA, putting per-request KV in the few-GB range.

Transfer time over RDMA (NVLink between GPUs ≈ 600 GB/s, InfiniBand HDR/NDR ≈ 25–50 GB/s per port, GH200 NVLink-C2C even higher): for ~3 GB at 50 GB/s, transfer takes ~60 ms. At 600 GB/s (NVLink), ~5 ms.

Crucially, the transfer is overlap-able with the first decode step on the receiving worker. If the prefill worker streams the KV-cache layer-by-layer, the decode worker can begin processing as soon as the first layer arrives. End-to-end transfer cost can be made sub-step-time with careful scheduling.

11.4 Reported gains

The DistServe paper reports several-fold (roughly 4–7×) increases in goodput-the request rate that can be served while meeting both TTFT and TPOT SLOs-compared to co-located baselines. These are the paper's reported numbers; actual gains depend heavily on workload and hardware. The mechanism is straightforward: by separating the two phases, the scheduler is no longer forced into compromises that hurt both metrics.

11.5 Load balancing across pools

The global scheduler decides:

  • Which prefill worker gets the next request? Pick the one with the shortest prefill queue and enough free KV memory for the prompt's KV.
  • Which decode worker gets the request after prefill? Pick the one with the most free KV memory (decode pressure on memory; not on compute, which is shared across the batch).
  • How to size the pools? The ratio depends on workload. For chat (short prompts, long replies), more decode workers. For RAG / long-context summarization (long prompts, short replies), more prefill workers. The DistServe paper proposes a search/profiling procedure to size the pools.
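A minimal sketch of the first two routing rules above; the worker fields are illustrative assumptions:

def pick_prefill_worker(prefill_workers, prompt_kv_bytes):
    """Shortest prefill queue among workers with enough free KV memory for the prompt."""
    eligible = [w for w in prefill_workers if w.free_kv_bytes >= prompt_kv_bytes]
    return min(eligible, key=lambda w: w.queued_prefill_tokens)

def pick_decode_worker(decode_workers):
    """Decode pressure is KV memory, not compute: pick the most free KV memory."""
    return max(decode_workers, key=lambda w: w.free_kv_bytes)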

12. Splitwise (Patel et al., 2024)

Microsoft's variant on the same idea, with a sharper focus on heterogeneous hardware.

12.1 Key idea

  • Prefill is compute-bound → use the most compute-dense GPUs (H100, MI300X).
  • Decode is bandwidth-bound → use the GPUs with the best HBM bandwidth per dollar (A100, sometimes older accelerators that are cheaper but still bandwidth-rich).

A Splitwise-style cluster runs different GPU SKUs in different pools. You buy fewer expensive GPUs for prefill and more cheaper GPUs for decode. The economics improve substantially for workloads with imbalanced phase costs.

12.2 Trade-offs

Heterogeneity adds operational complexity (different drivers, different profiling, different failure modes), and KV-cache transfer between heterogeneous nodes is more constrained (PCIe rather than NVLink in some configs). Splitwise's published benchmarks demonstrate the cost-performance frontier; the actual deployment ratios are workload-specific.


13. Mooncake (Qin et al., 2024)

Moonshot AI's serving architecture for their Kimi chat product. Published in 2024 with full architectural detail.

13.1 KVCache-centric design

Mooncake's organizing principle: the KV-cache is the central data structure of the serving system, not the GPU worker.

The design:

  • A distributed KV-cache pool spans CPU memory across the cluster (and SSDs as a backing store). Total capacity is far larger than what fits on the GPUs alone.
  • GPUs hold only the working set of KV-cache they need right now. Other entries live in CPU memory or SSD.
  • A scheduler routes requests to whichever GPU has the relevant prefix already hot-or, if none, to the GPU that can most cheaply load the prefix from the pool.

13.2 Why this matters for chat

Chat workloads have enormous prefix overlap:

  • System prompts repeat across requests.
  • Multi-turn conversations share their entire history.
  • Tools / RAG contexts are reused.

A 32K-token system prompt prefilled fresh every request is pure waste; with prefix caching its KV is already computed and we just have to find it and use it. Prefix cache hit rates on production chat traffic are commonly cited in the 50–80% range (range approximate; depends entirely on workload).

Mooncake's distributed pool maximizes the chance of a hit anywhere in the cluster.

13.3 Disaggregation in Mooncake

Mooncake is also disaggregated (prefill / decode separation), and the two ideas compose: prefill workers consult the global cache before doing any work; if the prefix is hit, they may skip prefill entirely and just hand the cached KV to a decode worker.

13.4 Reported gains

The Mooncake paper documents substantially higher GPU utilization and request throughput than co-located baselines on Moonshot's production traffic. Specific numbers in the paper depend on their workload mix; treat as approximate. The architectural lesson is robust: at production scale, the global KV-cache is the system.


14. KV-Cache Transfer in Detail

The cost that disaggregation pays. Worth understanding precisely.

14.1 Sizing

Per-request KV-cache (FP16, GQA, L layers, H_kv KV heads, d head dim, S sequence length):

KV_size = 4 · L · H_kv · d · S    bytes

Approximate scenarios (FP16, GQA with H_kv = 8):

Model           L     S      KV size
Llama-3-8B      32    8K     ~1.1 GB
Llama-3-70B     80    8K     ~2.7 GB
Llama-3-70B     80    32K    ~10.7 GB
Llama-3-405B    126   8K     ~4.2 GB
Llama-3-405B    126   128K   ~67 GB

These are uncompressed. KV-quantization (INT8, sometimes INT4) cuts these by 2× or 4× at small accuracy cost.
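A small helper reproduces the table directly from the formula (FP16, GQA with H_kv = 8, head dim 128; decimal GB as in the table), and shows the INT8 halving:

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: float = 2.0) -> float:
    """2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_elem."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

GB = 1e9
rows = [("Llama-3-8B", 32, 8192), ("Llama-3-70B", 80, 8192),
        ("Llama-3-70B", 80, 32768), ("Llama-3-405B", 126, 8192),
        ("Llama-3-405B", 126, 131072)]
for name, layers, seq in rows:
    fp16 = kv_cache_bytes(layers, kv_heads=8, head_dim=128, seq_len=seq)
    print(f"{name:13s} S={seq:6d}: {fp16 / GB:6.1f} GB FP16, {fp16 / 2 / GB:5.1f} GB INT8")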

14.2 Transfer bandwidth

Link                                      Approx bandwidth
NVLink 4 (intra-node, H100)               600 GB/s
NVSwitch fabric (inter-node, NVL72)       ~900 GB/s per GPU
InfiniBand NDR                            50 GB/s per port
InfiniBand HDR                            25 GB/s per port
PCIe Gen5 x16                             64 GB/s (often less in practice)

Transfer time = KV_size / bandwidth.

For 3 GB over 50 GB/s InfiniBand: 60 ms. For 3 GB over 600 GB/s NVLink: 5 ms. For 10 GB over 50 GB/s: 200 ms.

14.3 Overlap with first decode step

The transfer can overlap with the first decode step on the receiving worker. The receiving worker needs the KV-cache for layer ℓ only when it computes attention at layer ℓ. If the prefill worker streams KV in layer order, and the decode worker is processing layers in the same order, the only thing that needs to arrive before decode can start is the KV for layer 0. After that, the rest can stream in parallel with computation, hiding most of the transfer latency.

This "layer-pipelined" KV transfer is implemented in DistServe and extensively in Mooncake. It is the engineering move that makes disaggregation production-viable.

14.4 Disaggregation pseudocode

DISAGGREGATED_REQUEST(prompt):
    # 1. Scheduler routing
    prefill_worker = scheduler.pick_prefill_worker(prompt)
    decode_worker  = scheduler.pick_decode_worker(prompt)

    # 2. Prefix cache lookup (if available)
    cached_kv, cache_offset = global_cache.lookup(prompt)
    if cache_offset == len(prompt):
        # full hit: skip prefill entirely
        kv = cached_kv
    else:
        # 3. Prefill (possibly chunked, possibly resuming from cache)
        prefill_worker.load_kv_prefix(cached_kv)
        kv = prefill_worker.prefill(prompt[cache_offset:])

    # 4. Streamed transfer to decode worker (layer-pipelined)
    transfer_handle = prefill_worker.stream_kv_to(decode_worker, kv)

    # 5. Decode worker waits for layer-0 KV, then begins
    decode_worker.await_layer(transfer_handle, layer=0)
    for tok in decode_worker.decode_loop(transfer_handle):
        yield tok          # stream to client

    # 6. Async write-back to global cache for future hits
    global_cache.insert_async(prompt, kv)

15. Combining the Techniques-A Production Stack

Each individual technique gives a multiplicative factor. Production-grade inference is a stack.

15.1 The full stack, ordered

A modern serving stack (2026) typically includes:

  1. Paged attention-non-contiguous KV-cache allocation, eliminates fragmentation. Enables (2)–(4).
  2. Continuous batching-token-level scheduling across requests. Pushes decode toward compute-bound.
  3. Chunked prefill-slices long prefills into chunks that fit alongside decode. Smooths TTFT under load.
  4. Prefix caching-global KV reuse across requests with shared prefixes. Eliminates redundant prefill.
  5. W4 weight quantization-4-bit weights with FP16 activations. Cuts memory traffic ~4×, important for decode.
  6. (Optional) Disaggregation-separate prefill and decode pools. Lets you hit both TTFT and TPOT SLOs at higher load.
  7. (Optional) Speculative decoding-adaptively enabled at low batch / high priority. Cuts per-request decode latency ~2×.

15.2 Hypothetical attribution

A back-of-envelope walkthrough of how each layer contributes. Numbers are illustrative, not measured.

Suppose a baseline naive implementation of Llama-3-70B serves at throughput 1× (whatever absolute units we choose) with TTFT and TPOT both far above SLO at moderate load.

  • Add paged attention: little throughput change at low load, but enables larger effective batch size (less wasted KV memory) → ~1.5× throughput at high load.
  • Add continuous batching: now at batch ≈ 64 effectively → ~5–10× throughput vs naive (decode crossing into compute-bound regime). TTFT only modestly improved.
  • Add chunked prefill: TTFT under load improves ~2–3×; throughput roughly flat.
  • Add prefix caching (chat workload, 60% prefix hit rate): effective prefill compute ~0.4× of original; TTFT improves another ~2×; throughput modestly better.
  • Add W4 quantization: decode bandwidth-bound regime improves ~3× (we read 4 bits of weight per active param instead of 16). Throughput at low-to-moderate batch ~2–3× better; high-batch gains smaller.
  • Add disaggregation: TTFT and TPOT can both meet SLO at higher load. SLO-attainable throughput at SLO ~3–5× the co-located version (per DistServe-class results, range approximate).
  • Add adaptive speculative decoding: low-batch / high-priority requests see ~2× lower TPOT.

Multiplied together, a fully-optimized stack lands roughly 50–200× the naive baseline on relevant metrics, depending on workload. These factors are illustrative and not promises. What matters is that they are roughly multiplicative-none of them subsumes the others; each addresses a different bottleneck.

15.3 What doesn't compose

A few combinations require care:

  • Speculation + high batch: as analyzed, hurts. Use adaptive policy.
  • Speculation + tree attention + paged attention: requires the paged attention kernel to support custom masks. Most modern kernels (FlashAttention v3, vLLM's paged kernels) do.
  • Disaggregation + prefix caching: requires the prefix cache to be global, not per-worker. Otherwise prefix hits collide with worker locality. This is exactly Mooncake's design.
  • Disaggregation + speculation: speculation lives entirely on the decode worker. The prefill worker doesn't need to know about it. Compose freely.

16. Frontier Directions (research-stage as of 2026)

These are active research areas. Treat as ideas to track, not as production techniques.

16.1 Continuous depth / early-exit (research-stage)

Not all tokens need all the model's layers to predict correctly. "Easy" tokens (function words, boilerplate) might be settled by layer 12 of a 32-layer model. Early-exit decoding adds a per-layer prediction head and exits when the head's confidence exceeds a threshold.

Status: works in research, but production deployment is rare because (a) the calibrated thresholds are workload-specific, and (b) early exit at layer ℓ produces a partial KV-cache (only layers up to ℓ), which other speculative-style techniques want to be complete. Active research as of 2026.

16.2 Multi-token prediction (Gloeckle et al., 2024) (research-stage in production deployment)

Train the model to predict the next N tokens directly, rather than just the next token, by adding N parallel output heads. The 2024 paper showed gains on code and reasoning benchmarks. Decode can then emit N tokens per forward pass (with verification, similar to Medusa).

Status: training-time technique, requires retraining the model. Some open-weight frontier models (2025–2026) include MTP heads natively. Production adoption growing but not yet ubiquitous.

16.3 Diffusion language models (research-stage)

Cast text generation as a diffusion process over the entire output sequence, denoising all positions in parallel. Several papers (2023–2026) demonstrate non-autoregressive parallel decoding, with quality approaching autoregressive baselines on some tasks.

Status: research-stage. Quality gap with autoregressive models has narrowed but not closed for general chat as of 2026. Watch this space-if the gap closes, decode is no longer sequential, and the entire framing of "decode is the bottleneck" changes.


17. Practical Exercises

Six problems. Treat them as if you were on a whiteboard with a 70B-model engineer; show the derivations.

Exercise 1-Derive the speedup formula

State and derive the speculative decoding throughput speedup formula S = α / (1 + K · T_draft / T_target). Identify each assumption and where it can break.

Solution sketch. Tokens per step: α (definition). Time per step: T_target,K + K · T_draft. Approximation: T_target,K ≈ T_target for K below the arithmetic-intensity ridge of the target model on the current hardware. Throughput: α / (T_target + K · T_draft). Baseline throughput: 1 / T_target. Ratio: α · T_target / (T_target + K · T_draft) = α / (1 + K · T_draft / T_target). Breaks when (a) batch is high enough that T_target,K is not approximately T_target (compute-bound regime), (b) draft model causes target cache contention (e.g., they share GPU memory and one evicts the other).

Exercise 2-Compute the speedup for given parameters

Given α = 3.5, K = 8, T_draft = T_target / 10, compute the expected speedup. Then redo with T_draft = T_target / 5. Then with α = 2.0.

Solution.

  • Base case: S = 3.5 / (1 + 8 · 0.1) = 3.5 / 1.8 ≈ 1.94×.
  • T_draft = T_target / 5: S = 3.5 / (1 + 8 · 0.2) = 3.5 / 2.6 ≈ 1.35×.
  • α = 2.0, T_draft = T_target / 10: S = 2.0 / 1.8 ≈ 1.11×. Marginal.

Lesson: speedup is sensitive to both α and T_draft / T_target. A weak draft (low α) or a slow draft (high T_draft) both kill the gain.

Exercise 3-Derive α from per-token acceptance β

Assume each draft token is accepted independently with probability β. Derive α = (1 − β^{K+1}) / (1 − β). Check the limits β → 0 and β → 1.

Solution sketch. Number of accepted draft tokens n is truncated geometric: P(n=k) = β^k(1−β) for k < K, P(n=K) = β^K. Expected accepted draft: E[n] = Σ_{k=0}^{K−1} k β^k (1−β) + K β^K = β (1 − β^K) / (1 − β). Plus 1 bonus / corrected token always: α = β(1 − β^K)/(1 − β) + 1 = (1 − β^{K+1})/(1 − β). Limit β → 0: α → 1 (just the corrected sample). Limit β → 1: α → K + 1 (every draft accepted plus bonus).

Exercise 4-KV-cache transfer budget

A disaggregated cluster runs Llama-3-70B at 32K context with FP16 KV (GQA, H_kv = 8, d = 128, L = 80). Transfer between prefill and decode workers is over a 50 GB/s link. (a) Compute KV-cache size per request. (b) Compute raw transfer time. (c) The decode worker's first decode step takes 30 ms. By overlapping transfer with the first decode step, what fraction of the transfer time can be hidden? (d) What if you switch to INT8 KV-cache?

Solution. (a) KV = 4 · 80 · 8 · 128 · 32768 = 10.7 GB (FP16, 2 bytes per element accounted for in the leading 4 = 2(K+V) · 2 bytes). (b) 10.7 GB / 50 GB/s = 214 ms. (c) The first 30 ms of transfer overlap with the first decode step. Hidden fraction: 30 / 214 ≈ 14%. Most of the transfer is not hidden by a single decode step. To fully hide, layer-pipelining: layer 0 transfer (≈ 134 MB at 50 GB/s ≈ 2.7 ms) finishes before decode starts; subsequent layers stream in parallel with later decode steps (not just the first). Each decode step is 30 ms; per-layer transfer is 10.7 GB / 80 / 50 GB/s ≈ 2.7 ms. As long as decode step time > per-layer transfer time, layer-pipelined transfer hides fully-true here. (d) INT8 halves KV to 5.35 GB, halves transfer to 107 ms; same layer-pipelining argument hides it even more easily.

Exercise 5-Adaptive speculation policy

Design the scheduler logic for adaptively enabling speculative decoding in a continuous-batching server. Inputs available: current_batch_size, priority flag per request, recent_α_estimate, T_draft, T_target, the configured draft length K, current GPU compute utilization. Output: per-request speculation enable/disable.

Solution sketch.

SHOULD_SPECULATE(request, batch_state, system):
    # Compute expected throughput with and without speculation
    # at the current batch size
    α      = recent_α_estimate
    T_d    = T_draft
    T_t    = T_target
    B      = batch_state.size
    util   = system.compute_utilization()    # 0..1, fraction of FLOPs used

    # Rough single-step time scales as max(memory_bound_time, compute_bound_time).
    # At low util the headroom for K-token verify is essentially free.
    # At high util the K-token verify multiplies cost ~K.
    headroom = 1.0 - util
    effective_K_cost = K * (1 - headroom * 0.7)    # heuristic

    spec_throughput = α / (1 + effective_K_cost * T_d / T_t)
    if request.priority == HIGH:
        # always favor latency for high priority, even at small loss
        return spec_throughput > 0.9
    return spec_throughput > 1.05                  # only when net gain

Real implementations measure α online per request class and re-evaluate the policy periodically.

Exercise 6-Workload routing for disaggregated serving

You operate a disaggregated cluster with 8 prefill workers (H100) and 16 decode workers (A100) serving Llama-3-70B. Three workload classes share the cluster:

  • (W1) Chat: 200-token prompts, 400-token replies, 40% of traffic.
  • (W2) RAG: 8K-token prompts, 100-token replies, 50% of traffic.
  • (W3) Code completion: 1K-token prompts, 50-token replies, 10% of traffic.

(a) Reason about whether the prefill/decode pool sizing is appropriate. (b) Propose routing rules. (c) Where should prefix caching matter most?

Solution sketch. (a) Prefill compute is roughly proportional to prompt length. Weighted prompt length = 0.4·200 + 0.5·8000 + 0.1·1000 = 80 + 4000 + 100 = 4180. Weighted reply length = 0.4·400 + 0.5·100 + 0.1·50 = 160 + 50 + 5 = 215. Decode compute is roughly proportional to reply length × batch utilization; it's bandwidth-bound, so what matters is how many concurrent requests we can keep in decode given KV memory. With ~3 GB KV per request average and ~80 GB usable per A100, ~26 concurrent decode requests per A100. With 16 A100s, ~420 concurrent decode slots.

The prefill load is dominated by RAG (W2). 8 H100s might be tight or generous depending on prefill tokens/sec per H100. Sketch: an H100 prefilling 70B at FP16 hits ~10K–20K tokens/sec (range approximate). 8 of them ≈ 100K tokens/sec. To serve a workload mix where each request is ~4180 prompt tokens on average, that's ~24 requests/sec arrival rate sustainable. If your traffic exceeds that, add prefill workers.

(b) Routing rules:

  • All workloads use the same prefill/decode pools.
  • Within prefill: prioritize W3 (low-latency code completion) over W2 (RAG can tolerate higher TTFT). Hold W2 in a chunked-prefill queue to avoid head-of-line blocking on the H100s.
  • Within decode: continuous-batch all three. Pin W3's decode to a smaller, lower-latency decode pool subset if TPOT SLOs differ.

(c) Prefix caching matters most for W1 (chat: huge multi-turn prefix overlap) and W2 (RAG: shared system prompt + retrieved-context overlap on popular queries). It matters least for W3 (code completion is mostly novel context per request, though shared system prompt for the IDE may still help).


18. Summary

The 2024–2026 inference frontier is a story about separating concerns that were always different.

  • Decode is sequential and memory-bound. Speculative decoding pulls the lever of converting sequential target steps into parallel verification of speculative branches, paid for with idle GPU compute that we already owned. The math gives 2–3× per-request speedup at low batch.
  • Prefill is parallel and compute-bound; decode wants the opposite hardware policy. Disaggregation finally separates them, giving each its own pool, each tuned independently, each potentially on different hardware. DistServe / Splitwise / Mooncake are the canonical references; gains of several factors at SLO are reported.
  • The two techniques compose with each other and with the rest of the stack (paged attention, continuous batching, chunked prefill, prefix caching, W4). Each contributes a multiplicative factor; full-stack production systems are 50–200× a naive baseline (illustrative, workload-dependent).
  • Each technique has a regime where it doesn't help. Speculation hurts at high batch. Disaggregation adds operational complexity and is unnecessary when load is so low that co-located scheduling never compromises. Quantization has accuracy costs. Knowing the regimes is the engineering work.

The reader who has internalized this chapter should be able to: derive the speculative speedup formula on demand; explain why decode is memory-bound and prefill compute-bound; sketch a disaggregated cluster's request flow and KV-transfer pipeline; argue for or against speculation given a batch state; size prefill/decode pools for a workload mix; and read papers on EAGLE, Medusa, DistServe, Splitwise, Mooncake without needing the introductions.

The next deep dive in this curriculum (Month 5, Week 18) builds on these primitives toward end-to-end production serving stacks.


References (canonical, for further reading)

  • Leviathan, Kalman, Matias. Fast Inference from Transformers via Speculative Decoding. ICML 2023.
  • Chen, Borgeaud, Irving, et al. Accelerating Large Language Model Decoding with Speculative Sampling. 2023.
  • Cai et al. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. 2024.
  • Li et al. EAGLE / EAGLE-2: Speculative Sampling Requires Rethinking Feature Uncertainty. 2024.
  • Fu et al. Lookahead Decoding. 2024.
  • Zhong et al. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving. OSDI 2024.
  • Patel et al. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. 2024.
  • Qin et al. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. 2024.
  • Gloeckle et al. Better & Faster Large Language Models via Multi-token Prediction. 2024.

(Citations are by name and year. Treat performance numbers in this chapter as approximate ranges from public reporting; reproduce on your own workload before relying on them.)
