
Deep Dive 08-LLM Inference Serving Systems

Paged attention, continuous batching, vLLM architecture, prefill/decode disaggregation, and the algebra that makes them necessary.

Self-contained. Reading the SOSP and OSDI papers afterward should feel like consolidation, not first contact.


0. Why this chapter exists

A trained LLM is, fundamentally, a function: given a prompt, produce a continuation. In a research notebook, this is one line. In production, it is a distributed system with throughput, latency, fairness, and memory-management problems that look more like an operating system than like deep learning.

This chapter rebuilds that system from first principles. We start by deriving-from arithmetic intensity alone-why serving an LLM is fundamentally a memory-bandwidth problem, why batching is the universal lever, and why the naive batching strategies fail. From there the architecture of vLLM (paged attention + continuous batching) is forced rather than chosen, and so are the more recent designs (chunked prefill, prefix caching, prefill/decode disaggregation).

A reader who finishes this chapter should be able to:

  1. Predict, within a factor of two, the throughput of a given model on a given GPU at a given batch size.
  2. Explain why doubling batch size at decode is nearly free in compute but expensive in KV memory.
  3. Sketch a paged-attention kernel and a continuous-batching scheduler from memory.
  4. Tune vLLM (or any descendant) without poking at flags blindly.
  5. Recognize when the right answer is "split prefill and decode onto different machines."

We will derive every claim. References to SOSP'23 (vLLM / PagedAttention, Kwon et al.), OSDI'22 (Orca, Yu et al.), OSDI'24 (DistServe, Zhong et al.), and the Sarathi-Serve work (Agrawal et al.) are pointers, not load-bearing-the math here stands alone.


1. The two phases of LLM inference

A decoder-only Transformer (GPT-style) has two operational regimes that look like the same forward pass but have very different performance characteristics. Understanding this is the foundation of everything else.

1.1 Prefill

When a request arrives with prompt of length P, the model must compute hidden states (and KV-cache entries) for all P positions before any new token can be sampled. Crucially, all P positions can be processed in parallel in a single forward pass:

  • The attention mask is causal but the matmuls are still over P × d and P × P matrices.
  • For each layer, you do roughly O(P · d²) for the QKV projection, O(P² · d) for attention scores, O(P · d²) for the FFN.
  • Per-token work is large; per-request work is P × per-token work.

Prefill is compute-bound on modern hardware. The GPU can saturate its tensor cores because the matmuls are tall (lots of rows)-there is plenty of arithmetic per byte of weight read.

1.2 Decode

After prefill, generation is autoregressive: produce token P+1, append to context, produce P+2, etc. Each step does a forward pass over a single new token (the previously sampled one), reusing the KV-cache for all earlier positions.

Per-step:

  • QKV projection on a 1 × d vector: O(d²) FLOPs, but it reads the full QKV weight matrices-O(d²) bytes.
  • Attention: query is 1 × d, keys/values are S × d_kv (where S is the current context length). Compute O(S · d), memory read O(S · d_kv)-proportional to context length.
  • FFN: 1 × d against d × 4d and 4d × d weights: O(d²) FLOPs, O(d²) memory.

Decode is memory-bandwidth-bound. We are reading huge weight tensors from HBM and doing a tiny dot-product against a single token vector.

1.3 The cost model (the equation that runs the chapter)

Let us define, for a model:

  • W = total weight footprint in bytes (dtype is already accounted for).
  • K(s) = KV-cache size in bytes for a sequence of length s.
  • B_HBM = HBM bandwidth, e.g., 3.35 TB/s for an H100 SXM.
  • F_peak = peak tensor-core throughput, e.g., ~990 TFLOP/s BF16 dense on H100.
  • b = batch size (number of sequences in this iteration).

Decode step time, when memory-bound (we will validate this assumption shortly), is:

$$ T_\text{decode-step}(b) \;\approx\; \frac{W + b \cdot K(s)}{B_\text{HBM}} $$

Read this carefully. The model weights W are read once per step-every sequence in the batch sees the same weights. But each sequence has its own KV-cache, and those do scale with b.

When compute-bound (prefill, or very large batch decode), the time becomes (b · per-token-FLOPs) / F_peak instead. The crossover batch size between regimes is what determines whether batching helps.

1.4 A worked example: Llama-3-70B on H100

  • Parameters: 70B × 2 bytes (BF16) = 140 GB. Doesn't fit on one H100 (80 GB)-assume 2× H100 with tensor parallelism, so each GPU holds ~70 GB.
  • KV-cache per token (Llama-3-70B): 80 layers × 8 KV heads × 128 head dim × 2 (K and V) × 2 bytes = 327,680 bytes/token ≈ 320 KB/token.
  • HBM bandwidth per GPU: 3.35 TB/s.
  • For a sequence at length 2048: KV size ≈ 2048 × 320 KB ≈ 640 MB.

At batch=1, decode step time ≈ (70 GB + 0.64 GB) / 3.35 TB/s ≈ 21 ms per token-each GPU streams its own 70 GB shard in parallel, so the shard read sets the step time. (Real implementations achieve roughly this.)

At batch=32, weights are still 70 GB but KV totals 32 × 0.64 GB = 20.5 GB. Step time ≈ (70 + 20.5) / 3.35 ≈ 27 ms. We've gone from 1 token / 21 ms to 32 tokens / 27 ms-roughly a 25× throughput improvement for a ~30% latency hit. That is why batching is the central lever.

Hold this picture: weight bytes amortize across the batch, KV bytes do not.
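
The arithmetic above is mechanical enough to script. A minimal sketch of the cost model in Python, using the H100 and Llama-3-70B numbers from this section (the helper names are ours, not from any library):

def decode_step_time(weight_bytes, kv_bytes_per_token, seq_len, batch, hbm_bytes_per_s):
    """Memory-bound decode step time: weights read once, KV read per sequence."""
    kv_bytes = batch * kv_bytes_per_token * seq_len
    return (weight_bytes + kv_bytes) / hbm_bytes_per_s

GB, KB, TB = 1e9, 1e3, 1e12
W = 70 * GB          # per-GPU weight shard, Llama-3-70B BF16 on 2x H100
KV_TOK = 320 * KB    # KV bytes per token, summed across all 80 layers
BW = 3.35 * TB       # H100 SXM HBM bandwidth, bytes/s

for b in (1, 32):
    t = decode_step_time(W, KV_TOK, 2048, b, BW)
    print(f"batch={b:3d}: {t*1e3:5.1f} ms/step, {b/t:7.0f} tok/s")
# batch=  1:  21.1 ms/step,      47 tok/s
# batch= 32:  27.2 ms/step,    1178 tok/s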


2. Why decode is memory-bound (rigorous derivation)

Every performance optimization in this chapter falls out of a single number: arithmetic intensity, the ratio of FLOPs done per byte read.

2.1 The roofline crossover

GPUs have a roofline: at low arithmetic intensity, you are bandwidth-limited (achieved_FLOPs = intensity × B_HBM). At high intensity, you are compute-limited (achieved_FLOPs = F_peak). The crossover happens at:

$$ I_\text{crossover} = \frac{F_\text{peak}}{B_\text{HBM}} $$

For H100 SXM BF16: 990 TFLOP/s ÷ 3.35 TB/s ≈ 295 FLOP/byte. (Numbers vary; 280–310 is the right ballpark.)

Below ~295 FLOP/byte, you are memory-bound. Above, compute-bound.

2.2 Decode arithmetic intensity at batch=1

Take the FFN block-typically the dominant compute. Llama-3-70B FFN is d=8192, intermediate d_ff ≈ 28672 (SwiGLU). Per token:

  • FLOPs: 2 · d · d_ff · 3 (the 3 is for SwiGLU's three matmuls) ≈ 2 · 8192 · 28672 · 3 ≈ 1.4 GFLOP.
  • Bytes read (weights): d · d_ff · 3 · 2 (BF16) ≈ 1.4 GB of weights for this one block per layer.

Intensity = 1.4 GFLOP / 1.4 GB = 1 FLOP/byte. That is 295× below the crossover. We are deep in the memory-bound regime.

Adding batch b keeps the bytes constant (still one weight read per step) and multiplies the FLOPs by b. So intensity scales linearly with batch:

$$ I_\text{decode}(b) \approx b \cdot 1\,\text{FLOP/byte} $$

We need b ≈ 295 to hit the compute roof. In practice nobody runs decode at batch=295 because (a) KV memory runs out long before that, and (b) you hit other limits-kernel launch overhead, attention scaling, etc. But the direction is right: bigger batch ⇒ closer to compute-bound ⇒ better hardware utilization.
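
These two numbers fall out of a few lines of arithmetic (same H100 constants as §2.1; a sketch, not a profiler):

d, d_ff = 8192, 28672
flops_per_token = 2 * d * d_ff * 3      # three SwiGLU matmuls, 2 FLOPs per MAC
weight_bytes = d * d_ff * 3 * 2         # BF16, read once regardless of batch

intensity = flops_per_token / weight_bytes   # 1.0 FLOP/byte at batch=1
crossover = 990e12 / 3.35e12                 # ~295 FLOP/byte on H100
print(intensity, round(crossover / intensity))  # need batch ~295 to hit the roof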

2.3 Attention is the wrinkle

The FFN is straightforwardly batch-friendly because all tokens share weights. Attention is not. Each sequence reads its own KV-cache. So in the attention block:

  • FLOPs: O(b · S · d) (for each of the b sequences, dot-product query against S KV entries).
  • Bytes: O(b · S · d_kv) (each sequence loads its own KV).

Intensity is independent of batch size for attention: ≈ d / d_kv, which is roughly the head-replication factor in GQA. For Llama-3-70B with 64 query heads and 8 KV heads, that's 8 FLOP/byte. Still way below 295.

This is why long contexts hurt so much-attention's bandwidth cost grows with sequence length but its arithmetic intensity does not. Even with infinite batching, attention stays memory-bound. (Flash-attention helps with on-chip SRAM tiling but cannot change the off-chip KV reads.)

2.4 Takeaway equation

$$ \boxed{\; T_\text{step}(b, S) \;\approx\; \underbrace{\frac{W}{B_\text{HBM}}}_{\text{weight read, amortized}} \;+\; \underbrace{\frac{b \cdot K_\text{per-token} \cdot S}{B_\text{HBM}}}_{\text{KV read, scales with } b \cdot S} \;} $$

Two terms: weights (great news for batching) and KV (bad news for batching). The whole architecture of vLLM is about managing the second term as aggressively as possible.


3. The big idea: batching saves bandwidth

Let's stare at the equation again with the question: "what does batching buy?"

  • Per-token throughput at batch b: b / T_step(b, S).
  • At small batch, T_step ≈ W/B_HBM (weights dominate). Throughput grows linearly with b.
  • At large batch, T_step ≈ b · K_per-token · S / B_HBM (KV dominates). Throughput plateaus.

The plateau happens when the KV term equals the weight term:

$$ b^* \approx \frac{W}{K_\text{per-token} \cdot S} $$

For Llama-3-70B at 2K context, b* ≈ 70 GB / 640 MB ≈ 110. In principle we want batch ~100. In practice we are limited by total KV memory.

KV budget: each GPU has 80 GB. Weights take 70 GB. That leaves ~10 GB for KV-cache. At 320 KB/token, that is 30K tokens total across the batch. At 2K context per sequence, that's at most 15 concurrent sequences.

So the physics says we want batch=100, but the capacity says batch=15. The gap between these is exactly the design pressure that produced paged attention. Every byte of KV memory wasted is a sequence we can't serve.
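
The physics-vs-capacity gap is worth computing explicitly (constants from §1.4; a sketch):

W, KV_TOK, S = 70e9, 320e3, 2048
b_star = W / (KV_TOK * S)               # where the KV term catches the weight term
kv_budget = 10e9                        # HBM left after the 70 GB weight shard
b_capacity = kv_budget / (KV_TOK * S)   # what memory actually allows
print(round(b_star), round(b_capacity)) # ~107 vs ~15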

3.1 Why weight-amortization scales the way it does

Subtle but important: the weight read is once per iteration, not once per request. So as long as you can pack b sequences into one forward pass, the weight bandwidth cost is fixed. This is what makes "continuous batching" so powerful-every iteration we re-fill the batch up to capacity, so the GPU never sees a half-empty batch.


4. KV-cache management-the central problem

A sequence of length S needs S · K_per-token bytes of KV-cache. Two facts make this hard:

  1. S is unknown at admission time. Generation continues until an EOS token or max_tokens. We may stop after 5 tokens or 5000.
  2. S grows during the sequence's lifetime. The KV-cache for a sequence is appended to at every decode step.

4.1 The naive approach: contiguous per-request buffer

For each request, allocate a contiguous buffer of size max_seq_len · K_per-token upfront. Fill it as tokens arrive.

Problems:

  • Internal fragmentation. A 100-token output in a 4096-token buffer wastes 97% of the allocation.
  • External fragmentation. Buffers of varying sizes leave holes the allocator can't fill.
  • No prefix sharing. Two requests with the same system prompt store identical KV bytes twice.
  • Kills batch size. With 10 GB of KV budget and 4K-token reservations, you fit 10 GB / (4096 · 320 KB) ≈ 8 sequences.

Numerically: in a 70B-on-80GB setup, naive contiguous allocation caps batch at ~8. The physics said 100. We are leaving an order of magnitude of throughput on the floor.

4.2 What we want from a KV allocator

  • Fine-grained allocation-give a sequence one chunk at a time, as it grows.
  • No fragmentation-chunks are uniform size, so any free chunk fits any need.
  • Cheap relocation-but actually, no relocation: chunks are virtually addressed.
  • Sharing-distinct sequences with identical prefix-content share the same physical chunks.
  • Cheap GC-reference counting is enough.

This is, almost word for word, the description of a virtual-memory system. Which is exactly the analogy Kwon et al. drew.


5. Paged attention (the heart of vLLM)

PagedAttention treats KV-cache like an OS treats process memory: it is virtually contiguous but physically paged.

5.1 The block pool

HBM (minus weights and activations) is carved up into uniform physical blocks. Each block holds the KV-cache for B_block consecutive token positions in one sequence:

block_size_bytes = B_block * num_kv_heads * head_dim * 2 (K and V) * dtype_bytes

For Llama-3-70B with B_block = 16, BF16: 16 · 8 · 128 · 2 · 2 = 65,536 bytes = 64 KB per block per layer. Across 80 layers: 5.12 MB per block (this is the per-sequence cost of one block of KV across the whole stack). With 10 GB of KV budget that's ~2000 blocks total-enough for 2000 · 16 = 32K tokens distributed however we like.

B_block = 16 is the typical choice-large enough to amortize indirection overhead, small enough to keep wasted memory under one-block-per-sequence.

The block pool is just an array P[0..N-1] of these fixed-size buffers, plus a free-list F of unused indices.
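
The block geometry as code, reproducing the numbers above (the helper name is ours):

def block_bytes(block_size, num_kv_heads, head_dim, num_layers, dtype_bytes=2):
    per_layer = block_size * num_kv_heads * head_dim * 2 * dtype_bytes  # 2 = K and V
    return per_layer, per_layer * num_layers

per_layer, per_block = block_bytes(16, 8, 128, 80)
print(per_layer, per_block)          # 65,536 B/layer; ~5.2 MB across the stack
print(int(10e9 // per_block) * 16)   # ~30K tokens of KV in a 10 GB pool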

5.2 The per-request page table

Each in-flight request has a block table-a list of physical block indices, in logical order:

request 7's KV at logical position 145
  → block 145 // 16 = 9 in the table
  → physical block index = block_table[7][9]
  → byte offset within block = (145 % 16) * per_token_kv_bytes

This is byte-for-byte the same as a page table mapping virtual addresses to physical frames. The block table is small (one int per 16 tokens-8K context = 512 entries = 2 KB).
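
The lookup as a function. The ten-entry block table below is hypothetical-just long enough to hold position 145:

def kv_location(block_table, pos, block_size=16, per_token_kv_bytes=327_680):
    """Logical token position -> (physical block index, byte offset in block)."""
    return block_table[pos // block_size], (pos % block_size) * per_token_kv_bytes

table = [42, 17, 99, 23, 7, 88, 3, 61, 12, 55]   # hypothetical page table
print(kv_location(table, 145))                    # (55, 327680): entry 9, slot 1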

5.3 The block manager

The block manager owns the free list and exposes:

import math

class BlockManager:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_list = list(range(num_blocks))   # indices into the block pool
        self.refcount = {}

    def can_allocate(self, request) -> bool:
        needed = math.ceil(request.context_len / self.block_size)
        return len(self.free_list) >= needed

    def allocate(self, request):
        # pop blocks from the free list, build the initial block table
        needed = math.ceil(request.context_len / self.block_size)
        request.block_table = [self.free_list.pop() for _ in range(needed)]
        for block in request.block_table:
            self.refcount[block] = 1

    def append_slot(self, request):
        # decode added a token; if the last block is full, allocate a new one
        if request.last_block_full():
            block = self.free_list.pop()
            request.block_table.append(block)
            self.refcount[block] = 1

    def free(self, request):
        for block in request.block_table:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                self.free_list.append(block)

    def fork(self, parent_request, child_request):
        # for prefix sharing or beam search:
        # child reuses parent's blocks; bump refcount
        for block in parent_request.block_table:
            self.refcount[block] += 1
        child_request.block_table = list(parent_request.block_table)

Reference counting is doing the heavy lifting for prefix sharing-see §10.

5.4 The paged attention kernel

The naive attention kernel reads K and V from contiguous tensors. The paged kernel takes the block table as an extra input and does an indirect read:

for each query token q in this iteration:
    for each block_idx in block_table[q.request_id]:
        load K_block from P[block_idx]   # gather from non-contiguous memory
        accumulate q · K_block^T into scores
    softmax(scores)
    for each block_idx in block_table[q.request_id]:
        load V_block from P[block_idx]   # second gather pass
        accumulate scores · V_block into output
# (real kernels fuse the two passes with an online softmax, flash-style)

The gather is the cost of indirection. Measured overhead in the original paper: ~10% slowdown vs a contiguous-tensor kernel. This is the price of fragmentation-freedom, and it's a steal: you typically triple your effective batch size, so you net 2.5× throughput after eating the 10% kernel overhead.

Implementation tricks that keep the cost down:

  • Block size is a compile-time constant (e.g., 16) so loops unroll.
  • Within a block, KV is contiguous → vectorized load.
  • Block table is fetched once per query, kept in registers/shared memory.
  • Flash-attention-style on-chip tiling still applies; the page table just controls where the K/V tiles come from.

5.5 Memory waste analysis

The only wasted memory is in the last partial block of each sequence. In the worst case, a sequence of length 17 with block_size=16 occupies 2 blocks but uses only 17/32 of them. Across b sequences, total waste is at most b · block_size · K_per-token bytes-bounded, small.

Compare to naive: waste is b · (max_seq_len − actual_len) · K_per-token, unbounded relative to actual usage. The difference is 1-2 orders of magnitude in real workloads.

5.6 Diagram: block table

Logical KV view (request 7, 50 tokens so far, block_size=16):

  positions: [0..15]   [16..31]  [32..47]  [48..49]
  block #:     0          1         2         3 (partial)

block_table[7] = [42, 17, 99, 23]
                  │   │   │   │
                  ▼   ▼   ▼   ▼
Physical block pool:
  [ ... | block 17: tokens 16-31 | ... | block 23: tokens 48-49 (partial) | ... | block 42: tokens 0-15 | ... | block 99: tokens 32-47 | ... ]

Physically scattered, logically contiguous. Just like virtual memory.

5.7 Exercise (do it now, mentally)

How many blocks does an 8K-context sequence need at block_size=16? 512. How much KV memory is that for Llama-3-70B BF16? 512 · 80 layers · 64 KB/layer/block = 2.6 GB. Sanity-check against §1.4: 8192 · 320 KB/token = 2.6 GB. Match.


6. Continuous batching (Orca's iteration-level scheduling)

Paged attention solves the spatial problem (memory). Continuous batching solves the temporal problem (when do we run which sequence?).

6.1 The naive (request-level) batching strategy

The dumb scheduler:

batch = first N requests in queue
run them all in parallel until ALL are done
return results

This wastes massive amounts of GPU time. Suppose 8 requests batch together, 7 finish at token 50 but one runs to token 2000. For 1950 decode steps, the GPU runs a batch of one. We just gave back 7/8 of our throughput.

6.2 The Orca insight: schedule per iteration, not per request

What if the scheduler runs between every decode step?

loop forever:
    1. select a batch from runnable requests, subject to memory + compute budget
    2. run one forward pass over the batch
    3. for each sequence in batch:
         - sample next token
         - if EOS or max_tokens: mark finished, free its blocks
         - else: append token, request stays runnable
    4. admit new requests if there's spare capacity

Finished requests leave immediately. Newly-arrived requests join immediately. The batch is continuously refilled-hence "continuous batching."

6.3 The scheduler in pseudocode

def schedule_iteration(scheduler):
    # 1. Take all currently-running requests (those with prefilled KV)
    running = scheduler.running_queue

    # 2. Drop any that finished last step
    running = [r for r in running if not r.finished]

    # 3. Try to grow the batch by admitting waiting requests
    waiting = scheduler.waiting_queue
    while waiting and scheduler.can_admit(waiting[0]):
        req = waiting.popleft()
        scheduler.block_manager.allocate(req)  # for prefill
        running.append(req)

    # 4. Each running req consumes one decode step (or one prefill chunk - see §8)
    for r in running:
        scheduler.block_manager.append_slot(r)  # may allocate a block

    # 5. If memory pressure, evict some running reqs (preempt - see §9)
    while scheduler.over_budget():
        victim = scheduler.choose_victim()
        scheduler.preempt(victim)
        running.remove(victim)
        scheduler.waiting_queue.appendleft(victim)  # priority for re-admission

    return running

The interesting part: step 3 (admission) and step 5 (preemption) are policy. FCFS by default. You can plug in priority, fairness, deadline-aware, etc.-Orca's contribution was the mechanism; the policy is yours.

6.4 Why this is so much better

  • No long-tail blocking. A 5-token request and a 5000-token request that arrived together don't share fate.
  • Constant-batch GPU utilization. As long as the queue is non-empty, the scheduler keeps the running batch near capacity.
  • Bounded latency overhead. The scheduler runs in microseconds; the forward pass takes tens of milliseconds. Scheduler overhead is <1%.

In production, continuous batching alone (without paged attention) typically gives 2-4× throughput vs request-level batching. Combined with paged attention: 5-10×.

6.5 The "iteration" subtlety

In Orca's paper an "iteration" is a single forward pass that processes one new token per running sequence, plus prefill on any newly-admitted sequences. In modern vLLM, prefill and decode mix more flexibly-see §8.


7. Putting it together: vLLM architecture

We now have all the parts. Here is the system, in dependency order.

7.1 Components

                      ┌─────────────────┐
   HTTP/OpenAI API ─→ │   API Server    │
                      └────────┬────────┘
                               │ enqueue request
                      ┌─────────────────┐
                      │     Engine      │  (main loop, blocking)
                      │  ┌───────────┐  │
                      │  │ Scheduler │  │  picks batch each iter
                      │  └─────┬─────┘  │
                      │        │        │
                      │  ┌─────▼─────┐  │
                      │  │  Block    │  │  paged-KV alloc
                      │  │  Manager  │  │
                      │  └─────┬─────┘  │
                      │        │        │
                      │  ┌─────▼─────┐  │
                      │  │   Model   │  │  forward pass
                      │  │ Executor  │  │  (paged attn kernel)
                      │  └─────┬─────┘  │
                      │        │        │
                      │  ┌─────▼─────┐  │
                      │  │  Sampler  │  │  next token per seq
                      │  └─────┬─────┘  │
                      └────────┼────────┘
                               │ output tokens
                      ┌─────────────────┐
                      │ Output Processor│  detokenize, stream
                      └─────────────────┘

API server. HTTP front-end, often OpenAI-compatible. Translates requests into the engine's internal format, exposes streaming responses.

Engine. The orchestrator. Owns the scheduler, block manager, model executor, sampler, output processor. Single-threaded main loop.

Scheduler. Decides which requests run this iteration. Maintains running and waiting queues. Enforces memory budget, scheduling policy, max-batch-size, max-batched-tokens.

Block manager. Owns the physical block pool, free list, per-request block tables, refcounts. Exposes allocate, append_slot, free, fork, swap_out, swap_in.

Model executor. Runs the forward pass on whatever batch the scheduler hands it. Internally calls paged-attention kernels with block tables. May span multiple GPUs (tensor or pipeline parallel).

Sampler. Given final-layer logits, samples the next token per sequence. Supports temperature, top-k, top-p, repetition penalties, beam, etc.

Output processor. Detokenizes, accumulates output, streams partial completions back via the API server.

7.2 The engine main loop

class Engine:
    def __init__(self, model, scheduler, executor, sampler, out_proc):
        self.model = model
        self.scheduler = scheduler
        self.executor = executor
        self.sampler = sampler
        self.out_proc = out_proc

    def step(self):
        # 1. Build the next batch
        batch = self.scheduler.schedule_iteration()
        if not batch:
            return  # nothing to do, idle

        # 2. Forward pass: prefill chunks + decode steps mixed
        logits = self.executor.forward(
            batch,
            block_tables=self.scheduler.block_manager.tables_for(batch),
        )

        # 3. Sample next tokens
        next_tokens = self.sampler.sample(logits, batch.sampling_params)

        # 4. For each request, append token, check stop conditions
        for req, tok in zip(batch, next_tokens):
            req.append_token(tok)
            if self.is_done(req, tok):
                req.finish()
                self.scheduler.block_manager.free(req)
            self.out_proc.emit(req, tok)

    def run(self):
        while True:
            self.handle_new_requests()
            self.step()

The whole system is a tight loop around step(). Iteration time is typically 10-50 ms. The scheduler runs at the same cadence-a key reason latency is bounded.

7.3 Where everything plugs in

  • Tensor parallelism lives inside executor.forward-each GPU runs a shard of the layers, NCCL all-reduces between shards. Transparent to the scheduler.
  • Pipeline parallelism turns the loop into a pipelined sequence of micro-batches. More involved.
  • Quantization (INT8, INT4, FP8) changes weight bytes but not the architecture.
  • LoRA / multi-tenant adapters plug into the executor: it composes the base weights with the active adapter at forward time.

8. Mixing prefill and decode in continuous batching

A subtle but very important point. Prefill and decode have different compute profiles:

  • A prefill of 1000 tokens: ~1000× the FLOPs of a single decode step on the same model.
  • A decode step: tiny FLOPs but reads all weights.

If you run a "pure prefill" iteration, decode requests stall (TPOT spikes). If you run a "pure decode" iteration, prefill requests stall (TTFT spikes). Naively alternating wastes hardware.

8.1 Mixed batches

Modern engines run mixed iterations: a single forward pass that includes prefill tokens for newly-admitted sequences and decode tokens for already-running ones. The kernel is careful about the per-token positions and the per-sequence KV layouts, but the high-level idea is:

  • The batch has N_decode sequences each contributing 1 token, plus N_prefill chunks contributing many tokens.
  • Total tokens in this iteration: T = N_decode + sum(prefill_chunk_lengths).
  • Forward pass is one giant matmul over T tokens.

8.2 Chunked prefill (Sarathi-Serve)

Problem: a 4000-token prefill in one chunk takes much longer than a 1-token decode. Mixing them naively means the iteration is gated by the prefill length-TPOT spikes for everyone.

Sarathi-Serve's fix: bound each iteration's total token budget. Split long prefills into chunks of, say, 512 tokens. Each iteration processes min(remaining_prefill, chunk_size) prefill tokens for that request, alongside 1 decode token per running request. Total token count per iteration is roughly constant ⇒ wall-clock per iteration is roughly constant ⇒ TPOT is bounded.

This is a knob: max_num_batched_tokens in vLLM. Setting it to, say, 2048 says "every iteration's prefill+decode work, in token units, sums to at most 2048." Lowers TPOT variance dramatically; slightly hurts throughput on pure-prefill workloads.

8.3 Pseudocode for chunked prefill

def build_iteration_batch(scheduler, max_batched_tokens):
    batch = []
    budget = max_batched_tokens

    # 1. Decode steps for all running sequences (1 token each)
    for r in scheduler.running:
        batch.append(DecodeStep(r))
        budget -= 1

    # 2. Fill the rest with prefill chunks
    for r in scheduler.waiting + scheduler.partially_prefilled:
        if budget <= 0: break
        chunk_len = min(r.remaining_prefill, budget)
        if chunk_len < MIN_CHUNK and budget < r.remaining_prefill:
            continue  # don't bother with tiny chunks unless we'll finish
        batch.append(PrefillChunk(r, chunk_len))
        budget -= chunk_len

    return batch

The scheduler is balancing two competing pressures: keep TPOT low (cap the iteration size) and keep TTFT low (admit prefills aggressively). Both pressures are knobs the operator sets per-deployment.

8.4 Why mixed batches are bandwidth-good

The weight read is still once per iteration. Whether the iteration has 32 decode tokens or 32 prefill tokens or 16+16, the weight bytes are constant. So mixing prefill into spare decode capacity is free in bandwidth. It's only paid in extra FLOPs-which we have plenty of when memory-bound.

This is why chunked prefill is such a clean throughput win: it converts "spare bandwidth" (idle decode iteration) into "useful prefill progress."


9. Eviction and preemption

What happens when a new admission would push KV-cache over budget, and we have no spare blocks? The scheduler evicts a victim sequence.

9.1 Two strategies

Swap (CPU offload). Move the victim's KV-cache blocks from HBM to CPU pinned memory. When the sequence is rescheduled, swap them back. Cost: PCIe bandwidth, each way. For Llama-3-70B, swapping a 2K-context sequence means moving 2K · 320 KB = 640 MB; at ~25 GB/s of effective PCIe bandwidth that is ≈ 25 ms one-way, 50 ms round-trip. Real, but bounded.

Recompute. Drop the KV-cache. When the sequence is rescheduled, re-run prefill on its prompt + already-generated tokens. Cost: a full prefill, which can be cheap (chunked prefill is fast) for short contexts but expensive for long ones.

9.2 Choosing the right strategy

  • Swap wins when (a) PCIe bandwidth is plentiful and (b) prefill is expensive (long context).
  • Recompute wins when (a) generation is short (tiny KV but full prompt prefill is short) and (b) you don't want to manage a CPU-side buffer.

vLLM exposes swap_space (GiB of CPU RAM reserved)-set to nonzero to enable swap. Otherwise it recomputes.
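
A back-of-envelope comparison of the two preemption costs. The rates are assumptions (≈25 GB/s effective PCIe from §9.1, ≈5000 tok/s batch-1 prefill from Exercise 5 in §15), and real recompute is cheaper than this sketch suggests because it rides along in chunked-prefill batches:

def swap_ms(ctx, kv_per_tok=320e3, pcie=25e9):
    return ctx * kv_per_tok / pcie * 1e3   # one-way HBM <-> host copy

def recompute_ms(ctx, prefill_tok_s=5000):
    return ctx / prefill_tok_s * 1e3       # re-run prefill on resume

for ctx in (512, 2048, 8192):
    print(ctx, round(swap_ms(ctx)), round(recompute_ms(ctx)))
# 512: 7 vs 102 ms;  2048: 26 vs 410 ms;  8192: 105 vs 1638 ms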

9.3 Victim selection

Default policy: evict the most recently admitted (LIFO)-protects long-running sequences that have already invested compute. Other policies: lowest priority, longest-remaining, fairness-based.

def choose_victim(running):
    # vLLM default: most recently scheduled
    return running[-1]

The choice matters for tail latency but not for throughput.

9.4 Preemption pseudocode

def preempt(self, req, mode):
    if mode == "swap":
        for block in req.block_table:
            cpu_block = self.swap_pool.allocate()
            copy_hbm_to_cpu(block, cpu_block)
            req.swapped_blocks.append(cpu_block)
        self.block_manager.free(req)  # return HBM blocks
        req.state = "swapped"
    elif mode == "recompute":
        self.block_manager.free(req)
        req.state = "waiting"   # will re-prefill on resume
        req.kv_dropped = True

Preemption is rare in well-tuned systems. If it happens often, you've over-admitted; lower max_num_seqs.


10. Prefix caching

Most production LLM workloads share prefixes:

  • System prompts: every chat call from an app starts with the same 500-token system message.
  • Multi-turn conversations: turn N+1 has all of turn N's prefix.
  • RAG: long retrieved-document prompts repeat across queries about the same doc.
  • Few-shot prompts: shared examples across requests.

If two sequences share a prefix of length L, they have identical KV-cache for those L positions (because attention is causal-KV at position i only depends on tokens 0..i). Storing them twice is waste.

10.1 The mechanism

Block-level content hashing. When a prefill produces a full block (16 tokens):

block_hash = hash(parent_block_hash, tokens[block_start:block_end])

This is a cumulative hash-the block at position k*16 depends on the hashes of all earlier blocks. So two requests with an identical first-L-token prefix produce identical block hashes for all blocks fully inside that prefix.

The block manager keeps a hash → block_idx map. On a hit:

def maybe_share_block(self, req, block_idx_in_seq, block_hash):
    if block_hash in self.hash_table:
        existing_block = self.hash_table[block_hash]
        self.refcount[existing_block] += 1
        req.block_table[block_idx_in_seq] = existing_block
        return True   # reused
    return False

Reference counting handles cleanup: when a block's refcount drops to zero (all referring sequences finished), it goes back on the free list.
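
A sketch of the cumulative hash chain, using hashlib in place of whatever production hash the engine actually uses:

import hashlib

def chain_hash(parent: bytes, block_tokens: tuple) -> bytes:
    h = hashlib.sha256(parent)
    h.update(repr(block_tokens).encode())
    return h.digest()

prompt = tuple(range(48))        # three full blocks of 16 tokens
hashes, parent = [], b"root"
for start in range(0, len(prompt), 16):
    parent = chain_hash(parent, prompt[start:start + 16])
    hashes.append(parent)
# any request sharing the first 32 prompt tokens reproduces hashes[0] and
# hashes[1] exactly, so its first two blocks map to the same physical blocks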

10.2 Copy-on-write

Subtle: a shared block is read-only-the sampled tokens diverge after the prefix. But the prefix blocks are never written to during decode (KV at those positions is fixed). So no CoW is needed for prefix sharing per se.

For beam search or speculative decoding, where two children of one parent diverge mid-block, CoW is needed: when child writes, allocate a new block, copy the parent's prefix portion, then write. Same block manager primitive-just a different caller.

10.3 Hit rates in production

This is one of those features where the empirical numbers are larger than you'd expect. In real chat workloads (multiple tenants, shared system prompts, multi-turn):

  • System prompts shared across all users: hit rates above 50% on prefix tokens are routine.
  • Multi-turn conversations: each turn re-uses N-1 turns' worth of KV. Effective compute reduction at decode is small (KV is already there) but prefill cost goes to near zero-you only prefill the new user message.
  • Mass RAG over a fixed corpus: prompts share the retrieved-doc tokens, hit rates can exceed 70%.

Because prefill is the expensive part, prefix caching often dominates the throughput win for chat-style workloads. Enable it (enable_prefix_caching=True) by default.

10.4 Eviction policy for prefix blocks

A prefix block with refcount=0 doesn't have to be freed-it might be useful later. vLLM (and similar systems) keep evicted-but-cached blocks on a separate "evictor" list (LRU). When new allocation needs space, evict from there first. This way the prefix cache is bounded by KV memory, not by active sequences.
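
A minimal sketch of that evictor (the class and method names are ours, not vLLM's):

from collections import OrderedDict

class LRUEvictor:
    """Refcount-0 blocks whose cached contents might still get prefix hits."""
    def __init__(self):
        self.by_block = OrderedDict()   # block_idx -> block_hash, oldest first

    def add(self, block_idx, block_hash):
        self.by_block[block_idx] = block_hash

    def revive(self, block_hash):
        # prefix-cache hit on a freed block: reclaim it instead of recomputing
        for idx, h in self.by_block.items():
            if h == block_hash:
                del self.by_block[idx]
                return idx
        return None

    def evict_one(self):
        # allocator needs a free block: sacrifice the least-recently-freed one
        idx, _ = self.by_block.popitem(last=False)
        return idx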


11. Speculative decoding (preview-full treatment in deep dive 10)

Decoding one token per step is wasteful when the model is memory-bound: we read all weights to do a tiny matmul. Idea: have a small "draft" model produce K candidate tokens cheaply, then verify them in one big-model forward pass.

draft_tokens = draft_model.generate(K)            # cheap, K small-model steps
target_logits = target_model.forward(draft_tokens) # ONE big-model step
accept tokens that match target's argmax (or pass rejection sampling)
keep prefix of accepted tokens; resume from first reject

Why it works: the big model's forward pass over K tokens is roughly the same wall-clock as over 1 token (memory-bound-weights dominate). So we get up to K tokens per big-model step, at the cost of one cheap draft pass and one big verification pass.

11.1 Integration with paged attention

The verification pass is a "K-token decode"-we extend each sequence's KV-cache by up to K positions in one go. Block manager must allocate up to K new slots upfront (or fewer if blocks have room). After verification, if only k < K are accepted, the manager must truncate the block table and reset positions for the rejected suffix.

def speculative_step(req, draft_tokens, target_logits):
    prev_len = req.num_tokens   # sequence length before this step
    K = len(draft_tokens)

    # Append slots for K candidate tokens
    for _ in range(K):
        block_manager.append_slot(req)

    # Verify
    accepted = []
    for draft, target in zip(draft_tokens, target_logits):
        if accept(draft, target):
            accepted.append(draft)
        else:
            break

    # Truncate KV to len(accepted) + 1 (the corrected token)
    block_manager.truncate(req, prev_len + len(accepted) + 1)
    return accepted + [target_argmax_at_first_reject]

Realistic acceptance rates with a good draft model: 60-80%, giving ~2-3× decode throughput.
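
The 2-3× figure follows from the standard acceptance model of the speculative-decoding literature: if each draft token is accepted independently with probability α, the expected tokens gained per verification pass is the truncated geometric series (1 − α^{K+1})/(1 − α):

def expected_tokens_per_verify(alpha: float, k: int) -> float:
    # 1 + alpha + alpha^2 + ... + alpha^k: the accepted prefix, plus the
    # corrected token emitted at the first rejection
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.7, 0.8):
    print(alpha, round(expected_tokens_per_verify(alpha, 4), 2))
# 0.6 -> 2.31, 0.7 -> 2.77, 0.8 -> 3.36 tokens per big-model step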

This is a preview; deep dive 10 derives the rejection sampler, draft-model selection, and tree-based variants (Medusa, EAGLE).


12. Prefill/decode disaggregation

The most recent (and perhaps most consequential) shift in inference architecture: stop running prefill and decode on the same workers.

12.1 The problem with co-location

Recall §1: prefill is compute-bound, decode is memory-bound. They want different things from the hardware:

  • Prefill is happy with low memory bandwidth, high FLOPs. Likes large batches of tokens (which it has-many prompt tokens). Hits the FLOP roof easily.
  • Decode is starved for bandwidth, indifferent to FLOPs. Wants large batches of sequences but each is just 1 token.

When you mix them on the same GPU:

  • The mixed batch cannot be optimal for either. Sarathi-Serve's chunked prefill (§8) is the best you can do with co-location, but you're still leaving throughput on the table.
  • A long prefill iteration spikes TPOT for all the decode-only requests in that batch.
  • You can't pick different hardware for the two phases (e.g., use H100 for prefill, A100 for decode).

12.2 The disaggregation idea

Run two pools:

  • Prefill workers: optimized for compute. Run pure prefill, no decode. Smaller KV-cache footprint (sequences leave after prefill).
  • Decode workers: optimized for memory bandwidth and capacity. Run pure decode batches.

Workflow:

  1. Request arrives at router.
  2. Router sends it to a prefill worker.
  3. Prefill worker runs prefill, produces final-layer hidden states + KV-cache for the prompt.
  4. KV-cache is transferred to a decode worker.
  5. Decode worker generates tokens until completion.

12.3 The KV-cache transfer

This is the load-bearing engineering piece. KV-cache for a 2K-token Llama-3-70B prompt is 640 MB (§1.4). Moving that over PCIe takes ~25 ms; over NVLink, single-digit ms; over RDMA (200 Gbps InfiniBand), ~25 ms.

Key tricks:

  • Layer-by-layer streaming: send layer L's KV while prefill computes layer L+1. Hides most of the transfer behind computation.
  • Zero-copy + RDMA: GPUDirect-RDMA writes the KV-cache straight from prefill GPU's HBM to decode GPU's HBM, without CPU involvement.
  • Block-aligned transfers: send paged-attention blocks as units; integrate with block manager on the receiving side.

The transfer cost is amortized over all the decode tokens: if the prompt generates 200 output tokens, 25 ms of transfer is 25 / (25 + 200·30ms) ≈ 0.4% overhead. Cheap.
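
The same amortization argument as code (link bandwidth and TPOT are the assumptions stated above):

def transfer_overhead(prompt_len, out_tokens, link=25e9,
                      kv_per_tok=320e3, tpot_s=0.030):
    transfer_s = prompt_len * kv_per_tok / link
    total_s = transfer_s + out_tokens * tpot_s
    return transfer_s, transfer_s / total_s

t, frac = transfer_overhead(2048, 200)
print(f"{t*1e3:.0f} ms transfer, {frac:.1%} of request time")  # 26 ms, 0.4%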

12.4 Why it helps

DistServe (Zhong et al., OSDI'24) and Mooncake show:

  • Prefill workers run at much higher GPU utilization (80%+ of FLOP roof) than co-located workers.
  • Decode workers run at higher batch sizes (no prefill stealing KV memory) → better bandwidth amortization.
  • The two pools can be sized independently to match workload mix.
  • Different SLO classes can be honored independently-TTFT comes from prefill pool, TPOT from decode pool.

End-to-end throughput improvements: 1.5-3× over co-located continuous batching, depending on workload.

12.5 When not to disaggregate

  • Tiny deployments (1-2 GPUs): the network overhead and operational complexity dominate.
  • Workloads where prefill is small (chat with short messages, mostly decode)-co-located handles fine.
  • Workloads where prefill dominates (long prompts, short outputs)-you may want all workers to be prefill-shaped.

Disaggregation pays off when you have enough scale to dedicate GPUs and a workload mix that genuinely has both regimes active.

12.6 Splitwise/Mooncake variants

  • Splitwise: similar idea, distinguishing "prompt" and "token" phases. Deployed in Microsoft production.
  • Mooncake (Moonshot AI's Kimi): pushes further, treats KV-cache as a first-class distributed object with its own caching tier (HBM → DRAM → SSD), and routes requests to maximize prefix-cache hits at the global level.

13. Performance metrics and SLOs

You cannot tune what you cannot measure. The standard metrics:

13.1 TTFT-Time to First Token

End-to-end time from request arrival to first generated token. Components:

  • Queue wait: how long the request sits in waiting_queue before scheduling.
  • Prefill latency: time to process the prompt. Roughly P · per-token-prefill-time for prompt length P.
  • First-decode latency: one decode step time.

For a chat UI, TTFT determines "feels responsive." Typical SLO: p99 < 1 second for short prompts.

13.2 TPOT-Time Per Output Token

Once decoding starts, average decode step time. Determines streaming output speed. Typical SLO: p99 < 100ms per token (10 tok/s minimum) for chat. For high-quality experiences: 30-50ms (20-30 tok/s).

TPOT is dominated by per-iteration time, which is dominated by (W + b·KV) / B_HBM. Tuning levers: batch size (lower → lower TPOT, lower throughput) and chunked prefill chunk size (lower → less interference).

13.3 Throughput

Total output tokens generated per second across all concurrent requests. The headline number on benchmarks. Limited by hardware: at saturation, throughput ≈ b_max · B_HBM / (W + b_max · KV_max).

These three metrics trade off:

  • Higher batch → higher throughput, higher TPOT.
  • Lower batch → lower TPOT, lower throughput.
  • More aggressive prefill admission → lower TTFT, higher TPOT spikes.

The SLO formulation looks like:

maximize throughput
s.t. TTFT_p99 ≤ X ms
     TPOT_p99 ≤ Y ms

And the operator's job is to pick the levers that satisfy the SLO at maximum throughput.

13.4 Other useful metrics

  • Goodput: throughput counting only requests that met SLO. Better than raw throughput for capacity planning; a computation sketch follows this list.
  • Queue length over time: leading indicator of saturation.
  • KV utilization: used_blocks / total_blocks. Should be high (80-95%) at saturation.
  • Preemption rate: ideally <1% of requests.
  • Prefix cache hit rate: workload-dependent; higher is free throughput.
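
Goodput is the metric most often computed incorrectly (counting requests instead of tokens, or using means instead of per-request SLO checks). A minimal sketch of the definition above; the SLO thresholds are illustrative:

from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft_s: float        # time to first token
    tpot_s: float        # mean time per output token after the first
    output_tokens: int

def goodput(stats, wall_s, ttft_slo=1.0, tpot_slo=0.1):
    met = [r for r in stats if r.ttft_s <= ttft_slo and r.tpot_s <= tpot_slo]
    return sum(r.output_tokens for r in met) / wall_s   # tokens/s within SLO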

14. Tuning vLLM (and descendants)

The names are vLLM-specific but the concepts transfer. (Field names may have evolved between versions; if a flag isn't where I describe, the equivalent exists.)

14.1 gpu_memory_utilization (e.g., 0.9)

Fraction of GPU memory the engine may use. The rest is reserved for system overhead, CUDA workspace, NCCL buffers. Default 0.9.

  • Higher (0.95): more KV blocks → larger batch → higher throughput. Risk of OOM under transient spikes.
  • Lower (0.85): more headroom, fewer KV blocks → smaller batch.

Rule of thumb: start at 0.9. If you see OOMs in production, drop to 0.85. If KV utilization is consistently < 70%, raise to 0.93.

14.2 max_num_batched_tokens (e.g., 4096)

Per-iteration token budget (sum over decode tokens + prefill chunk lengths). The chunked-prefill cap.

  • Higher: longer prefill chunks → faster TTFT but spikier TPOT.
  • Lower: smoother TPOT but TTFT suffers on long prompts.

Rule of thumb: set so one iteration's wall-clock at this token count is roughly your target TPOT. For Llama-3-70B on H100, ~2048-4096 keeps iterations under 50 ms.

14.3 max_num_seqs (e.g., 256)

Hard cap on concurrent in-flight sequences (decode batch size cap).

  • Higher: more concurrent users, higher throughput at saturation.
  • Lower: lower TPOT, fewer preemptions, less scheduling overhead.

Rule of thumb: set just above the highest batch size you actually achieve given KV budget. If KV budget supports batch=64, set max_num_seqs=80-it caps absurd over-admission without constraining real load.

14.4 block_size (e.g., 16)

Tokens per paged-attention block.

  • Smaller (8): less last-block waste, more page-table indirection overhead.
  • Larger (32): less indirection, more waste.

Rule of thumb: 16 is the universal sweet spot. Don't change unless you have unusual workloads (very short or very long sequences) and you've measured.

14.5 enable_prefix_caching (bool)

Turn block-level content hashing on. Default: on, in modern vLLM.

  • Cost: small overhead per block to compute and look up hashes; some KV memory used to retain evicted-but-cached blocks.
  • Benefit: massive speedup on shared-prefix workloads (chat, RAG, multi-turn).

Rule of thumb: leave on. Only disable if you have measured that it hurts your workload (rare-pathological case is purely random prompts).

14.6 swap_space (GiB)

CPU RAM reserved for swap-out of evicted KV-cache. Default: 4 GiB or so.

  • Nonzero: enables swap-mode preemption.
  • Zero: forces recompute-mode preemption.

Rule of thumb: set to 4-8 GiB if you have generous host memory and long contexts; 0 otherwise. Recompute is fine for most workloads.

14.7 tensor_parallel_size, pipeline_parallel_size

How to shard the model across GPUs.

  • TP (intra-layer): splits each layer across N GPUs, all-reduce after each. Low latency, NVLink-bandwidth-hungry.
  • PP (inter-layer): splits the layer stack across stages, micro-batching. Higher latency, PCIe-tolerant.

Rule of thumb: TP first, up to NVLink island size (usually 8). Add PP only when you're past one island and need to scale further. For Llama-3-70B on 2× H100: TP=2.

14.8 max_model_len (e.g., 8192)

Maximum context length. Caps the longest sequence you'll serve. KV memory budget per sequence is implied by this.

  • Higher: more flexible but reserves more pessimistic block budget.
  • Lower: tighter packing, more concurrent users.

Rule of thumb: set to your real max prompt+output length, not the model's architectural max. If you only ever serve 4K-context chats, set max_model_len=4096 and reap the batch-size benefits.
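
Putting the levers together: a hedged example of an offline vLLM engine for the 2× H100 Llama-3-70B deployment used throughout this chapter. The parameter names match recent vLLM releases, but-per the caveat at the top of this section-verify against your version's docs:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,         # §14.7: TP within one NVLink island
    gpu_memory_utilization=0.90,    # §14.1: sizes the KV block pool
    max_num_batched_tokens=4096,    # §14.2: chunked-prefill iteration budget
    max_num_seqs=128,               # §14.3: concurrent-sequence cap
    max_model_len=4096,             # §14.8: real workload max, not arch max
    enable_prefix_caching=True,     # §14.5: leave on
    swap_space=4,                   # §14.6: GiB of host RAM for swap preemption
)
outputs = llm.generate(
    ["Summarize PagedAttention in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=128),
)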


15. Practical exercises

These are designed to be done on paper. Solutions follow the spirit of §1-they're all derivable from the cost model.

Exercise 1-KV-cache pool size

You're running Llama-3-70B on a single 80 GB H100 (assume a 1-GPU deployment is somehow possible-say with 4-bit weight quantization, taking the 140 GB of BF16 weights down to ~35 GB). After weights and overhead, 40 GB is left for KV-cache. With block_size=16 and the model's KV-per-token of 320 KB across all layers:

  • How many blocks are in the pool?
  • How many tokens of total KV-cache?
  • At max_model_len=4096, how many concurrent sequences (worst case, fully-grown)?

Answer sketch: Block size in bytes (across layers) = 16 × 320 KB = 5 MB. Pool size = 40 GB / 5 MB = 8000 blocks. Tokens = 8000 × 16 = 128K tokens. Concurrent fully-grown = 128K / 4096 = 32 sequences.

Exercise 2-Decode step time

Llama-3-70B BF16 on 2× H100 (TP=2, 70 GB weights per GPU, 3.35 TB/s per GPU; use the §1.4 simplification that each GPU streams the full KV for every sequence):

  • TPOT at batch=1, S=2K?
  • TPOT at batch=32, S=2K?
  • At what batch does TPOT double vs batch=1?

Answer sketch: KV per sequence at S=2K = 640 MB. Per-GPU step time ≈ (70 GB + b × 640 MB)/3.35 TB/s. - b=1: (70 + 0.64)/3.35 ≈ 21 ms. - b=32: (70 + 20.5)/3.35 ≈ 27 ms. - Doubles when b × 640 MB ≈ 70 GB → b ≈ 110.

Exercise 3-Throughput wall

Same setup. What batch size maximizes decode throughput, given an 80 GB GPU?

Answer sketch: KV memory budget per GPU = 80 − 70 − overhead ≈ 10 GB. KV per sequence at S=2K = 640 MB. Max batch = 10 / 0.64 ≈ 15. Step time at b=15 ≈ (70 + 9.6)/3.35 ≈ 24 ms, so throughput ≈ 15 tokens / 24 ms ≈ 630 tokens/sec. That is far below the b* ≈ 110 bandwidth optimum-KV capacity, not bandwidth, is the wall. Beyond b=15 you OOM.

Exercise 4-Naive vs paged

Same model, max_model_len=4096. Compare concurrent-sequence capacity:

  • Naive contiguous allocation per request, each reserving 4K KV.
  • Paged with block_size=16, sequences average actual length 800.

Answer sketch: Naive KV per seq = 4096 × 320 KB = 1.28 GB. Capacity = 10 GB / 1.28 GB ≈ 8. Paged: each seq actually uses ceil(800/16) = 50 blocks ≈ 256 MB, plus at most one partial block of waste. Capacity = 10 GB / 256 MB ≈ 39. Roughly 5× more concurrent users.

Exercise 5-TTFT under load

Llama-3-70B prefill rate: 5000 tokens/sec/GPU at batch=1 (compute-bound, hits ~50% of FLOP roof). Prompts arrive Poisson at rate λ = 10/sec, mean prompt length 1000 tokens.

  • Prefill server load? Under what λ does the queue blow up?
  • What's expected TTFT (M/M/1 approximation) at λ=10, λ=15, λ=18?

Answer sketch: Service rate per GPU = 5000 tok/s ÷ 1000 tok/req = 5 req/s. Wait-a single prefill GPU is already overloaded at λ=10. We need TP=2 or 2 prefill servers, giving μ = 10 req/s. Even then, λ = 10, 15, 18 all have ρ = λ/μ ≥ 1, so the M/M/1 queue diverges and TTFT is unbounded. For a stable illustration, take λ=8: ρ=0.8, expected wait = ρ/(μ(1−ρ)) = 0.8/(10×0.2) = 0.4 s. Lesson: prefill capacity must be over-provisioned, especially when tail-aware.

Exercise 6-Disaggregation worth it?

Workload: 50% requests have 4K prompts and 100 output tokens (prefill-heavy); 50% have 100-token prompts and 1000 output tokens (decode-heavy). Two GPUs.

  • In co-located continuous batching, what's the bottleneck regime?
  • In disaggregated (1 prefill, 1 decode), what bottleneck shifts?

Answer sketch: Total prefill work per request avg = 0.5 × 4000 + 0.5 × 100 = 2050 tokens. Total decode tokens per request avg = 0.5 × 100 + 0.5 × 1000 = 550 tokens. In co-located serving, both compete for KV; the 4K-prompt requests admit slowly under tight KV, increasing TTFT. Disaggregated: the prefill GPU runs at 5000 tok/s and handles 5000/2050 ≈ 2.4 req/s; the decode GPU runs at maybe 2000 tok/s and handles 2000/550 ≈ 3.6 req/s. The bottleneck is prefill-add a second prefill GPU. Co-location would force you to over-provision for both regimes at once.


16. The shape of the field

The vLLM + Orca synthesis (paged attention + continuous batching) is now table stakes-every modern inference engine (TGI, TensorRT-LLM, SGLang, LMDeploy) has both. The frontier is now:

  • Better scheduling: SLO-aware, multi-tenant, fairness, deadline scheduling. SGLang's RadixAttention and constrained decoding fit here.
  • Better KV management: hierarchical KV (HBM → DRAM → SSD) à la Mooncake. Distributed prefix caches.
  • Disaggregation everywhere: prefill/decode is just the start. Some systems are exploring further splits-embedding compute, sampling, even per-layer routing.
  • Speculative & parallel decoding: Medusa, EAGLE, lookahead decoding-all integrate with paged attention via the same block-manager primitives.
  • Quantization at every layer: FP8 weights, FP8 KV-cache, INT4 weight-only-all expand the practical batch envelope by shrinking the bandwidth footprint.
  • MoE serving: routes the bandwidth equation through expert-selection, with its own batching considerations (every token activates a different expert subset → batching gains weaker).

Each of these slots into the same architecture: scheduler → block manager → executor → sampler. The mechanism is stable; the policies and the kernels evolve.


17. Cheat sheet

The two phases:

Prefill  : compute-bound.  P tokens in parallel.  T ≈ P · per-tok-prefill ≈ P · 200 µs (70B/H100).
Decode   : bandwidth-bound. 1 token per step.   T ≈ (W + b·KV)/B_HBM.

The cost model:

T_step(b, S) ≈ (W + b · K_per_tok · S) / B_HBM        [memory-bound regime]
Throughput(b) ≈ b / T_step(b, S)
                      ↑ peaks where KV-term ≈ weight-term
                      ↑ in practice capped by KV memory

The architecture:

API → Engine.step():
  1. Scheduler picks batch (running ∪ admit_some_waiting)
  2. BlockManager allocates / appends slots
  3. Executor.forward(batch, block_tables)
  4. Sampler.sample(logits)
  5. For each: append token, finish-or-continue, free blocks
  → loop

The big wins:

Continuous batching     : 2-4×    (vs request-level)
+ Paged attention       : 5-10×   (vs naive contiguous)
+ Prefix caching        : 2-5×    (workload-dependent, chat/RAG)
+ Chunked prefill       : smoother TPOT, modest throughput
+ Speculative decoding  : 2-3×    (decode only)
+ Disaggregation        : 1.5-3×  (mixed workloads, scale)

The tuning levers:

gpu_memory_utilization        ← KV pool size
max_num_batched_tokens        ← TPOT smoothness
max_num_seqs                  ← concurrent batch cap
block_size                    ← leave at 16
enable_prefix_caching         ← leave on
swap_space                    ← if long ctx + spare host RAM
tensor_parallel_size          ← within-NVLink-island
pipeline_parallel_size        ← across-island
max_model_len                 ← match real workload, not arch max

The metrics:

TTFT  = queue + prefill + first_decode
TPOT  ≈ T_step
Throughput = sum of output tokens / wall time
Goodput = throughput counting only SLO-meeting requests

18. References

If you want to consolidate (not learn from scratch-that's what this chapter is for):

  • Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP 2023. The vLLM paper. Read for the kernel details and original block-manager design.
  • Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models," OSDI 2022. Iteration-level scheduling. Predates vLLM, still the cleanest exposition of the scheduling idea.
  • Agrawal et al., "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve," OSDI 2024. Chunked prefill and the prefill/decode interference analysis.
  • Zhong et al., "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving," OSDI 2024. The disaggregation argument and KV-transfer engineering.
  • Mooncake (Moonshot AI, 2024-2025). KV-as-distributed-object, hierarchical KV cache.
  • The FlashAttention series (Dao et al., 2022, 2023, 2024)-orthogonal but always relevant for understanding the attention kernel; covered in deep dive 06.

The papers are dense in evaluation (benchmark suites, microbenchmarks). With this chapter as scaffolding, you should be able to skim those evaluations as confirmation rather than as primary learning.


19. Closing

Inference serving is not a deep-learning problem dressed up as a systems problem. It is a systems problem, full stop, with a deep-learning function as the workload. The hardness comes from:

  • Variable-length, growing, unpredictable workloads.
  • Bandwidth-bound primary cost model.
  • Multi-tenant SLOs across heterogeneous request shapes.
  • Memory hierarchies that span HBM, DRAM, and PCIe.

PagedAttention and continuous batching solved the spatial and temporal problems. Chunked prefill smoothed the regime transitions. Prefix caching exploited the workload's redundancy. Disaggregation made the regimes physical. Speculative decoding cracked the per-step bandwidth ceiling.

Each of these is one good idea, and the system that implements them is more or less forced once you accept the cost model in §1-3. That is the point of the chapter: when you understand why the equation looks the way it does, every architectural choice in vLLM and its descendants reads as inevitable.

The next deep dive (09) covers training systems-the same kind of cost-model-first analysis, applied to gradient computation, all-reduce, and pipeline schedules. After that, 10 covers speculative decoding properly, 11 covers MoE, and 12 covers the multi-modal serving extensions. They will all build on the same foundation: a clear-eyed accounting of bytes, FLOPs, and time.
