## Week 18 - Paged Attention, Continuous Batching, vLLM

### 18.1 Conceptual Core
- The two ideas that made open-source LLM serving competitive with closed APIs:
- Paged attention (Kwon et al., SOSP 2023): manage the KV-cache like virtual memory. Fixed-size blocks (e.g., 16 tokens each) allocated from a pool; per-request page tables map logical token positions to physical blocks. Eliminates fragmentation; enables prefix sharing.
- Continuous batching (Yu et al., OSDI 2022 / Orca): instead of batching requests at start, dynamically schedule per-step. Finished requests leave the batch immediately; new requests join. Decode batches stay full.
- vLLM combines both. Result: ~5-20× higher throughput vs naive HuggingFace generation.
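The page-table analogy can be made concrete. Below is a minimal sketch (class and method names are hypothetical, not vLLM's API) of a per-request block table with the 16-token blocks mentioned above: appending a token allocates a new physical block from the shared pool only when the previous block is full, so there is no per-request over-reservation and no fragmentation.

```python
BLOCK_SIZE = 16  # tokens per KV block, as in the example above

class BlockTable:
    """Maps a request's logical token positions to physical KV blocks."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # shared pool of physical block ids
        self.blocks = []                 # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when crossing a block boundary.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop(0))
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a logical token position into (physical block, offset).
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = list(range(100))      # 100 physical blocks in the shared pool
table = BlockTable(pool)
for _ in range(17):          # 17 tokens span two 16-token blocks
    table.append_token()
# token 16 is the first token of the second block: (block 1, offset 0)
```

The attention kernel then gathers K/V through `physical_slot`-style indirection instead of assuming a contiguous cache.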
### 18.2 Mechanical Detail
- Paged attention kernel: takes Q (current step), the K/V cache pool, and per-request block tables. Each query attends to its own blocks, gathered via the block table. The kernel handles the non-contiguous K/V layout at a modest perf cost (~10%) in exchange for huge memory savings.
- Continuous batching scheduler (Orca's "iteration-level scheduling"):
- Maintain a queue of pending requests.
- Each iteration, pick a batch satisfying memory budget (sum of KV-cache sizes ≤ available pool).
- Execute one decode step; emit any finished tokens; release any finished requests' blocks.
- Loop.
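The four steps above can be sketched as a toy iteration-level scheduler (everything here is hypothetical scaffolding, not vLLM's code: `forward_step`, `blocks_needed`, and the one-block-per-request accounting are simplifications — in reality the KV footprint grows per token):

```python
from collections import deque

def serve(requests, pool_blocks, forward_step, blocks_needed):
    """Iteration-level scheduling: the decode batch is re-formed every step."""
    waiting = deque(requests)
    running, free = [], pool_blocks
    while waiting or running:
        # Admit waiting requests while their KV cache fits the free pool.
        while waiting and blocks_needed(waiting[0]) <= free:
            req = waiting.popleft()
            free -= blocks_needed(req)
            running.append(req)
        # One decode step for the whole batch; finished requests leave
        # immediately and their blocks return to the pool.
        for req in forward_step(running):
            running.remove(req)
            free += blocks_needed(req)
            yield req

# Toy demo: each request holds 1 block and finishes after `steps` decode steps.
class Req:
    def __init__(self, steps):
        self.left = steps

def step(batch):
    done = []
    for r in batch:
        r.left -= 1
        if r.left == 0:
            done.append(r)
    return done

completed = list(serve([Req(2), Req(1), Req(3)], pool_blocks=2,
                       forward_step=step, blocks_needed=lambda r: 1))
```

With `pool_blocks=2`, the third request waits in the queue and is admitted the moment the first finisher releases its block — no padding, no head-of-line blocking on the slowest request.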
- Prefill / decode disaggregation (an active research area; production-deployed in 2024-2026 by major labs):
- Prefill is compute-bound; benefits from large batches and TP.
- Decode is memory-bound; benefits from many concurrent requests sharing weights.
- Run them on different hardware tiers. The token-streaming protocol between prefill and decode workers is non-trivial.
- Prefix caching / speculative prefix caching: identical prompt prefixes share blocks. Critical for common system-prompt patterns and multi-turn dialogs.
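One way to see why prefix sharing falls out of paged KV management: hash each full block keyed on the entire prefix up to that point, and reuse the physical block on a hash hit. A sketch under that assumption (function and cache shape are hypothetical, not vLLM's implementation):

```python
import hashlib

BLOCK = 16  # tokens per KV block

def prefix_blocks(token_ids, cache):
    """Map each full 16-token block of a prompt to a physical block id,
    reusing blocks whose prefix-so-far hash is already cached."""
    blocks = []
    h = hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode())
        key = h.hexdigest()          # covers the whole prefix, not just this block
        if key not in cache:
            cache[key] = len(cache)  # allocate a fresh physical block id
        blocks.append(cache[key])
    return blocks

cache = {}
a = prefix_blocks(list(range(40)), cache)              # 2 full blocks cached
b = prefix_blocks(list(range(32)) + [99] * 16, cache)  # reuses those 2 blocks
```

Hashing the cumulative prefix (rather than each block in isolation) is what makes reuse safe: a block's KV values depend on every token before it.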
### 18.3 Lab - "vLLM Internals"
- Install vLLM. Serve a 7B model. Run a load test (`benchmark_serving.py`) at various concurrency levels.
- Read `vllm/core/scheduler.py` and `vllm/attention/backends/flash_attn.py` end-to-end. Annotate the scheduler's iteration loop.
- Build a mini-scheduler in Python (not for prod; for understanding): manage a fixed pool of KV blocks, schedule decode steps, evict on memory pressure. Use a real model forward pass via vLLM's lower-level APIs or HuggingFace.
- Compare throughput of your mini-scheduler vs vLLM proper. The gap is likely 5-20×; that gap is your education.
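For the "evict on memory pressure" part of the lab, the core data structure is a fixed pool that preempts a running request when allocation fails and returns its blocks. A hedged sketch (names and the eviction policy are hypothetical — real schedulers can also swap blocks to CPU instead of dropping them):

```python
class BlockPool:
    """Fixed pool of KV blocks with preemption on memory pressure."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.owned = {}                      # request id -> [block ids]

    def alloc(self, req_id, evict_order):
        """Grant req_id one block, preempting requests in evict_order
        (e.g. most-recently-admitted first) while the pool is empty."""
        while not self.free:
            victim = next(r for r in evict_order if r != req_id)
            # The victim loses its cache and must re-prefill when rescheduled.
            self.free.extend(self.owned.pop(victim))
            evict_order.remove(victim)
        self.owned.setdefault(req_id, []).append(self.free.pop())

pool = BlockPool(2)
pool.alloc("A", [])
pool.alloc("A", [])
pool.alloc("B", ["A"])   # pool exhausted: A is preempted, B gets a block
```

Wiring this into the mini-scheduler's admit/release loop gives you the whole preemption story in ~50 lines, which makes the real `scheduler.py` much easier to read.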
### 18.4 Idiomatic & Diagnostic Drill
- vLLM exposes Prometheus metrics. Capture: requests-running, requests-waiting, GPU-cache-usage, time-to-first-token (TTFT), time-per-output-token (TPOT). These are the SLOs for LLM serving.
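TTFT and TPOT reduce to simple arithmetic over per-token timestamps, which is worth computing yourself once on the client side before trusting any dashboard. A minimal sketch (function name hypothetical):

```python
def ttft_tpot(request_start, token_times):
    """TTFT = delay until the first token arrives;
    TPOT = mean inter-token gap over the remaining tokens."""
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None          # no decode gaps to average
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

# e.g. first token 500 ms after submit, then one token every 100 ms:
ttft, tpot = ttft_tpot(0.0, [0.5, 0.6, 0.7])
```

TTFT is dominated by queueing plus prefill; TPOT by decode batch efficiency — so the two metrics stress different halves of the scheduler.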
### 18.5 Production Slice
- Tune vLLM for your workload: `gpu-memory-utilization`, `max-num-batched-tokens`, `enable-prefix-caching`, `swap-space`. Each is a real lever. Document chosen values + rationale in a `SERVING.md`.
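For reference, a launch command wiring those levers together (the numbers are illustrative placeholders, not recommendations — the lab's load test is what justifies each value):

```shell
vllm serve <model> \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --swap-space 4
```

A `SERVING.md` entry per flag ("0.90 because headroom for CUDA graphs; 8192 because prefill saturates at our median prompt length") is what makes the tuning reproducible.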