## Week 18 - Paged Attention, Continuous Batching, vLLM

### 18.1 Conceptual Core
- The two ideas that made open-source LLM serving competitive with closed APIs:
- Paged attention (Kwon et al., SOSP 2023): manage the KV-cache like virtual memory. Fixed-size blocks (e.g., 16 tokens each) allocated from a pool; per-request page tables map logical token positions to physical blocks. Eliminates fragmentation; enables prefix sharing.
- Continuous batching (Yu et al., OSDI 2022 / Orca): instead of batching requests at start, dynamically schedule per-step. Finished requests leave the batch immediately; new requests join. Decode batches stay full.
- vLLM combines both. Result: ~5-20× higher throughput vs naive HuggingFace generation.
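The page-table analogy can be made concrete. Below is a minimal sketch (class and method names are hypothetical, not vLLM's API) of a per-request block table with the 16-token blocks mentioned above: appending a token allocates a new physical block from the shared pool only when the previous block is full, so there is no per-request over-reservation and no fragmentation.

```python
BLOCK_SIZE = 16  # tokens per KV block, as in the example above

class BlockTable:
    """Maps a request's logical token positions to physical KV blocks."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # shared pool of physical block ids
        self.blocks = []                 # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when crossing a block boundary.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop(0))
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a logical token position into (physical block, offset).
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = list(range(100))      # 100 physical blocks in the shared pool
table = BlockTable(pool)
for _ in range(17):          # 17 tokens span two 16-token blocks
    table.append_token()
# token 16 is the first token of the second block: (block 1, offset 0)
```

The attention kernel then gathers K/V through `physical_slot`-style indirection instead of assuming a contiguous cache.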
### 18.2 Mechanical Detail
- Paged attention kernel: takes Q (current step), the K/V cache pool, and per-request block tables. Each query attends to its own blocks, gathered via the block table. The kernel handles the non-contiguous K/V layout at a modest perf cost (~10%) in exchange for huge memory savings.
- Continuous batching scheduler (Orca's "iteration-level scheduling"):
- Maintain a queue of pending requests.
- Each iteration, pick a batch satisfying memory budget (sum of KV-cache sizes ≤ available pool).
- Execute one decode step; emit any finished tokens; release any finished requests' blocks.
- Loop.
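The four steps above can be sketched as a toy iteration-level scheduler (everything here is hypothetical scaffolding, not vLLM's code: `forward_step`, `blocks_needed`, and the one-block-per-request accounting are simplifications — in reality the KV footprint grows per token):

```python
from collections import deque

def serve(requests, pool_blocks, forward_step, blocks_needed):
    """Iteration-level scheduling: the decode batch is re-formed every step."""
    waiting = deque(requests)
    running, free = [], pool_blocks
    while waiting or running:
        # Admit waiting requests while their KV cache fits the free pool.
        while waiting and blocks_needed(waiting[0]) <= free:
            req = waiting.popleft()
            free -= blocks_needed(req)
            running.append(req)
        # One decode step for the whole batch; finished requests leave
        # immediately and their blocks return to the pool.
        for req in forward_step(running):
            running.remove(req)
            free += blocks_needed(req)
            yield req

# Toy demo: each request holds 1 block and finishes after `steps` decode steps.
class Req:
    def __init__(self, steps):
        self.left = steps

def step(batch):
    done = []
    for r in batch:
        r.left -= 1
        if r.left == 0:
            done.append(r)
    return done

completed = list(serve([Req(2), Req(1), Req(3)], pool_blocks=2,
                       forward_step=step, blocks_needed=lambda r: 1))
```

With `pool_blocks=2`, the third request waits in the queue and is admitted the moment the first finisher releases its block — no padding, no head-of-line blocking on the slowest request.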
- Prefill / decode disaggregation (an active research area; production-deployed in 2024-2026 by major labs):
- Prefill is compute-bound; benefits from large batches and TP.
- Decode is memory-bound; benefits from many concurrent requests sharing weights.
- Run them on different hardware tiers. The token-streaming protocol between prefill and decode workers is non-trivial.
- Prefix caching / speculative prefix caching: identical prompt prefixes share blocks. Critical for common system-prompt patterns and multi-turn dialogs.
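One way to see why prefix sharing falls out of paged KV management: hash each full block keyed on the entire prefix up to that point, and reuse the physical block on a hash hit. A sketch under that assumption (function and cache shape are hypothetical, not vLLM's implementation):

```python
import hashlib

BLOCK = 16  # tokens per KV block

def prefix_blocks(token_ids, cache):
    """Map each full 16-token block of a prompt to a physical block id,
    reusing blocks whose prefix-so-far hash is already cached."""
    blocks = []
    h = hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode())
        key = h.hexdigest()          # covers the whole prefix, not just this block
        if key not in cache:
            cache[key] = len(cache)  # allocate a fresh physical block id
        blocks.append(cache[key])
    return blocks

cache = {}
a = prefix_blocks(list(range(40)), cache)              # 2 full blocks cached
b = prefix_blocks(list(range(32)) + [99] * 16, cache)  # reuses those 2 blocks
```

Hashing the cumulative prefix (rather than each block in isolation) is what makes reuse safe: a block's KV values depend on every token before it.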
### 18.3 Lab - "vLLM Internals"
- Install vLLM. Serve a 7B model. Run a load test (`benchmark_serving.py`) at various concurrency levels.
- Read `vllm/core/scheduler.py` and `vllm/attention/backends/flash_attn.py` end-to-end. Annotate the scheduler's iteration loop.
- Build a mini-scheduler in Python (not for prod; for understanding): manage a fixed pool of KV blocks, schedule decode steps, evict on memory pressure. Use a real model forward pass via vLLM's lower-level APIs or HuggingFace.
- Compare throughput of your mini-scheduler vs vLLM proper. The gap is likely 5-20×; that gap is your education.
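For the "evict on memory pressure" part of the lab, the core data structure is a fixed pool that preempts a running request when allocation fails and returns its blocks. A hedged sketch (names and the eviction policy are hypothetical — real schedulers can also swap blocks to CPU instead of dropping them):

```python
class BlockPool:
    """Fixed pool of KV blocks with preemption on memory pressure."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.owned = {}                      # request id -> [block ids]

    def alloc(self, req_id, evict_order):
        """Grant req_id one block, preempting requests in evict_order
        (e.g. most-recently-admitted first) while the pool is empty."""
        while not self.free:
            victim = next(r for r in evict_order if r != req_id)
            # The victim loses its cache and must re-prefill when rescheduled.
            self.free.extend(self.owned.pop(victim))
            evict_order.remove(victim)
        self.owned.setdefault(req_id, []).append(self.free.pop())

pool = BlockPool(2)
pool.alloc("A", [])
pool.alloc("A", [])
pool.alloc("B", ["A"])   # pool exhausted: A is preempted, B gets a block
```

Wiring this into the mini-scheduler's admit/release loop gives you the whole preemption story in ~50 lines, which makes the real `scheduler.py` much easier to read.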
### 18.4 Idiomatic & Diagnostic Drill
- vLLM exposes Prometheus metrics. Capture: requests-running, requests-waiting, GPU-cache-usage, time-to-first-token (TTFT), time-per-output-token (TPOT). These are the SLOs for LLM serving.
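TTFT and TPOT reduce to simple arithmetic over per-token timestamps, which is worth computing yourself once on the client side before trusting any dashboard. A minimal sketch (function name hypothetical):

```python
def ttft_tpot(request_start, token_times):
    """TTFT = delay until the first token arrives;
    TPOT = mean inter-token gap over the remaining tokens."""
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None          # no decode gaps to average
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

# e.g. first token 500 ms after submit, then one token every 100 ms:
ttft, tpot = ttft_tpot(0.0, [0.5, 0.6, 0.7])
```

TTFT is dominated by queueing plus prefill; TPOT by decode batch efficiency — so the two metrics stress different halves of the scheduler.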
### 18.5 Production Slice
- Tune vLLM for your workload: `gpu-memory-utilization`, `max-num-batched-tokens`, `enable-prefix-caching`, `swap-space`. Each is a real lever. Document chosen values + rationale in a `SERVING.md`.
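For reference, a launch command wiring those levers together (the numbers are illustrative placeholders, not recommendations — the lab's load test is what justifies each value):

```shell
vllm serve <model> \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --swap-space 4
```

A `SERVING.md` entry per flag ("0.90 because headroom for CUDA graphs; 8192 because prefill saturates at our median prompt length") is what makes the tuning reproducible.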