
Week 18 - Paged Attention, Continuous Batching, vLLM

18.1 Conceptual Core

  • The two ideas that made open-source LLM serving competitive with closed APIs:
  • Paged attention (Kwon et al., SOSP 2023): manage the KV-cache like virtual memory. Fixed-size blocks (e.g., 16 tokens each) are allocated from a pool; per-request page tables map logical token positions to physical blocks. This eliminates fragmentation and enables prefix sharing (a toy block-table sketch follows this list).
  • Continuous batching (Yu et al., OSDI 2022 / Orca): instead of forming a batch once when requests arrive, schedule at every decode step. Finished requests leave the batch immediately; new requests join. Decode batches stay full.
  • vLLM combines both. Result: ~5-20× higher throughput vs naive HuggingFace generation.
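
To make the virtual-memory analogy concrete, here is a minimal sketch in plain Python (no GPU, no tensors; BlockPool and BlockTable are illustrative names for this sketch, not vLLM's actual classes) of per-request block tables mapping logical token positions into a shared physical pool:

```python
BLOCK_SIZE = 16  # tokens per KV block, matching vLLM's default

class BlockPool:
    """Fixed pool of physical KV blocks, handed out on demand."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("KV pool exhausted")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)

class BlockTable:
    """Per-request map: logical token position -> (physical block, offset)."""
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.blocks: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one fills up,
        # so waste is bounded by one partial block per request.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.pool.alloc())
        self.num_tokens += 1

    def locate(self, pos: int) -> tuple[int, int]:
        # The "page table walk" the paged-attention kernel performs.
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = BlockPool(num_blocks=8)
req = BlockTable(pool)
for _ in range(20):                   # a 20-token request spans 2 blocks
    req.append_token()
print(req.locate(0), req.locate(19))  # physical blocks need not be contiguous
```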

18.2 Mechanical Detail

  • Paged attention kernel: takes Q (current step), the K/V cache pool, and per-request block tables. Each query attends to its own blocks, gathered via the block table. The kernel handles non-contiguous K/V at a modest perf cost (~10%) in exchange for huge memory savings.
  • Continuous batching scheduler (Orca's "iteration-level scheduling"; a toy loop is sketched after this list):
  • Maintain a queue of pending requests.
  • Each iteration, pick a batch satisfying memory budget (sum of KV-cache sizes ≤ available pool).
  • Execute one decode step; emit any finished tokens; release any finished requests' blocks.
  • Loop.
  • Prefill / decode disaggregation (an active research area; production-deployed in 2024-2026 by major labs):
  • Prefill is compute-bound; benefits from large batches and TP.
  • Decode is memory-bound; benefits from many concurrent requests sharing weights.
  • Run them on different hardware tiers. The hand-off between tiers (streaming the prefill worker's KV cache to the decode worker) is non-trivial.
  • Prefix caching / speculative prefix caching: identical prompt prefixes share blocks. Critical for common system-prompt patterns and multi-turn dialogs (see the hashing sketch below).
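
A toy version of that iteration loop, in plain Python. Request, model_step, and the block accounting here are stand-ins for this sketch, not vLLM's real interfaces:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

    def kv_blocks_needed(self, block_size: int = 16) -> int:
        total = self.prompt_len + self.generated + 1  # +1 for the next token
        return -(-total // block_size)                # ceil division

def model_step(batch: list[Request]) -> list[Request]:
    """Stand-in for one fused decode step over the whole batch."""
    finished = []
    for r in batch:
        r.generated += 1
        if r.generated >= r.max_new_tokens:
            finished.append(r)
    return finished

def serve(pending: deque, pool_blocks: int) -> None:
    running: list[Request] = []
    while pending or running:
        # Admission: pull waiting requests while the KV pool has room.
        used = sum(r.kv_blocks_needed() for r in running)
        while pending and used + pending[0].kv_blocks_needed() <= pool_blocks:
            req = pending.popleft()
            used += req.kv_blocks_needed()
            running.append(req)
        # One decode step for the whole batch; finished requests leave
        # immediately, returning their blocks to the pool.
        for done in model_step(running):
            running.remove(done)

serve(deque(Request(prompt_len=100, max_new_tokens=50) for _ in range(32)),
      pool_blocks=256)
```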
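
And a sketch of the prefix-caching idea: KV blocks become content-addressed, keyed on their tokens plus the entire prefix before them, so identical leading blocks resolve to the same physical block. The dict-based scheme below illustrates the concept; it is not vLLM's implementation:

```python
BLOCK_SIZE = 4                 # small for illustration; vLLM defaults to 16
cache: dict[tuple, int] = {}   # (prefix, block tokens) -> physical block id
next_block = 0

def blocks_for(tokens: list[int]) -> list[int]:
    global next_block
    table, prefix = [], ()
    # Only full blocks are shareable; a trailing partial block stays private.
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        chunk = tuple(tokens[i:i + BLOCK_SIZE])
        key = (prefix, chunk)        # identity = contents + everything before
        if key not in cache:
            cache[key] = next_block  # miss: compute KV, claim a fresh block
            next_block += 1
        table.append(cache[key])
        prefix += chunk
    return table

system = [1, 2, 3, 4, 5, 6, 7, 8]             # shared system prompt
print(blocks_for(system + [9, 10, 11, 12]))   # [0, 1, 2]
print(blocks_for(system + [13, 14, 15, 16]))  # [0, 1, 3]: two blocks reused
```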

18.3 Lab: "vLLM Internals"

  1. Install vLLM. Serve a 7B model. Run a load test (benchmark_serving.py) at various concurrency levels.
  2. Read vllm/core/scheduler.py and vllm/attention/backends/flash_attn.py end-to-end. Annotate the scheduler's iteration loop.
  3. Build a mini-scheduler in Python (not for prod; for understanding): it manages a fixed pool of KV blocks, schedules decode steps, and evicts on memory pressure (a preemption sketch follows this list). Use a real model forward pass via vLLM's lower-level APIs or HuggingFace.
  4. Compare the throughput of your mini-scheduler vs vLLM proper. The gap is likely 5-20×; that gap is your education.
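
For step 3, the subtle part is eviction. One minimal policy, extending the toy scheduler sketched in 18.2 (reusing its Request class and deque-based pending queue; recompute-style preemption of the most recently admitted request is an assumption here, loosely modeled on vLLM's recompute mode; vLLM can alternatively swap blocks to CPU memory):

```python
def ensure_capacity(running, pending, pool_blocks):
    """Preempt the newest requests until the running set fits the KV pool.

    Call at the top of each scheduler iteration, before admission. Needed
    because running requests keep growing: a request that fit at admission
    time can later push the pool past its budget.
    """
    while sum(r.kv_blocks_needed() for r in running) > pool_blocks:
        victim = running.pop()        # newest request: least compute to redo
        pending.appendleft(victim)    # back to the head of the wait queue;
        # a real engine would free the victim's blocks here and, on
        # re-admission, re-prefill prompt + already-generated tokens to
        # rebuild the KV it dropped.
```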

18.4 Idiomatic & Diagnostic Drill

  • vLLM exposes Prometheus metrics. Capture: requests-running, requests-waiting, GPU-cache-usage, time-to-first-token (TTFT), time-per-output-token (TPOT). These are the core service-level metrics for LLM serving (a scrape sketch follows).
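
A quick way to eyeball these during a load test (assumes a vLLM OpenAI-compatible server on the default port; the metric names match recent vLLM releases but have changed across versions, so verify against your server's actual /metrics output):

```python
import urllib.request

WATCH = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
    "vllm:time_to_first_token_seconds",    # TTFT histogram
    "vllm:time_per_output_token_seconds",  # TPOT histogram
)

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        # Skip Prometheus HELP/TYPE comments; print matching samples.
        if not line.startswith("#") and line.startswith(WATCH):
            print(line)
```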

18.5 Production Slice

  • Tune vLLM for your workload: gpu-memory-utilization, max-num-batched-tokens, enable-prefix-caching, swap-space. Each is a real lever (see the sketch below). Document chosen values + rationale in a SERVING.md.
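
The same levers are available as engine arguments in the offline Python API as well as server --flags. The values below are placeholders to be replaced with measured ones, not recommendations, and the model name is just an example:

```python
from vllm import LLM

# Mirrors the --gpu-memory-utilization / --max-num-batched-tokens /
# --enable-prefix-caching / --swap-space server flags. Placeholder values:
# derive yours from the 18.3 load tests and record them in SERVING.md.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # example 7B model; swap in your own
    gpu_memory_utilization=0.90,       # fraction of VRAM the engine may claim
    max_num_batched_tokens=8192,       # per-iteration token budget
    enable_prefix_caching=True,        # share KV blocks across common prefixes
    swap_space=4,                      # GiB of CPU RAM for swapped-out blocks
)
```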
