Month 8-Week 1: vLLM, KV cache, quantization (universal)¶
Week summary¶
- Goal: Universal across tracks. Stand up vLLM. Understand KV cache and continuous batching deeply. Quantize a 7B model and benchmark.
- Time: ~10 h over 3 sessions.
- Output: Working vLLM serving Llama 3.1 8B; AWQ quantization compared to fp16; benchmarks across concurrency 1–64.
- Sequences relied on: 14-inference-serving rungs 01–05, 08.
Why this week matters¶
Even if your specialty is evals or agents, you must know inference infra. Otherwise:
- You can't reason about why agent latency is high.
- You can't argue about self-host vs. API economics.
- Frontier-paper sections on training and inference cost stay opaque.
This week is universal because the bridge between AI and infra is a key part of what makes you employable.
Prerequisites¶
- GPU access (RunPod, Lambda Labs, or Modal; roughly $1/hr for an A10).
Recommended cadence¶
- Session A (Tue/Wed evening, ~3 h): mental model + first run
- Session B (Sat morning, ~4 h): KV cache + quantization deep dive
- Session C (Sun afternoon, ~3 h): benchmarks + writeup
Session A-GPU mental model + first vLLM run¶
Goal: Understand why batching helps. Stand up vLLM end-to-end.
Part 1-GPU mental model (60 min)¶
Read: Horace He's Making Deep Learning Go Brrrr From First Principles. Search "horace he making deep learning go brrrr".
Key ideas to internalize:
- GPUs pair massive parallel compute with fast HBM (memory); both matter, not just compute.
- Most ML workloads are memory-bandwidth bound, not compute bound.
- Batching helps because it amortizes each weight load over more compute.
- For LLM inference, prefill (processing the prompt) is compute-bound; decode (generating tokens one by one) is memory-bound.
Sketch: a 7B model in fp16 has 14 GB of weights. At batch=1, every generated token must read all of them, so at 1.5 TB/s HBM bandwidth you can read the weights ~107 times per second; that is the upper bound on tokens/sec for batch=1 decode.
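The arithmetic is worth scripting once so you can re-run it for other models and GPUs. A minimal sketch, using the 7B fp16 / 1.5 TB/s figures from the paragraph above:

```python
# Roofline sketch: batch=1 decode reads every weight once per token,
# so tokens/sec is bounded by HBM bandwidth / model size.
weights_gb = 7e9 * 2 / 1e9        # 7B params x 2 bytes (fp16) = 14 GB
hbm_gbps = 1500                   # 1.5 TB/s HBM bandwidth (assumed, A100-class)
max_tps = hbm_gbps / weights_gb   # upper bound on batch=1 decode speed
print(f"max decode tokens/sec ~= {max_tps:.0f}")  # ~107
```

Swap in your actual GPU's bandwidth and the model's weight size to get the ceiling your benchmarks should approach but never exceed.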
Part 2-Get GPU + install vLLM (45 min)¶
Pick: RunPod (templates with vLLM pre-installed) or Lambda Labs or Modal.
Or local Docker if you have an NVIDIA GPU:
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
(For gated models like Llama, you need a HF token; or use Qwen/Qwen2.5-7B-Instruct which is open.)
Part 3-First request (75 min)¶
vLLM exposes an OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(resp.choices[0].message.content)
Inspect:
- GPU memory usage (nvidia-smi). vLLM pre-allocates ~90% of HBM by default (--gpu-memory-utilization 0.9); beyond the weights, that space is reserved for KV cache.
- Throughput at concurrency=1: typically ~30 tokens/sec for an 8B model on an A10.
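To put a rough number on that yourself, time the request above with time.perf_counter() and read vLLM's resp.usage field. The only logic is a division; a minimal helper:

```python
def tokens_per_sec(completion_tokens: int, elapsed_s: float) -> float:
    """Naive single-request throughput: completion tokens / wall-clock time.

    Folds prompt prefill into the denominator, so it slightly understates
    pure decode speed.
    """
    return completion_tokens / elapsed_s

# e.g. with the request above: tokens_per_sec(resp.usage.completion_tokens, elapsed)
print(tokens_per_sec(200, 6.7))  # illustrative numbers: ~30 tok/s ballpark
```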
Output of Session A¶
- vLLM serving a real model.
- First request succeeds.
- Basic GPU memory observation.
Session B-KV cache + quantization¶
Goal: Understand KV cache memory math. Quantize and re-benchmark.
Part 1-KV cache math (45 min)¶
For a transformer with L layers, H KV heads, head dimension D_head, sequence length T, and batch size B, in fp16 (2 bytes per element):

KV bytes = 2 (K and V) × B × L × H × D_head × T × 2 bytes

For Llama 3.1 8B (L=32, H=32, D_head=128), batch=1, T=2048: ~1 GiB.
At T=8192: ~4 GiB. At T=32K: ~16 GiB, comparable to the fp16 weights themselves! (Llama 3.1 8B actually uses grouped-query attention with only 8 KV heads, which divides these figures by 4, but the scaling lesson stands.) This is why long contexts are hard. It's also why PagedAttention (vLLM's central innovation) matters: it manages KV memory the way an operating system manages virtual memory.
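The same math as a tiny calculator. kv_heads is a separate parameter because GQA models (including Llama 3.1) cache fewer heads than they attend with; the call below uses 32 to match the figures above:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, dtype_bytes: int = 2) -> int:
    """KV cache size: 2 (K and V) x batch x layers x kv_heads x head_dim x seq_len x bytes."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * dtype_bytes

GiB = 1024 ** 3
for T in (2048, 8192, 32768):
    print(f"T={T}: {kv_cache_bytes(32, 32, 128, T) / GiB:.0f} GiB")
```

Plug in kv_heads=8 to see the GQA version; rerun it for any model card's config and you can sanity-check vendor context-length claims in seconds.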
Part 2-Read PagedAttention (75 min)¶
Read the vLLM paper (arxiv.org/abs/2309.06180) sections 1–4.
Key ideas:
- Naive KV caching wastes memory: pre-allocating max-sequence-length slots means every sequence shorter than the max wastes the rest.
- PagedAttention stores the KV cache in fixed-size blocks, allocated on demand (like OS virtual-memory pages).
- Continuous batching admits new requests as old ones finish, instead of waiting for the whole batch to drain.
- Combined, the authors report up to ~24× the throughput of naive HF Transformers serving.
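A toy sketch of the waste argument. The request lengths are illustrative (not from the paper); 16 tokens per block is vLLM's default block size:

```python
# Naive serving pre-allocates max_seq_len KV slots per request;
# PagedAttention allocates fixed-size blocks on demand.
MAX_SEQ = 4096
BLOCK = 16  # tokens per block (vLLM's default block size)

actual_lens = [350, 1200, 80, 2048]  # illustrative request lengths

naive = len(actual_lens) * MAX_SEQ                        # slots reserved up front
paged = sum(-(-t // BLOCK) * BLOCK for t in actual_lens)  # round each up to whole blocks

print(f"naive slots: {naive}, paged slots: {paged}")
print(f"waste avoided: {1 - paged / naive:.0%}")
```

The per-request waste under paging is bounded by one block (at most 15 tokens here), versus up to the entire max_seq_len under naive pre-allocation.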
Part 3-Quantize with AWQ (60 min)¶
Quantization reduces precision (fp16 → int4). Saves ~4× memory; speeds up decode (memory-bound); negligible accuracy loss for most workloads.
Read AWQ paper (arxiv.org/abs/2306.00978) sections 1, 3.
Find an AWQ-quantized model on HF Hub (search "Llama-3.1-8B-Instruct-AWQ"). Run vLLM with `--quantization awq` and the AWQ model:
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
--model neuralmagic/Llama-3.1-8B-Instruct-AWQ-INT4 \
--quantization awq
Run the same first request. Compare:
- Weight memory: roughly 4× smaller (~5 GB instead of ~16 GB fp16). Note nvidia-smi will still show most of HBM occupied, since vLLM pre-allocates KV cache space.
- Per-token latency: faster, because decode is memory-bound.
- Output quality: usually indistinguishable on most tasks.
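Back-of-envelope for the memory delta, assuming 8B parameters (AWQ keeps embeddings and per-group scales in higher precision, so real checkpoints land a bit above the int4 floor):

```python
params = 8e9
fp16_gb = params * 2 / 1e9    # 2 bytes/param -> 16 GB
int4_gb = params * 0.5 / 1e9  # 0.5 bytes/param -> 4 GB, quantized weights only
print(f"fp16: {fp16_gb:.0f} GB, int4: {int4_gb:.0f} GB "
      f"({fp16_gb / int4_gb:.0f}x smaller)")
```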
Output of Session B¶
- KV cache memory math worked through.
- vLLM paper notes.
- AWQ-quantized model running.
Session C-Benchmarks across concurrency¶
Goal: Sweep concurrency 1–64 with fp16 vs AWQ. Capture TTFT, ITL, throughput.
Part 1-Benchmark script (60 min)¶
import asyncio, time
from openai import AsyncOpenAI

async def one_request(client, prompt):
    start = time.perf_counter()
    first_tok_at = None
    completion_tokens = 0
    # create(stream=True) must be awaited; it returns an async iterator of chunks
    stream = await client.chat.completions.create(
        model="...",  # the model name the server was started with
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=200,
    )
    async for chunk in stream:
        if not chunk.choices:  # a final usage-only chunk has no choices
            continue
        if first_tok_at is None:
            first_tok_at = time.perf_counter()
        if chunk.choices[0].delta.content:
            completion_tokens += 1
    end = time.perf_counter()
    return {
        "ttft_ms": (first_tok_at - start) * 1000,
        "total_ms": (end - start) * 1000,
        # chunks ~ tokens for vLLM streams; good enough for a relative benchmark
        "tps": completion_tokens / (end - first_tok_at),
    }

async def benchmark(concurrency: int, n_requests: int = 100):
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    sem = asyncio.Semaphore(concurrency)
    async def bounded():
        async with sem:
            return await one_request(client, "Explain photosynthesis briefly.")
    return await asyncio.gather(*(bounded() for _ in range(n_requests)))
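The per-request dicts returned by benchmark() can be summarized with the stdlib; field names here match the script above, and p95 comes from statistics.quantiles:

```python
import statistics

def summarize(results):
    """Aggregate per-request metrics into p50/p95 TTFT and mean per-request tok/s."""
    ttfts = sorted(r["ttft_ms"] for r in results)
    return {
        "ttft_p50_ms": statistics.median(ttfts),
        "ttft_p95_ms": statistics.quantiles(ttfts, n=20)[-1],  # 95th percentile
        "mean_tps_per_request": statistics.mean(r["tps"] for r in results),
    }

# e.g.: summary = summarize(await benchmark(16))
```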
Part 2-Run sweep (75 min)¶
For each (quantization, concurrency) pair, run 100 requests:
- fp16 × {1, 4, 16, 32, 64}
- awq × {1, 4, 16, 32, 64}
Capture for each:
- TTFT p50, p95.
- Aggregate throughput (tokens/sec summed across concurrent requests).
- Tokens/sec per request.
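One way to drive the sweep and persist rows for the writeup. This is a sketch: run_benchmark is the benchmark() coroutine from Part 1, and the server must be restarted between the fp16 and AWQ configs:

```python
import asyncio
import csv

CONCURRENCIES = [1, 4, 16, 32, 64]

async def sweep(run_benchmark, label: str, out_path: str) -> None:
    """Run each concurrency level and write one CSV row per level.

    run_benchmark: async callable (concurrency) -> list of per-request
    dicts with "ttft_ms" and "tps" keys, e.g. benchmark() from Part 1.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["config", "concurrency", "mean_ttft_ms", "mean_tps"]
        )
        writer.writeheader()
        for c in CONCURRENCIES:
            results = await run_benchmark(c)
            writer.writerow({
                "config": label,
                "concurrency": c,
                "mean_ttft_ms": sum(r["ttft_ms"] for r in results) / len(results),
                "mean_tps": sum(r["tps"] for r in results) / len(results),
            })

# usage: restart the server per config, then e.g.
# asyncio.run(sweep(benchmark, "fp16", "fp16.csv"))
```

Keeping one CSV per config makes the Part 3 chart a two-line pandas or matplotlib job, and the raw per-request dicts can still be re-summarized later for p95s.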
Part 3-Analyze + report (45 min)¶
Build a chart: x-axis concurrency, y-axis throughput, two lines (fp16, awq).
Likely shape:
- AWQ throughput at low concurrency: 2–3× fp16.
- AWQ throughput at high concurrency: still better, but the gap shrinks as compute becomes the bottleneck.
- AWQ TTFT: similar at low concurrency (prefill is compute-bound, so quantization helps it less).
Write up in inference-experiments/results.md. Push.
Output of Session C¶
- Comprehensive benchmark sweep.
- Results doc with chart.
End-of-week artifact¶
- vLLM serving fp16 and AWQ models
- KV cache math worked through
- Concurrency sweep with TTFT, ITL, throughput
- Results doc
End-of-week self-assessment¶
- I can compute KV cache memory for any transformer.
- I can explain why decode is memory-bound.
- I can articulate the vLLM throughput advantage.
- I have real benchmark numbers from my own deployment.
Common failure modes for this week¶
- Skipping the math. KV memory math is foundational, not optional.
- Single-config benchmarks. A sweep is informative; a point isn't.
- Trusting headline numbers without measurement. Real workloads beat marketing.
What's next (preview of M08-W02)¶
LoRA fine-tuning on a small model. PEFT + TRL. Before/after eval.