
Month 8-Week 1: vLLM, KV cache, quantization (universal)

Week summary

  • Goal: Universal across tracks. Stand up vLLM. Understand KV cache and continuous batching deeply. Quantize a 7B model and benchmark.
  • Time: ~10 h over 3 sessions.
  • Output: Working vLLM serving Llama 3.1 8B; AWQ quantization compared to fp16; benchmarks across concurrency 1–64.
  • Sequences relied on: 14-inference-serving rungs 01–05, 08.

Why this week matters

Even if your specialty is evals or agents, you must know inference infra. Otherwise:

  • You can't reason about why agent latency is high.
  • You can't argue about self-host vs API economics.
  • Frontier-paper sections about training cost and inference cost are opaque.

This week is universal because the bridge between AI and infra is a key part of what makes you employable.

Prerequisites

  • GPU access (RunPod / Lambda Labs / Modal; ~$1/hr for an A10).
  • A Hugging Face account and token if you want to serve gated models like Llama.

Schedule

  • Session A (Tue/Wed evening, ~3 h): mental model + first run
  • Session B (Sat morning, ~4 h): KV cache + quantization deep dive
  • Session C (Sun afternoon, ~3 h): benchmarks + writeup

Session A-GPU mental model + first vLLM run

Goal: Understand why batching helps. Stand up vLLM end-to-end.

Part 1-GPU mental model (60 min)

Read: Horace He's Making Deep Learning Go Brrrr From First Principles. Search "horace he making deep learning go brrrr".

Key ideas to internalize:

  • GPUs have many parallel cores AND fast HBM (memory), not just compute.
  • Most ML workloads are memory-bandwidth bound, not compute bound.
  • Batching helps because it amortizes memory loads over more compute.
  • For LLM inference, prefill (processing the prompt) is compute-bound; decode (generating tokens one by one) is memory-bound.

Sketch: for a 7B model in fp16 (14GB weights), if HBM bandwidth is 1.5 TB/s, you can read the weights ~107 times per second. That's the upper bound on tokens/sec for batch=1 decode.
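A quick back-of-the-envelope check of that bound (a sketch; the bandwidth figure is illustrative, so swap in your GPU's spec):

# Decode roofline: at batch=1, every generated token requires reading all weights once,
# so tokens/sec is bounded by HBM bandwidth divided by weight bytes.
params = 7e9               # 7B parameters
bytes_per_param = 2        # fp16
hbm_bandwidth = 1.5e12     # bytes/sec; adjust for your GPU
weight_bytes = params * bytes_per_param            # ~14 GB
print(f"decode ceiling at batch=1: ~{hbm_bandwidth / weight_bytes:.0f} tok/s")   # ~107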

Part 2-Get GPU + install vLLM (45 min)

Pick: RunPod (templates with vLLM pre-installed) or Lambda Labs or Modal.

Or local Docker if you have an NVIDIA GPU:

docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct

(For gated models like Llama you need a Hugging Face token; pass it to the container, e.g. add -e HUGGING_FACE_HUB_TOKEN=<your token> to the docker run line. Or use Qwen/Qwen2.5-7B-Instruct, which is not gated.)

Part 3-First request (75 min)

vLLM exposes an OpenAI-compatible API:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(resp.choices[0].message.content)

Inspect:

  • GPU memory usage (nvidia-smi). Most of HBM should show as used: vLLM pre-allocates the bulk of free memory for KV cache.
  • Throughput at concurrency=1: typically ~30 tokens/sec for an 8B model on an A10.

Output of Session A

  • vLLM serving a real model.
  • First request succeeds.
  • Basic GPU memory observation.

Session B-KV cache + quantization

Goal: Understand KV cache memory math. Quantize and re-benchmark.

Part 1-KV cache math (45 min)

For a transformer with L layers, H key/value heads, head dimension D_head, sequence length T, batch B, in fp16 (2 bytes per value):

KV cache memory = 2 (K and V) × L × H × D_head × T × B × 2 bytes

For a Llama-3.1-8B-sized model with full multi-head attention (L=32, H=32, D_head=128), batch=1, T=2048:

2 × 32 × 32 × 128 × 2048 × 1 × 2 bytes ≈ 1 GiB

At T=8192: ~4 GiB. At T=32K: ~16 GiB, bigger than the model itself! (Llama 3.1 8B actually uses grouped-query attention with only 8 KV heads, which divides these numbers by 4; the scaling with T is unchanged.)

This is why long contexts are hard. It's also why PagedAttention (vLLM's central innovation) matters: it manages KV memory like virtual memory in an operating system.
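A small calculator to sanity-check the KV cache numbers above (a sketch; the config values are the published Llama 3.1 8B shape, including its 8 KV heads):

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_val=2):
    # 2 tensors (K and V) per layer, each of shape [batch, n_kv_heads, seq_len, d_head]
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_val

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128
for seq_len in (2048, 8192, 32768):
    gib = kv_cache_bytes(32, 8, 128, seq_len, batch=1) / 2**30
    print(f"T={seq_len}: {gib:.2f} GiB per sequence")
# With full MHA (32 KV heads) each number would be 4× larger, matching the worked example.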

Part 2-Read PagedAttention (75 min)

Read the vLLM paper (arxiv.org/abs/2309.06180) sections 1–4.

Key ideas:

  • Naive KV cache management wastes memory: it pre-allocates space for the maximum sequence length, and shorter sequences waste the rest.
  • PagedAttention allocates KV memory in fixed-size blocks (like OS virtual-memory pages), so waste is bounded by one partially filled block per sequence.
  • Continuous batching schedules new requests as old ones finish instead of waiting for the whole batch.
  • Combined, the vLLM authors report up to 24× the throughput of naive Hugging Face Transformers serving, and 2–4× over prior serving systems such as Orca.
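To make the first two points concrete, a toy waste calculation (a sketch; the 16-token block size and the request lengths are made-up illustrative values):

import math

BLOCK_TOKENS = 16      # hypothetical page size, in tokens
MAX_SEQ = 4096         # what a naive allocator reserves per request
requests = [180, 950, 60, 2300, 400]   # actual sequence lengths (illustrative)

naive_reserved = len(requests) * MAX_SEQ
paged_reserved = sum(math.ceil(t / BLOCK_TOKENS) * BLOCK_TOKENS for t in requests)
used = sum(requests)

print(f"naive: {used}/{naive_reserved} reserved tokens used ({used / naive_reserved:.0%})")
print(f"paged: {used}/{paged_reserved} reserved tokens used ({used / paged_reserved:.0%})")
# Paged allocation wastes at most one partly filled block per sequence.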

Part 3-Quantize with AWQ (60 min)

Quantization reduces weight precision (fp16 → int4). It cuts weight memory ~4×, speeds up decode (which is memory-bound), and costs negligible accuracy for most workloads.
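Tying this back to Session A's roofline sketch (same assumptions: an illustrative bandwidth figure, ignoring activation/KV traffic and dequantization overhead):

params = 8e9                 # Llama 3.1 8B
hbm_bandwidth = 1.5e12       # bytes/sec, illustrative
for name, bytes_per_param in [("fp16", 2.0), ("int4 (AWQ)", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    ceiling = hbm_bandwidth / (params * bytes_per_param)
    print(f"{name}: ~{weight_gb:.0f} GB weights, batch=1 decode ceiling ~{ceiling:.0f} tok/s")

In practice the gap is smaller (group scales, dequantization compute, activation traffic), but this is why quantization helps decode specifically.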

Read AWQ paper (arxiv.org/abs/2306.00978) sections 1, 3.

Find an AWQ-quantized model on HF Hub (search "Llama-3.1-8B-Instruct-AWQ"). Run vLLM with --quantization awq pointing at the AWQ model:

docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model neuralmagic/Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq

Run the same first request. Compare:

  • Weight memory: roughly a quarter of fp16 (about 4–6 GB instead of ~16 GB). nvidia-smi will still show most of HBM in use, because vLLM reserves the freed space for KV cache, so check vLLM's startup log for the weight footprint.
  • Per-token decode latency: lower, since decode is memory-bound and there are fewer weight bytes to read per token.
  • Output quality: usually indistinguishable on most tasks.

Output of Session B

  • KV cache memory math worked through.
  • vLLM paper notes.
  • AWQ-quantized model running.

Session C-Benchmarks across concurrency

Goal: Sweep concurrency 1–64 with fp16 vs AWQ. Capture TTFT (time to first token), ITL (inter-token latency, roughly the inverse of per-request tokens/sec), and throughput.

Part 1-Benchmark script (60 min)

import asyncio, time
from openai import AsyncOpenAI

async def one_request(client, model, prompt):
    start = time.perf_counter()
    first_tok_at = None
    completion_tokens = 0
    # Stream so we can separate time-to-first-token (prefill) from decode.
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=200,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_tok_at is None:
                first_tok_at = time.perf_counter()
            completion_tokens += 1
    end = time.perf_counter()
    return {
        "ttft_ms": (first_tok_at - start) * 1000,
        "total_ms": (end - start) * 1000,
        "completion_tokens": completion_tokens,
        "tps": completion_tokens / (end - first_tok_at),
    }

async def benchmark(model: str, concurrency: int, n_requests: int = 100):
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    sem = asyncio.Semaphore(concurrency)
    async def bounded():
        async with sem:
            return await one_request(client, model, "Explain photosynthesis briefly.")
    return await asyncio.gather(*(bounded() for _ in range(n_requests)))

Part 2-Run sweep (75 min)

For each (quantization, concurrency) pair, run 100 requests:

  • fp16 × {1, 4, 16, 32, 64}
  • awq × {1, 4, 16, 32, 64}

Capture for each:

  • TTFT p50 and p95.
  • Throughput (tokens/sec, summed across concurrent requests).
  • Tokens/sec per request.
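A sketch of the aggregation, assuming benchmark() from Part 1 (it returns one dict per request, including completion_tokens); restart vLLM with the other model between the fp16 and AWQ runs:

import asyncio, statistics, time

async def run_point(model: str, concurrency: int):
    t0 = time.perf_counter()
    results = await benchmark(model, concurrency)
    wall_s = time.perf_counter() - t0
    ttfts = sorted(r["ttft_ms"] for r in results)
    total_tokens = sum(r["completion_tokens"] for r in results)
    return {
        "concurrency": concurrency,
        "ttft_p50_ms": statistics.median(ttfts),
        "ttft_p95_ms": statistics.quantiles(ttfts, n=20)[-1],   # 95th percentile cut point
        "throughput_tok_s": total_tokens / wall_s,              # summed across requests
        "tok_s_per_request": statistics.median(r["tps"] for r in results),
    }

async def main():
    model = "meta-llama/Llama-3.1-8B-Instruct"   # swap for the AWQ model on the second run
    for c in (1, 4, 16, 32, 64):
        print(await run_point(model, c))

asyncio.run(main())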

Part 3-Analyze + report (45 min)

Build a chart: x-axis concurrency, y-axis throughput, two lines (fp16, awq).
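For example, with matplotlib (a sketch; fp16_points and awq_points are assumed names for the lists of run_point() outputs you saved from each sweep):

import matplotlib.pyplot as plt

# fp16_points / awq_points: lists of dicts produced by run_point() above
for label, points in [("fp16", fp16_points), ("awq", awq_points)]:
    xs = [p["concurrency"] for p in points]
    ys = [p["throughput_tok_s"] for p in points]
    plt.plot(xs, ys, marker="o", label=label)

plt.xscale("log", base=2)
plt.xlabel("concurrency")
plt.ylabel("throughput (tokens/sec)")
plt.legend()
plt.savefig("inference-experiments/throughput_vs_concurrency.png", dpi=150)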

Likely shape:

  • AWQ throughput at low concurrency: 2–3× fp16.
  • AWQ throughput at high concurrency: still better, but the gap shrinks as compute becomes the bottleneck.
  • AWQ TTFT: similar at low concurrency (prefill is compute-bound, so quantization helps it less).

Write up in inference-experiments/results.md. Push.

Output of Session C

  • Comprehensive benchmark sweep.
  • Results doc with chart.

End-of-week artifact

  • vLLM serving fp16 and AWQ models
  • KV cache math worked through
  • Concurrency sweep with TTFT, ITL, throughput
  • Results doc

End-of-week self-assessment

  • I can compute KV cache memory for any transformer.
  • I can explain why decode is memory-bound.
  • I can articulate the vLLM throughput advantage.
  • I have real benchmark numbers from my own deployment.

Common failure modes for this week

  • Skipping the math. KV memory math is foundational; not optional.
  • Single-config benchmarks. A sweep is informative; a point isn't.
  • Trusting headline numbers without measurement. Real workloads beat marketing.

What's next (preview of M08-W02)

LoRA fine-tuning on a small model. PEFT + TRL. Before/after eval.
