14-Inference & Serving

Why this matters in the journey

This is where backend engineering meets AI. Inference servers like vLLM are distributed systems with GPUs and a transformer-shaped workload, which is exactly where your existing skills converge. You don't have to write CUDA, but you must understand the GPU mental model, KV caching, batching, and quantization to be credible. This sequence is essential for Q3 Track C and useful for everyone.

The rungs

Rung 01-The GPU mental model (light touch)

  • What: GPUs have many cores, fast HBM (high-bandwidth memory), tiny SRAM. Memory bandwidth, not compute, is usually the bottleneck.
  • Why it earns its place: "Why is this slow?" almost always answers to memory bandwidth. Knowing this is everything.
  • Resource: Making Deep Learning Go Brrrr From First Principles by Horace He (search "horace he making deep learning go brrr").
  • Done when: You can explain why batching helps GPU utilization; the back-of-envelope sketch below works through the numbers.
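
A back-of-envelope roofline makes this concrete. The hardware numbers below are rough A100-class figures and the model is assumed to be a dense 7B in fp16 (assumptions, not measurements): each decode step has to stream every weight from HBM regardless of batch size, so at small batches the GPU spends almost all its time moving bytes, and adding requests to the batch is nearly free until compute catches up.

```python
# Back-of-envelope sketch (assumed rough A100-class numbers): why decode is
# memory-bandwidth-bound at small batch sizes and why batching raises utilization.
params = 7e9               # dense 7B model
bytes_per_param = 2        # fp16
hbm_bandwidth = 2.0e12     # ~2 TB/s HBM bandwidth (assumption)
peak_flops = 312e12        # ~312 TFLOP/s fp16 tensor-core peak (assumption)

weight_bytes = params * bytes_per_param
weight_stream_time = weight_bytes / hbm_bandwidth        # per decode step, any batch size
flops_per_token = 2 * params                             # ~2 FLOPs per parameter per token

for batch in (1, 8, 64):
    compute_time = batch * flops_per_token / peak_flops
    step_time = max(weight_stream_time, compute_time)    # crude roofline: slower side wins
    print(f"batch={batch:3d}  weight traffic={weight_stream_time*1e3:5.1f} ms  "
          f"compute={compute_time*1e3:5.2f} ms  tokens/s ≈ {batch/step_time:6.0f}")
```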

Rung 02-KV cache

  • What: Transformer decoding caches the keys and values of past tokens to avoid recomputation. Massive speedup; massive memory.
  • Why it earns its place: KV cache is the central data structure of inference servers. PagedAttention (vLLM) is a KV-cache management innovation.
  • Resource: Hugging Face blog "How to make LLMs faster" (search "huggingface llm inference"). Plus the vLLM paper for PagedAttention.
  • Done when: You can explain how KV cache memory grows with sequence length and batch size (see the sizing sketch below).
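
A quick sizing sketch shows the growth. The shapes below are Llama-2-7B-style (32 layers, 32 KV heads, head dimension 128, fp16) and are assumptions you can swap for your own model; the point is that KV cache is linear in both sequence length and batch size, which is why it, not the weights, is usually what runs you out of memory.

```python
# Hedged sizing sketch: KV cache bytes = 2 (K and V) x layers x KV heads x head_dim
# x bytes/element, per token. Shapes below assume a Llama-2-7B-style model in fp16.
def kv_cache_bytes(seq_len, batch, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_el=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el   # ~0.5 MiB/token here
    return per_token * seq_len * batch

for seq_len in (1024, 4096, 32768):
    for batch in (1, 16):
        gib = kv_cache_bytes(seq_len, batch) / 2**30
        print(f"seq_len={seq_len:6d}  batch={batch:2d}  KV cache ≈ {gib:7.1f} GiB")
```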

Rung 03-vLLM and PagedAttention

  • What: vLLM serves LLMs with a paged KV cache (memory managed like OS virtual memory), continuous batching, and prefix caching.
  • Why it earns its place: vLLM is the de facto open inference server. SGLang and TensorRT-LLM are alternatives.
  • Resource: vLLM paper (arxiv.org/abs/2309.06180). vLLM docs (docs.vllm.ai).
  • Done when: You can deploy a 7B model on vLLM and serve a request; a minimal sketch follows.
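
A minimal sketch of that "done when" using vLLM's offline Python API, assuming a single GPU with enough memory and that the model name below is one you can download (swap in any Hugging Face model you have access to):

```python
# Minimal vLLM sketch (pip install vllm). Model name and sampling settings are
# placeholder assumptions; the calls use vLLM's offline LLM interface.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")        # loads weights, allocates the paged KV cache
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```

For an HTTP deployment, recent vLLM versions also ship an OpenAI-compatible server (`vllm serve <model>`, or `python -m vllm.entrypoints.openai.api_server`); the benchmark sketch under rung 08 assumes that kind of endpoint.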

Rung 04-Continuous batching

  • What: Naive batching waits for all sequences in a batch to finish. Continuous batching swaps in new requests as old ones finish.
  • Why it earns its place: Continuous batching is the main reason modern inference servers handle an order of magnitude more traffic than static-batching ones.
  • Resource: Anyscale blog "How continuous batching enables 23x throughput" (search "anyscale continuous batching").
  • Done when: You can explain why continuous batching is more efficient than static batching; the toy simulation below makes the gap concrete.
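
A toy simulation (not a real scheduler, just counting decode steps) illustrates the difference: with static batching the whole batch waits for its longest request, while continuous batching refills a slot the moment a request finishes.

```python
# Toy simulation: the "GPU" performs one decode step per time unit for every
# request currently in the batch; the batch has a fixed number of slots.
import random

random.seed(0)
lengths = [random.randint(10, 200) for _ in range(64)]   # output tokens per request
slots = 8

def static_batching(lengths, slots):
    # admit a fresh batch only after the longest request in the current batch finishes
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching(lengths, slots):
    # a finished request's slot is handed to a queued request on the next step
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps

total_tokens = sum(lengths)
for name, fn in [("static", static_batching), ("continuous", continuous_batching)]:
    steps = fn(lengths, slots)
    print(f"{name:10s}: {steps:5d} steps, ~{total_tokens / steps:.1f} tokens per step")
```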

Rung 05-Quantization

  • What: Reduce model precision from fp16 to int8/int4/etc. Major techniques: GPTQ, AWQ, SmoothQuant, GGUF (for llama.cpp).
  • Why it earns its place: Quantization makes self-hosting feasible. A 4-bit Llama 70B fits where fp16 wouldn't (the sizing arithmetic below shows why).
  • Resource: GPTQ paper (arxiv.org/abs/2210.17323). AWQ paper (arxiv.org/abs/2306.00978). HF blog posts on bitsandbytes / AutoGPTQ.
  • Done when: You've quantized a model with AWQ and benchmarked latency / accuracy vs fp16.
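
The weight-size arithmetic behind "a 4-bit 70B fits where fp16 wouldn't" is simple. Treat the figures below as lower bounds: real GPTQ/AWQ checkpoints carry scale and zero-point metadata and usually keep embeddings and the LM head at higher precision, and you still need headroom for the KV cache (rung 02) and activations.

```python
# Hedged back-of-envelope: weight footprint at different precisions (lower bounds).
def weight_gib(params, bits):
    return params * bits / 8 / 2**30

for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for bits in (16, 8, 4):
        print(f"{name:>3s} model, {bits:2d}-bit weights ≈ {weight_gib(params, bits):6.1f} GiB")
```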

Rung 06-Speculative decoding

  • What: Use a small "draft" model to propose tokens, large model to verify. Speeds up generation 2–3×.
  • Why it earns its place: It's a standard technique in modern inference servers; you need architectural awareness of it even if you never implement it yourself.
  • Resource: Speculative Decoding paper (Leviathan et al., arxiv.org/abs/2211.17192). Plus the Medusa paper (arxiv.org/abs/2401.10774).
  • Done when: You can explain why speculative decoding doesn't change the output distribution (it's exact); the toy sketch below shows the greedy special case.
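
A toy sketch of the greedy special case, where verification is just an argmax comparison (the sampled version uses rejection sampling to stay exact). The `draft_next` and `target_next` callables are stand-ins for a small and a large model, not real library APIs; a real implementation verifies all k draft positions in a single batched target forward pass, which is where the speedup comes from.

```python
# Toy sketch: greedy speculative decoding. Accepting a draft token iff the target
# model's argmax at that position agrees reproduces the target model's greedy output
# exactly, just with fewer target passes whenever the draft model guesses right.
def speculative_decode(prompt, draft_next, target_next, k=4, max_new_tokens=32):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) draft k tokens cheaply with the small model
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) verify: keep the longest prefix the target model agrees with
        #    (a real server checks all k positions in one batched forward pass)
        accepted = 0
        for i in range(k):
            if target_next(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3) always emit one token from the target model: the correction at the
        #    first disagreement, or a "bonus" token if everything was accepted
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]
```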

Rung 07-Prefill vs decode

  • What: Prefill processes the whole prompt in parallel (compute-bound). Decode generates tokens one-by-one (memory-bound). Different scaling characteristics.
  • Why it earns its place: Inference systems schedule prefill and decode separately. Different bottlenecks, different optimizations.
  • Resource: Read sections of the SGLang paper (arxiv.org/abs/2312.07104) and the vLLM scheduling docs.
  • Done when: You can explain why prefill is compute-bound and decode is memory-bound; the arithmetic-intensity sketch below runs the numbers.
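
The same rough A100-class assumptions as rung 01 give a quick arithmetic-intensity argument: prefill amortizes one read of the weights over the whole prompt, while decode amortizes it over a single token.

```python
# Hedged sketch: arithmetic intensity (FLOPs per byte moved from HBM) vs the GPU's
# "ridge point" (peak_flops / bandwidth). Above the ridge you are compute-bound,
# below it memory-bound. Hardware numbers are rough A100-class assumptions.
params, bytes_per_param = 7e9, 2
peak_flops, hbm_bandwidth = 312e12, 2.0e12
ridge = peak_flops / hbm_bandwidth                       # ~156 FLOPs/byte

def intensity(tokens_this_pass):
    flops = 2 * params * tokens_this_pass                # ~2 FLOPs per parameter per token
    bytes_moved = params * bytes_per_param               # weights read once per pass (ignores KV/activations)
    return flops / bytes_moved

print(f"ridge point                 ≈ {ridge:4.0f} FLOPs/byte")
print(f"prefill, 1024-token prompt  ≈ {intensity(1024):4.0f} FLOPs/byte -> compute-bound")
print(f"decode, 1 token per step    ≈ {intensity(1):4.0f} FLOPs/byte -> memory-bound")
```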

Rung 08-Latency, throughput, and tradeoffs

  • What: First-token latency (TTFT), inter-token latency (ITL), tokens/sec, requests/sec. Each optimization affects them differently.
  • Why it earns its place: Picking the right optimization requires knowing what metric your product cares about.
  • Resource: vLLM benchmarking docs. Plus the Hugging Face "Optimum" benchmarks.
  • Done when: You can run a benchmark on your own deployment and report TTFT/ITL/throughput; a client-side measurement sketch follows.
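
A client-side measurement sketch against an OpenAI-compatible endpoint such as the one `vllm serve` exposes. The base URL, API key, and model name are assumptions to replace with your own deployment, and it counts streamed chunks rather than exact tokens.

```python
# Hedged sketch: measure TTFT and inter-chunk latency by timestamping a streamed
# response from an OpenAI-compatible endpoint (pip install openai).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumption: local vLLM

start = time.perf_counter()
arrivals = []
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",                   # assumption: whatever you deployed
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        arrivals.append(time.perf_counter())

ttft = arrivals[0] - start
gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
print(f"TTFT ≈ {ttft * 1000:.0f} ms")
print(f"mean inter-chunk latency ≈ {1000 * sum(gaps) / len(gaps):.1f} ms over {len(arrivals)} chunks")
print(f"output rate ≈ {len(arrivals) / (arrivals[-1] - start):.1f} chunks/s")
```

Run the same measurement at several concurrency levels to see the latency/throughput tension this rung is about.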

Rung 09-Multi-tenant serving and request scheduling

  • What: Priority classes, fairness, isolation, hot/cold model swapping, autoscaling.
  • Why it earns its place: Production inference is multi-tenant. Your distributed-systems instincts are gold here.
  • Resource: vLLM scheduler source code. Plus the LMDeploy and TensorRT-LLM design docs.
  • Done when: You can articulate the tradeoff between throughput and fairness in a multi-tenant setup; the toy scheduler below shows the fairness side of it.
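
A toy sketch of the fairness side, not taken from any real server: pick the next request from the tenant that has consumed the fewest tokens so far (a crude fair-queuing policy), instead of whatever ordering maximizes batch occupancy. Names and structure are purely illustrative.

```python
# Toy fair-queuing sketch: least-served tenant goes next. A throughput-greedy
# scheduler would instead pick whatever packs the batch best, at the risk of
# starving a tenant whose requests batch poorly.
from collections import defaultdict

class FairScheduler:
    def __init__(self):
        self.queues = defaultdict(list)        # tenant -> pending requests (FIFO per tenant)
        self.served_tokens = defaultdict(int)  # tenant -> tokens generated so far

    def submit(self, tenant, request):
        self.queues[tenant].append(request)

    def next_request(self):
        tenants_with_work = [t for t, q in self.queues.items() if q]
        if not tenants_with_work:
            return None
        tenant = min(tenants_with_work, key=lambda t: self.served_tokens[t])
        return tenant, self.queues[tenant].pop(0)

    def record(self, tenant, tokens_generated):
        self.served_tokens[tenant] += tokens_generated
```

The throughput cost appears when the least-served tenant's requests batch poorly; a throughput-greedy scheduler would reorder them at the price of potential starvation.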

Rung 10-Edge / local inference

  • What: llama.cpp, MLX (Apple), ONNX Runtime, mobile inference. Different hardware, different tradeoffs.
  • Why it earns its place: Some applications need edge inference for privacy / latency / cost. Knowing the landscape is breadth.
  • Resource: llama.cpp GitHub (ggerganov/llama.cpp). MLX docs from Apple.
  • Done when: You've run a small model locally on llama.cpp or MLX; a llama.cpp starting point follows.
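
A starting point via the llama-cpp-python bindings, assuming you've downloaded a small quantized GGUF checkpoint (the path below is a placeholder):

```python
# Minimal local-inference sketch (pip install llama-cpp-python). The GGUF path is
# an assumption: point model_path at any small quantized model you've downloaded.
from llama_cpp import Llama

llm = Llama(model_path="./models/small-instruct-q4_k_m.gguf", n_ctx=2048)
out = llm("Q: What is a KV cache? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```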

Rung 11-Self-hosted economics

  • What: When does self-hosting beat the API? It's a function of traffic volume, latency requirements, privacy constraints, and whether an open-weights model meets your quality bar.
  • Why it earns its place: This is the architectural decision that drives whether your team owns infrastructure.
  • Resource: Various blog posts comparing API vs self-hosted (search "self-hosted vs api llm cost"). Plus your own benchmarking from rung 08.
  • Done when: You can write a memo defending self-hosting (or API) for a specific use case with numbers; the break-even sketch below is a skeleton for those numbers.
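
A break-even skeleton for the memo. Every number below is a placeholder assumption; plug in your API pricing, your GPU rental rate, the throughput you measured in rung 08, and a realistic utilization figure. It also ignores engineering time, redundancy, and idle capacity, which the memo should call out explicitly.

```python
# Hedged break-even sketch: self-hosted $/1M tokens vs API $/1M tokens.
api_price_per_m_tokens = 3.00     # $ per 1M output tokens (placeholder)
gpu_cost_per_hour = 2.50          # $ per GPU-hour, rented (placeholder)
measured_throughput = 2500        # output tokens/s at your batch sizes (from rung 08)
utilization = 0.4                 # fraction of the day the GPU is actually busy (placeholder)

tokens_per_hour = measured_throughput * 3600 * utilization
self_host_per_m = gpu_cost_per_hour / tokens_per_hour * 1e6
breakeven_tokens_per_hour = gpu_cost_per_hour / api_price_per_m_tokens * 1e6

print(f"self-hosted cost ≈ ${self_host_per_m:.2f} per 1M tokens")
print(f"break-even traffic ≈ {breakeven_tokens_per_hour:,.0f} tokens/hour sustained")
```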

Minimum required to leave this sequence

  • Deploy a 7B–13B model on vLLM and serve a request.
  • Quantize a model with AWQ or GPTQ and measure the latency / accuracy delta.
  • Run a benchmark and report TTFT, ITL, throughput.
  • Explain KV cache, continuous batching, and speculative decoding.
  • Write a self-host vs API memo with numbers.

Going further

  • vLLM source code-read the scheduler.
  • Efficiently Scaling Transformer Inference (Pope et al., arxiv.org/abs/2211.05102).
  • NVIDIA TensorRT-LLM docs-production-grade serving.
  • Hugging Face TGI (Text Generation Inference)-the HF alternative.

How this sequence connects to the year

  • Month 8: Rungs 01–05 are required reading.
  • Q3 Track C: This sequence is your specialty if you pick inference infra.
  • Capstone: Self-host benchmarking is a great public artifact regardless of track.
