12 - Serving Models¶

What this session is¶

About 45 minutes. How to expose your model as a service users can call. Local options (Ollama, llama.cpp), high-performance serving (vLLM), and rolling your own HTTP API.

The simplest possible serve: Ollama¶

Ollama is the easiest way to run an LLM locally.

# Install (macOS):
brew install ollama
ollama serve &

# Pull and run a model:
ollama pull llama3.2:3b
ollama run llama3.2:3b

You get an interactive chat. To use programmatically:

import httpx
r = httpx.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
})
print(r.json()["message"]["content"])

Ollama handles model download, loading, and GPU detection, and the models it pulls are already quantized. For local dev, it's the easiest start.
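If you set "stream": True in the request above, the reply is no longer one JSON object. To my understanding Ollama then sends newline-delimited JSON, one object per chunk, with a final object marked "done": true. A minimal sketch of reading that shape follows; the function name iter_chat_chunks and the sample lines are made up for illustration, and the sample is hand-written rather than captured from a real server.

```python
import json
from typing import Iterable, Iterator


def iter_chat_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield text pieces from an Ollama-style NDJSON chat stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        piece = chunk.get("message", {}).get("content", "")
        if piece:
            yield piece
        if chunk.get("done"):
            break  # the final object carries stats, not more text


# Offline demo: hand-written lines shaped like the stream (not real output).
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print("".join(iter_chat_chunks(sample)))  # Hello!
```

Against a live server you could feed it the response lines, for example with httpx.stream("POST", url, json=payload, timeout=None) and r.iter_lines(), printing each piece as it arrives.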

llama.cpp - for tighter control¶

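llama.cpp is a C++ inference engine that runs GGUF-quantized models on CPU, GPU, or a mix of both. It is lower-level than Ollama and more configurable; since Ollama builds on it, any speed gain comes mainly from tuning the flags yourself.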

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Recent versions build with CMake; the binaries land in build/bin/
cmake -B build
cmake --build build --config Release

# Download a quantized model into the current directory
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir .

# Run
./build/bin/llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello!" -n 100

Or serve as HTTP:

./build/bin/llama-server -m model.gguf -c 4096
# OpenAI-compatible API on http://localhost:8080

Many projects use llama.cpp under the hood (Ollama, LM Studio, etc.).

vLLM - high-throughput serving¶

For production-grade serving of large models with high concurrency: vLLM.

pip install vllm

Run a server:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 1

OpenAI-compatible API:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
r = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(r.choices[0].message.content)

vLLM's killer features:

- PagedAttention - KV-cache management like virtual memory. Way better GPU utilization than naive serving.
- Continuous batching - interleaves requests at the token level. Many concurrent users; high throughput.
- OpenAI-compatible API - drop-in replacement for OpenAI client libraries.

Used by many production LLM deployments. Needs a GPU; doesn't run on CPU usefully.
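To make the "KV cache like virtual memory" idea concrete, here is a toy sketch of on-demand block allocation. It is not vLLM's code: the class name, its methods, and the 16-token block size are assumptions chosen for illustration. The point is only that a request takes a new fixed-size block when its current one fills up, rather than reserving memory for its maximum possible length up front.

```python
class PagedKVAllocator:
    """Toy block allocator: hands out fixed-size KV blocks on demand."""

    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(total_blocks))  # ids of unused blocks
        self.tables = {}                       # request id -> list of block ids
        self.lengths = {}                      # request id -> tokens stored

    def append_token(self, req_id: str) -> None:
        """Reserve room for one more token, taking a new block if needed."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # current block is full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks; request must wait")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = PagedKVAllocator(total_blocks=4, block_size=16)
for _ in range(17):  # 17 tokens need 2 blocks of 16
    alloc.append_token("a")
for _ in range(5):   # 5 tokens need 1 block
    alloc.append_token("b")
print(len(alloc.tables["a"]), len(alloc.tables["b"]), len(alloc.free))  # 2 1 1
alloc.release("a")
print(len(alloc.free))  # 3
```

Because blocks are small and returned as soon as a request finishes, short requests do not pay for the worst-case context length, which is where the better GPU memory utilization comes from.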

A simple HTTP wrapper around your own model¶

If you want full control:

# server.py
# Requires: pip install fastapi uvicorn torch transformers accelerate
from fastapi import FastAPI
from pydantic import BaseModel, Field
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"

app = FastAPI()

# Load once at startup. device_map="auto" (needs accelerate) places the
# model on the GPU when one is available, otherwise on the CPU.
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 100
    temperature: float = Field(0.7, gt=0.0)  # sampling needs temperature > 0


@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            temperature=req.temperature,
            do_sample=True,
        )
    # Drop the prompt by token count. Slicing the decoded string by
    # len(req.prompt) breaks when decoding does not reproduce the prompt exactly.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return {"text": tok.decode(new_tokens, skip_special_tokens=True)}

Run: uvicorn server:app --host 0.0.0.0 --port 8080. Test:

curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_new_tokens": 30}'

For a small model and a few users, this works fine. For high concurrency, use vLLM.

Streaming responses¶

Users want to see tokens as they generate, not wait for the whole response. Implementation:

# Add to server.py: reuses app, tok, model, and GenerateRequest from above.
from fastapi.responses import StreamingResponse
from threading import Thread
from transformers import TextIteratorStreamer

@app.post("/stream")
def stream(req: GenerateRequest):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so run it in a background thread and read from the streamer.
    Thread(target=model.generate, kwargs={
        **inputs, "max_new_tokens": req.max_new_tokens,
        "streamer": streamer, "do_sample": True, "temperature": req.temperature,
    }).start()

    def gen():
        # The streamer yields decoded text chunks as they become available.
        for chunk in streamer:
            yield chunk
    return StreamingResponse(gen(), media_type="text/plain")

For production: use vLLM (streaming is built-in) rather than rolling your own threading.

Containerize for deployment¶

A Dockerfile for the FastAPI server:

FROM python:3.12-slim
WORKDIR /app

# On Linux the default PyTorch wheel normally bundles CUDA; pin a specific build if your driver needs it.
# accelerate is required by device_map="auto" in server.py.
RUN pip install --no-cache-dir torch fastapi uvicorn pydantic transformers accelerate

COPY server.py .
EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]

Build, push, run on Kubernetes (Containers + Kubernetes paths).

Deployment concerns¶

GPU scheduling. Kubernetes can schedule pods to GPU nodes (nvidia.com/gpu: 1 in resources). NVIDIA's GPU Operator manages drivers.
Cold start. Loading a 7B model takes 10-30 seconds. Don't scale to zero unless cold start is acceptable.
Model caching. Embed the model weights in the container image (huge), or mount as a PV (faster restarts).
Autoscaling. GPU pods are expensive. Scale based on request queue depth or GPU utilization, not CPU.
Observability. Latency per request, tokens/sec, GPU memory, queue depth.
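To illustrate the autoscaling point, here is a small sketch of a proportional scaling rule driven by queue depth rather than CPU. It follows the same shape as the formula the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current x metric / target)); the function name, the default bounds, and the example numbers are assumptions for illustration. The minimum of one replica reflects the cold-start advice above.

```python
import math


def desired_replicas(current: int, queue_depth_per_replica: float,
                     target_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current * queue_depth_per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# 2 replicas, each with 12 queued requests, target of 4 per replica.
print(desired_replicas(2, queue_depth_per_replica=12, target_per_replica=4))  # 6
# Idle queue: the clamp keeps one replica warm instead of scaling to zero.
print(desired_replicas(2, queue_depth_per_replica=0, target_per_replica=4))   # 1
```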

Cost / latency calibration¶

Rough numbers for a single A100 GPU serving Llama-3.1-8B (FP16):

- ~30-80 tokens/sec generation rate.
- ~16 GB of GPU memory for the weights alone, plus KV cache on top.
- A few concurrent users; vLLM bumps this to dozens.

Quantized (4-bit) on a single consumer 24GB GPU:

- ~40-100 tokens/sec.
- A single user is comfortable; concurrency is lower.

For frontier models (70B+), you need multi-GPU or sharded serving. Beyond beginner.
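To turn tokens/sec into money, a back-of-the-envelope calculation is enough. The sketch below is plain arithmetic; the $2.00/hour GPU price and the 1,000 tokens/sec aggregate batched rate are placeholder assumptions, not quotes or measurements. Substitute your own numbers.

```python
def usd_per_million_tokens(gpu_usd_per_hour: float, tokens_per_sec: float) -> float:
    """Cost of generating one million tokens at a sustained rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Placeholder price of $2.00/hour.
single = usd_per_million_tokens(2.00, 50)      # one stream at 50 tok/s
batched = usd_per_million_tokens(2.00, 1000)   # hypothetical aggregate with batching
print(round(single, 2), round(batched, 2))  # 11.11 0.56
```

The gap between the two numbers is why utilization matters more than raw single-stream speed when you compare self-hosting against per-token API pricing.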

OpenAI compatibility is a contract¶

Many tools (LangChain, LlamaIndex, CLI tools) speak the OpenAI HTTP API. vLLM, llama.cpp's server, and Ollama (with its /v1/... endpoints) all implement it. Building against the OpenAI API makes you portable across self-hosted and hosted backends.

from openai import OpenAI
# Same code works for openai.com, vLLM, Ollama, llama.cpp:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
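The portability claim can be seen without any client library: the request body stays the same and only the base URL changes. The sketch below builds (but does not send) a chat completion request using the standard library. The BACKENDS table uses the default local ports that appear earlier on this page, the helper name build_chat_request is made up, and the "dummy" bearer token assumes a local server that does not check keys.

```python
import json
import urllib.request

# Default local ports as used earlier on this page; adjust to your setup.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "llama.cpp": "http://localhost:8080/v1",
    "vllm": "http://localhost:8000/v1",
}


def build_chat_request(backend: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        BACKENDS[backend] + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer dummy"},
        method="POST",
    )


req = build_chat_request("vllm", "meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

To send it, pass the request to urllib.request.urlopen and read choices[0].message.content from the JSON reply, the same field the OpenAI client exposes.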

Going deeper¶

You can serve a model now. This section adds the depth that turns "it works on my machine for one request" into an understanding of why production serving is hard: the failure modes you'll hit the moment there's more than one user, and what each one looks like when it happens.

The throughput cliff: it's fast alone, slow under load¶

Your model serves one request at 50 tokens/sec. Five users connect, and suddenly everyone gets 10 tokens/sec or less: a fifth of the single-user rate at best, because naive serving processes requests one at a time and leaves the GPU underused. The fix that defines modern serving is continuous batching: the server processes many requests' tokens together in each GPU step, since a single request leaves most of the GPU's compute capacity spare.

Naive (one at a time):     5 users -> ~10 tok/s each, GPU mostly idle between requests
Continuous batching:        5 users -> ~40 tok/s each, GPU saturated
(vLLM/TGI do this; Ollama/llama.cpp are more single-user oriented)

This is the reason vLLM and TGI exist and why "I'll just wrap the model in FastAPI" falls over in production. If you're serving more than one concurrent user, you need a server built for batching - not a hand-rolled loop. Watching per-user throughput collapse as users increase is the symptom; continuous batching is the fix.
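The numbers in the comparison above can be reproduced with a deliberately simple model. This is a sketch, not a benchmark: it assumes naive serving splits the single-stream rate evenly between users, and that a batched decode step gets a fixed fraction slower for each extra request in the batch. The 6.25% slowdown per extra request is an assumption picked to match the ~40 tok/s figure; real values depend on the model, the GPU, and context lengths.

```python
def naive_tok_per_sec(single_user_rate: float, users: int) -> float:
    """One request at a time: users share the single-stream rate."""
    return single_user_rate / users


def batched_tok_per_sec(single_user_rate: float, users: int,
                        slowdown_per_extra: float = 0.0625) -> float:
    """Batched decode: every user gets one token per (slightly slower) step."""
    step_time = (1 / single_user_rate) * (1 + slowdown_per_extra * (users - 1))
    return 1 / step_time


for n in (1, 5):
    print(n, round(naive_tok_per_sec(50, n)), round(batched_tok_per_sec(50, n)))
# 1 50 50
# 5 10 40
```

In this model the total throughput under batching is 5 x 40 = 200 tok/s versus 50 tok/s for naive serving, which is the whole argument for a batching server.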

The memory math that decides everything: the KV cache¶

The hidden memory consumer in serving isn't the model - it's the KV cache, which stores the attention keys/values for every token in every active request's context. It grows with concurrent_requests x context_length, and it's often what OOMs a serving box:

Model weights (7B, fp16):     ~14 GB   (fixed)
KV cache per request:         ~0.5-2 GB depending on context length
-> 10 concurrent users with long contexts can need MORE memory than the model itself

This is why a server can load fine and then OOM under load: the weights fit, but the KV cache for many concurrent long-context requests doesn't. vLLM's famous innovation (PagedAttention) is specifically about managing this KV-cache memory efficiently. For you as a beginner: know that serving memory = weights + (concurrency x context KV cache), and that long contexts and high concurrency are what blow it up - not just model size.
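The per-request figure above can be derived rather than memorized. For each token, the cache holds one key and one value vector per layer per KV head. The sketch below assumes a 7B-class shape with 32 layers, 32 KV heads, head dimension 128, and fp16 values (2 bytes); those defaults are assumptions for illustration, so read the real values from your model's config. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one request holding `tokens` tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return tokens * per_token


GIB = 1024 ** 3
print(kv_cache_bytes(1) // 1024)        # 512  (KiB per token)
print(kv_cache_bytes(4096) / GIB)       # 2.0  (GiB for one 4k-token request)
print(10 * kv_cache_bytes(4096) / GIB)  # 20.0 (GiB for ten such requests)
```

Ten full 4k-token contexts come to 20 GiB under these assumptions, more than the ~14 GB of weights, which is the "more memory than the model itself" case described above.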

What you'll see: time-to-first-token vs tokens-per-second¶

Serving has two latencies users feel, and they're different problems:

Time to first token (TTFT) - how long until the response starts. Dominated by the prefill phase (processing the whole prompt). Long prompts = slow TTFT. Felt as "it hangs before responding."
Inter-token latency / tokens-per-second - how fast tokens stream once started. Dominated by the decode phase (one token at a time). Felt as "it types slowly."

$ curl -w "TTFT and total..." ...     # measure both
prompt of 2000 tokens -> TTFT 1.8s (prefill is slow), then 45 tok/s (decode fine)

A user complaining "it's slow to start" (TTFT/prefill - shorten the prompt, use prefix caching) needs a different fix from one complaining "it types slowly" (decode/throughput - batching, quantization, a smaller model). Streaming (stream=True) doesn't make generation faster, but it makes the wait feel shorter: the user sees tokens as soon as the first one is ready instead of waiting for the whole response. Knowing which latency a complaint is about is the core serving diagnostic.
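One way to separate the two latencies is to time a streaming response yourself. The sketch below works on any iterator of text chunks, so it can wrap whichever streaming client you use. The names measure_stream and fake_stream are made up, and fake_stream is a stand-in with hypothetical timings so the example runs without a server. Note that it counts chunks, which only approximates tokens.

```python
import time
from typing import Iterable


def measure_stream(chunks: Iterable[str]) -> dict:
    """Time a stream: seconds to first chunk, then chunks per second after it."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()  # first chunk arrived: end of prefill
        count += 1
    end = time.perf_counter()
    if first is None:
        return {"ttft_s": None, "chunks": 0, "decode_per_s": None}
    decode_time = end - first
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"ttft_s": first - start, "chunks": count, "decode_per_s": rate}


def fake_stream(prefill_s: float, per_chunk_s: float, n: int):
    """Stand-in for a real streaming response (hypothetical timings)."""
    time.sleep(prefill_s)  # simulated prefill before the first chunk
    for i in range(n):
        if i:
            time.sleep(per_chunk_s)  # simulated decode step
        yield "tok"


stats = measure_stream(fake_stream(prefill_s=0.2, per_chunk_s=0.01, n=20))
print(stats)
```

A high ttft_s with a healthy decode rate points at prefill (prompt length); a low ttft_s with a poor decode rate points at decode throughput.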

The quantization tradeoff, made concrete¶

Quantization (4-bit/8-bit) is how you fit a model on a smaller GPU and serve faster - at an accuracy cost:

7B model, fp16:    ~14 GB, highest quality, needs a 16GB+ GPU
7B model, 8-bit:   ~7 GB,  near-identical quality, fits 12GB
7B model, 4-bit:   ~4 GB,  slightly degraded, fits an 8GB consumer GPU

The practical guidance: 4-bit quantization is usually the right default for self-hosting - the quality loss is small and often imperceptible for many tasks, and it makes models fit on hardware you can afford. But measure on your actual task (the Evaluation chapter) - quantization can hurt more on reasoning-heavy or precise tasks than on casual chat. Don't assume; check the outputs.
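The sizes in the table above follow from simple arithmetic: parameters times bits per parameter, divided by 8 bits per byte. The sketch below counts weights only. Real quantized files also store scaling factors and some layers at higher precision, which is why a 4-bit 7B model lands closer to 4 GB than 3.5 GB.

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight memory in GB (10^9 bytes), ignoring overhead."""
    return params_billion * bits / 8


for bits in (16, 8, 4):
    print(bits, weight_gb(7, bits))
# 16 14.0
# 8 7.0
# 4 3.5
```

Remember to add KV cache and activation memory on top when deciding whether a model fits a given GPU.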

Try it (with what you'll see)¶

Serve a model (Ollama is easiest). Hit it with one request, note tokens/sec. Then fire 5 concurrent requests (a quick loop) and watch per-request throughput drop - the batching cliff.
Send a short prompt and a 2000-token prompt; measure time-to-first-token for each. Watch TTFT grow with prompt length (prefill cost).
Run the same model at fp16, 8-bit, and 4-bit (if your GPU allows). Compare memory use (nvidia-smi) and output quality on a few prompts. Feel the size-vs-quality tradeoff.
Use stream=True vs not. Same total time, but streaming feels far faster because tokens appear immediately.
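For the first experiment above, a "quick loop" could look like the sketch below. It assumes a local Ollama server on the default port with llama3.2:3b pulled, and relies on the eval_count field that Ollama includes in non-streaming replies. The helper names and the prompt are made up. Rates are computed against wall-clock time so that time spent waiting in a queue counts against each request.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama port


def ollama_generate(prompt: str, model: str = "llama3.2:3b") -> dict:
    """Send one non-streaming request and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)


def wall_clock_rate(send, prompt: str) -> float:
    """Generated tokens divided by total elapsed seconds for one request."""
    start = time.perf_counter()
    reply = send(prompt)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return reply["eval_count"] / elapsed


def run_concurrent(n: int, send=ollama_generate, prompt: str = "Count to 50.") -> list:
    """Fire n requests at once and return each request's tokens/sec."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda p: wall_clock_rate(send, p), [prompt] * n))
```

With the server running, compare run_concurrent(1) against run_concurrent(5). The send parameter exists so you can point the same loop at another backend, or at a stub while testing.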

Exercise¶

Install Ollama. Pull a small model. Chat.
Call it from Python via its HTTP API.
Build a tiny FastAPI server that wraps a small HF model (page 04's MLP works, or a Phi-3-mini for fun). Curl it.
(Stretch - GPU helpful) Install vLLM. Serve microsoft/Phi-3-mini-4k-instruct. Use the OpenAI client to call it.
(Stretch) Containerize your FastAPI server. Build the image. Run via docker run.

What you might wonder¶

"Should I serve via my own framework or vLLM?" For real production: vLLM (or TGI, Triton). The hand-rolled FastAPI version works fine for a hobby project but doesn't handle concurrency well.

"How do I keep the model warm?" Don't scale to zero. Have at least one replica always running. Health checks must respond fast (≤1s) without invoking the model.
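
"GPU memory keeps growing - what?" The KV cache (page 07) grows with context length and with the number of active requests. In vLLM, cap it with --max-model-len (maximum context per request) and --max-num-seqs (maximum concurrent sequences), or trim old context out of long conversations.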

"Open source vs API providers?" Both have a place. OpenAI/Anthropic/Google APIs are easy and powerful; you pay per token. Self-hosting is cheaper at scale but adds ops complexity. Most production teams use both - APIs for high-quality requests, self-hosted for high-volume cheaper requests.

Done¶

Run an LLM locally with Ollama or llama.cpp.
Serve high-throughput with vLLM.
Build a custom FastAPI wrapper.
Stream responses.
Containerize for deployment.

Next: Picking a project →