12 - Serving Models¶
What this session is¶
About 45 minutes. How to expose your model as a service users can call. Local options (Ollama, llama.cpp), high-performance serving (vLLM), and rolling your own HTTP API.
The simplest possible serve: Ollama¶
Ollama is the easiest way to run an LLM locally.
# Install (macOS):
brew install ollama
ollama serve &
# Pull and run a model:
ollama pull llama3.2:3b
ollama run llama3.2:3b
You get an interactive chat. To use it programmatically, call Ollama's local HTTP API:
import httpx
r = httpx.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
})
print(r.json()["message"]["content"])
Ollama handles model loading, quantization, and GPU detection for you. For local development, it's the easiest way to start.
llama.cpp - for tighter control¶
llama.cpp is a C++ inference engine that runs GGUF-quantized models on CPU and GPU. It's lower-level than Ollama and more configurable.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Download a quantized model
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Run
./llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello!" -n 100
Or serve it over HTTP with the llama-server binary (built alongside llama-cli); flags vary by version, but roughly:
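./llama-server -m Llama-3.2-3B-Instruct-Q4_K_M.gguf --port 8080
# exposes an OpenAI-style /v1/chat/completions endpoint on localhost:8080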
Many projects use llama.cpp under the hood (Ollama, LM Studio, etc.).
vLLM - high-throughput serving¶
For production-grade serving of large models with high concurrency: vLLM.
Run a server (one common invocation; the CLI flags vary across vLLM versions):
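pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# older releases: python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct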
OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
r = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(r.choices[0].message.content)
vLLM's killer features:
- PagedAttention - KV-cache management that works like virtual memory paging. Far better GPU utilization than naive serving.
- Continuous batching - interleaves requests at the token level. Many concurrent users; high throughput.
- OpenAI-compatible API - drop-in replacement for OpenAI client libraries.
Used by many production LLM deployments. Needs a GPU; doesn't run on CPU usefully.
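A quick way to see continuous batching pay off is to fire a pile of concurrent requests at the server. A sketch using the async OpenAI client (the request count and prompts are arbitrary):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

async def ask(i: int) -> str:
    r = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": f"Give me fact #{i} about GPUs."}],
    )
    return r.choices[0].message.content

async def main():
    # vLLM interleaves these at the token level instead of serving them one by one.
    answers = await asyncio.gather(*(ask(i) for i in range(32)))
    print(len(answers), "responses")

asyncio.run(main())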
A simple HTTP wrapper around your own model¶
If you want full control:
# server.py
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
app = FastAPI()
# Load the model once at startup. device_map="auto" (requires accelerate)
# places it on a GPU if one is available, otherwise on CPU.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()
class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 100
    temperature: float = 0.7

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            temperature=req.temperature,
            do_sample=True,
        )
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return {"text": tok.decode(new_tokens, skip_special_tokens=True)}
Run: uvicorn server:app --host 0.0.0.0 --port 8080. Test:
curl -X POST http://localhost:8080/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Once upon a time", "max_new_tokens": 30}'
For a small model and a few users, this works fine. For high concurrency, use vLLM.
Streaming responses¶
Users want to see tokens as they are generated, not wait for the whole response. One way to implement it, using transformers' TextIteratorStreamer:
from fastapi.responses import StreamingResponse
from threading import Thread
from transformers import TextIteratorStreamer
@app.post("/stream")
def stream(req: GenerateRequest):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread; the streamer yields text as it is produced.
    Thread(target=model.generate, kwargs={
        **inputs, "max_new_tokens": req.max_new_tokens,
        "streamer": streamer, "do_sample": True, "temperature": req.temperature,
    }).start()

    def gen():
        for token in streamer:
            yield token

    return StreamingResponse(gen(), media_type="text/plain")
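A minimal client for that endpoint, assuming the server above is running on localhost:8080:

import httpx

payload = {"prompt": "Once upon a time", "max_new_tokens": 50}
with httpx.stream("POST", "http://localhost:8080/stream", json=payload, timeout=60) as r:
    for chunk in r.iter_text():
        print(chunk, end="", flush=True)   # text appears as the server produces it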
For production: use vLLM (streaming is built-in) rather than rolling your own threading.
Containerize for deployment¶
A Dockerfile for the FastAPI server:
FROM python:3.12-slim
WORKDIR /app
# PyTorch wheel (pick the index URL matching your CUDA version if needed)
RUN pip install --no-cache-dir torch fastapi uvicorn pydantic transformers accelerate
COPY server.py .
EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]
Build, push, run on Kubernetes (Containers + Kubernetes paths).
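Locally, the build-and-run loop looks roughly like this (the image name is illustrative; --gpus all requires the NVIDIA Container Toolkit on the host):

docker build -t llm-server .
docker run --gpus all -p 8080:8080 llm-server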
Deployment concerns¶
- GPU scheduling. Kubernetes can schedule pods onto GPU nodes (nvidia.com/gpu: 1 in the container's resources). NVIDIA's GPU Operator manages the drivers.
- Cold start. Loading a 7B model takes 10-30 seconds. Don't scale to zero unless cold starts are acceptable.
- Model caching. Embed the model weights in the container image (huge image), or mount them from a persistent volume (faster restarts).
- Autoscaling. GPU pods are expensive. Scale based on request queue depth or GPU utilization, not CPU.
- Observability. Latency per request, tokens/sec, GPU memory, queue depth (a crude starting point is sketched below).
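On that last point, a rough sketch of per-request metrics for the hand-rolled server above; a real deployment would export these to Prometheus or similar rather than printing them:

import time

@app.post("/generate_timed")
def generate_timed(req: GenerateRequest):
    t0 = time.perf_counter()
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=req.max_new_tokens,
                             temperature=req.temperature, do_sample=True)
    elapsed = time.perf_counter() - t0
    n_new = out.shape[1] - inputs["input_ids"].shape[1]
    # The basics worth tracking: per-request latency and generation throughput.
    print(f"latency={elapsed:.2f}s tokens/sec={n_new / elapsed:.1f}")
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return {"text": tok.decode(new_tokens, skip_special_tokens=True)}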
Cost / latency calibration¶
Rough numbers for a single A100 GPU serving Llama-3.1-8B (FP16):
- ~30-80 tokens/sec generation rate per request.
- ~16 GB GPU memory for the weights alone (8B params × 2 bytes), plus KV cache on top.
- A few concurrent users naively; vLLM bumps this to dozens.

Quantized (4-bit) on a single consumer 24 GB GPU:
- ~40-100 tokens/sec.
- Comfortable for a single user; concurrency is lower.

For frontier-scale models (70B+), you need multi-GPU or sharded serving, which is beyond the scope of this session.
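A quick back-of-envelope check on those numbers (the inputs are assumed figures, not measurements):

params = 8e9                      # Llama-3.1-8B parameter count
bytes_per_param_fp16 = 2
weights_gb = params * bytes_per_param_fp16 / 1e9
print(f"FP16 weights: ~{weights_gb:.0f} GB")       # ~16 GB before KV cache

tokens = 200                      # a typical chat reply
tok_per_sec = 50                  # mid-range of the A100 estimate above
print(f"Latency for {tokens} tokens: ~{tokens / tok_per_sec:.0f} s")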
OpenAI compatibility is a contract¶
Many tools (LangChain, llama-index, CLI tools) speak the OpenAI HTTP API. vLLM, llama.cpp's server, Ollama (with its /v1/... endpoints) all implement it. Building against the OpenAI API makes you portable across self-hosted and hosted backends.
from openai import OpenAI
# The same code works against openai.com, vLLM, Ollama, and llama.cpp's server - only base_url and model name change:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(model="llama3.2:3b",
                                   messages=[{"role": "user", "content": "Hello!"}])
Exercise¶
- Install Ollama. Pull a small model. Chat.
- Call it from Python via its HTTP API.
- Build a tiny FastAPI server that wraps a small HF model (page 04's MLP works, or a Phi-3-mini for fun). Curl it.
- (Stretch - GPU helpful) Install vLLM. Serve microsoft/Phi-3-mini-4k-instruct. Use the OpenAI client to call it.
- (Stretch) Containerize your FastAPI server. Build the image. Run via docker run.
What you might wonder¶
"Should I serve via my own framework or vLLM?" For real production: vLLM (or TGI, Triton). The hand-rolled FastAPI version works fine for a hobby project but doesn't handle concurrency well.
"How do I keep the model warm?" Don't scale to zero. Have at least one replica always running. Health checks must respond fast (≤1s) without invoking the model.
"GPU memory keeps growing - what?"
The KV cache (page 07) grows as the context grows. Cap the context length (e.g. --max-model-len in vLLM) or truncate old context.
"Open source vs API providers?" Both have a place. OpenAI/Anthropic/Google APIs are easy and powerful; you pay per token. Self-hosting is cheaper at scale but adds ops complexity. Most production teams use both - APIs for high-quality requests, self-hosted for high-volume cheaper requests.
Done¶
- Run an LLM locally with Ollama or llama.cpp.
- Serve high-throughput with vLLM.
- Build a custom FastAPI wrapper.
- Stream responses.
- Containerize for deployment.