12 - Serving Models¶
What this session is¶
About 45 minutes. How to expose your model as a service users can call. Local options (Ollama, llama.cpp), high-performance serving (vLLM), and rolling your own HTTP API.
The simplest possible serve: Ollama¶
Ollama is the easiest way to run an LLM locally.
# Install (macOS):
brew install ollama
ollama serve &
# Pull and run a model:
ollama pull llama3.2:3b
ollama run llama3.2:3b
You get an interactive chat. To use it programmatically, call Ollama's local HTTP API:
import httpx
r = httpx.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
})
print(r.json()["message"]["content"])
Ollama handles model loading, quantization, and GPU detection for you. For local development, it's the easiest way to start.
llama.cpp - for tighter control¶
llama.cpp is a C++ inference engine that runs GGUF-quantized models on CPU and GPU. It's lower-level than Ollama and more configurable.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Download a quantized model
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Run
./llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello!" -n 100
Or serve it over HTTP with the llama-server binary (built alongside llama-cli); flags vary by version, but roughly:
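./llama-server -m Llama-3.2-3B-Instruct-Q4_K_M.gguf --port 8080
# exposes an OpenAI-style /v1/chat/completions endpoint on localhost:8080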
Many projects use llama.cpp under the hood (Ollama, LM Studio, etc.).
vLLM - high-throughput serving¶
For production-grade serving of large models with high concurrency: vLLM.
Run a server (one common invocation; the CLI flags vary across vLLM versions):
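pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# older releases: python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct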
OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
r = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(r.choices[0].message.content)
vLLM's killer features:
- PagedAttention - KV-cache management that works like virtual memory paging. Far better GPU utilization than naive serving.
- Continuous batching - interleaves requests at the token level. Many concurrent users; high throughput.
- OpenAI-compatible API - drop-in replacement for OpenAI client libraries.
Used by many production LLM deployments. Needs a GPU; doesn't run on CPU usefully.
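A quick way to see continuous batching pay off is to fire a pile of concurrent requests at the server. A sketch using the async OpenAI client (the request count and prompts are arbitrary):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

async def ask(i: int) -> str:
    r = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": f"Give me fact #{i} about GPUs."}],
    )
    return r.choices[0].message.content

async def main():
    # vLLM interleaves these at the token level instead of serving them one by one.
    answers = await asyncio.gather(*(ask(i) for i in range(32)))
    print(len(answers), "responses")

asyncio.run(main())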
A simple HTTP wrapper around your own model¶
If you want full control:
# server.py
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
app = FastAPI()
# Load the model once at startup. device_map="auto" (requires accelerate)
# places it on a GPU if one is available, otherwise on CPU.
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()
class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 100
    temperature: float = 0.7

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            temperature=req.temperature,
            do_sample=True,
        )
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return {"text": tok.decode(new_tokens, skip_special_tokens=True)}
Run: uvicorn server:app --host 0.0.0.0 --port 8080. Test:
curl -X POST http://localhost:8080/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Once upon a time", "max_new_tokens": 30}'
For a small model and a few users, this works fine. For high concurrency, use vLLM.
Streaming responses¶
Users want to see tokens as they are generated, not wait for the whole response. One way to implement it, using transformers' TextIteratorStreamer:
from fastapi.responses import StreamingResponse
from threading import Thread
from transformers import TextIteratorStreamer
@app.post("/stream")
def stream(req: GenerateRequest):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    # Run generation in a background thread; the streamer yields text as it is produced.
    Thread(target=model.generate, kwargs={
        **inputs, "max_new_tokens": req.max_new_tokens,
        "streamer": streamer, "do_sample": True, "temperature": req.temperature,
    }).start()

    def gen():
        for token in streamer:
            yield token

    return StreamingResponse(gen(), media_type="text/plain")
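A minimal client for that endpoint, assuming the server above is running on localhost:8080:

import httpx

payload = {"prompt": "Once upon a time", "max_new_tokens": 50}
with httpx.stream("POST", "http://localhost:8080/stream", json=payload, timeout=60) as r:
    for chunk in r.iter_text():
        print(chunk, end="", flush=True)   # text appears as the server produces it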
For production: use vLLM (streaming is built-in) rather than rolling your own threading.
Containerize for deployment¶
A Dockerfile for the FastAPI server:
FROM python:3.12-slim
WORKDIR /app
# PyTorch wheel (pick the index URL matching your CUDA version if needed)
RUN pip install --no-cache-dir torch fastapi uvicorn pydantic transformers accelerate
COPY server.py .
EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]
Build, push, run on Kubernetes (Containers + Kubernetes paths).
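Locally, the build-and-run loop looks roughly like this (the image name is illustrative; --gpus all requires the NVIDIA Container Toolkit on the host):

docker build -t llm-server .
docker run --gpus all -p 8080:8080 llm-server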
Deployment concerns¶
- GPU scheduling. Kubernetes can schedule pods onto GPU nodes (nvidia.com/gpu: 1 in the container's resources). NVIDIA's GPU Operator manages the drivers.
- Cold start. Loading a 7B model takes 10-30 seconds. Don't scale to zero unless cold starts are acceptable.
- Model caching. Embed the model weights in the container image (huge image), or mount them from a persistent volume (faster restarts).
- Autoscaling. GPU pods are expensive. Scale based on request queue depth or GPU utilization, not CPU.
- Observability. Latency per request, tokens/sec, GPU memory, queue depth (a crude starting point is sketched below).
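On that last point, a rough sketch of per-request metrics for the hand-rolled server above; a real deployment would export these to Prometheus or similar rather than printing them:

import time

@app.post("/generate_timed")
def generate_timed(req: GenerateRequest):
    t0 = time.perf_counter()
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=req.max_new_tokens,
                             temperature=req.temperature, do_sample=True)
    elapsed = time.perf_counter() - t0
    n_new = out.shape[1] - inputs["input_ids"].shape[1]
    # The basics worth tracking: per-request latency and generation throughput.
    print(f"latency={elapsed:.2f}s tokens/sec={n_new / elapsed:.1f}")
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return {"text": tok.decode(new_tokens, skip_special_tokens=True)}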
Cost / latency calibration¶
Rough numbers for a single A100 GPU serving Llama-3.1-8B (FP16):
- ~30-80 tokens/sec generation rate per request.
- ~16 GB GPU memory for the weights alone (8B params × 2 bytes), plus KV cache on top.
- A few concurrent users naively; vLLM bumps this to dozens.

Quantized (4-bit) on a single consumer 24 GB GPU:
- ~40-100 tokens/sec.
- Comfortable for a single user; concurrency is lower.

For frontier-scale models (70B+), you need multi-GPU or sharded serving, which is beyond the scope of this session.
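A quick back-of-envelope check on those numbers (the inputs are assumed figures, not measurements):

params = 8e9                      # Llama-3.1-8B parameter count
bytes_per_param_fp16 = 2
weights_gb = params * bytes_per_param_fp16 / 1e9
print(f"FP16 weights: ~{weights_gb:.0f} GB")       # ~16 GB before KV cache

tokens = 200                      # a typical chat reply
tok_per_sec = 50                  # mid-range of the A100 estimate above
print(f"Latency for {tokens} tokens: ~{tokens / tok_per_sec:.0f} s")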
OpenAI compatibility is a contract¶
Many tools (LangChain, llama-index, CLI tools) speak the OpenAI HTTP API. vLLM, llama.cpp's server, Ollama (with its /v1/... endpoints) all implement it. Building against the OpenAI API makes you portable across self-hosted and hosted backends.
from openai import OpenAI
# The same code works against openai.com, vLLM, Ollama, and llama.cpp's server - only base_url and model name change:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(model="llama3.2:3b",
                                   messages=[{"role": "user", "content": "Hello!"}])
Exercise¶
- Install Ollama. Pull a small model. Chat.
- Call it from Python via its HTTP API.
- Build a tiny FastAPI server that wraps a small HF model (page 04's MLP works, or a Phi-3-mini for fun). Curl it.
- (Stretch - GPU helpful) Install vLLM. Serve microsoft/Phi-3-mini-4k-instruct. Use the OpenAI client to call it.
- (Stretch) Containerize your FastAPI server. Build the image. Run via docker run.
What you might wonder¶
"Should I serve via my own framework or vLLM?" For real production: vLLM (or TGI, Triton). The hand-rolled FastAPI version works fine for a hobby project but doesn't handle concurrency well.
"How do I keep the model warm?" Don't scale to zero. Have at least one replica always running. Health checks must respond fast (≤1s) without invoking the model.
"GPU memory keeps growing - what?"
The KV cache (page 07) grows as the context grows. Cap the context length (e.g. --max-model-len in vLLM) or truncate old context.
"Open source vs API providers?" Both have a place. OpenAI/Anthropic/Google APIs are easy and powerful; you pay per token. Self-hosting is cheaper at scale but adds ops complexity. Most production teams use both - APIs for high-quality requests, self-hosted for high-volume cheaper requests.
Done¶
- Run an LLM locally with Ollama or llama.cpp.
- Serve high-throughput with vLLM.
- Build a custom FastAPI wrapper.
- Stream responses.
- Containerize for deployment.