Week 21 - LLM-App Foundations: Prompts, Tokens, Streaming, Cost
21.1 Conceptual Core
- An LLM call is an autoregressive generation over a token stream, billed per token, latency-bound by output length, and probabilistic in output. All four facts shape the system around it.
- Tokens, not characters or words. Costs, context windows, and rate limits are all measured in tokens. Always budget in tokens; `tiktoken` or the model's own tokenizer is the source of truth.
- Streaming is product-critical. Time-to-first-token (TTFT) usually matters more than total time. Design APIs streaming-first; convert to batch only if needed.
- Caching the prompt prefix is free latency. Anthropic and OpenAI both expose prompt caching; structure prompts so the cacheable prefix is large and stable.
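The streaming-first consumption loop can be sketched as below. The event stream here is a simulated stand-in: real SDKs yield text deltas from an async iterator in a similar shape, but event names and fields vary by provider.

```python
# Sketch: streaming-first consumption with TTFT measurement.
# `fake_event_stream` is a stand-in for a provider SDK's async event stream.
import asyncio
import time

async def fake_event_stream():
    # Simulates SSE text deltas arriving one at a time.
    for chunk in ["The", " answer", " is", " 42."]:
        await asyncio.sleep(0)  # simulate a network gap between events
        yield chunk

async def consume(stream):
    start = time.monotonic()
    ttft = None
    parts = []
    async for delta in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time-to-first-token
        parts.append(delta)  # a real UI would render this delta immediately
    return "".join(parts), ttft

text, ttft = asyncio.run(consume(fake_event_stream()))
```

The batch interface falls out for free: collect the streamed deltas and return the joined string, which is why designing streaming-first costs nothing.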
21.2 Mechanical Detail
- SDKs: `anthropic` and `openai`, plus `litellm` as a normalization layer if you need provider portability. Always use async clients: sync clients block your event loop and waste capacity.
- Streaming: SSE on the wire; in code, an async iterator of events. Render incrementally. Handle mid-stream errors and retries (resuming is hard; usually you re-issue from scratch).
- Structured output: JSON mode, tool use / function calling, or constrained decoding (`outlines`, `instructor`, `lm-format-enforcer`). Use Pydantic models as the schema.
- Failure modes: rate limits (429), token limits, content filters, schema-violating outputs, hallucinated tool arguments. Each has a distinct retry strategy.
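A minimal sketch of the "distinct retry strategy" point, stdlib only. The exception classes are hypothetical stand-ins for provider SDK errors: a 429 is retryable with backoff, while a schema violation usually needs a re-prompt rather than a blind retry.

```python
# Sketch: exponential backoff with jitter for rate-limit (429) errors.
# Exception names are illustrative, not a real SDK's.
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 from the provider: retry with backoff."""

class SchemaViolation(Exception):
    """Stand-in for schema-violating output: re-prompt, don't just retry."""

def with_backoff(call, max_attempts=5, base=0.5):
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the error
            # exponential backoff with jitter to avoid thundering herds
            time.sleep(base * (2 ** attempt) * (0.5 + random.random()))
```

Token-limit errors get yet another strategy (truncate or summarize the prompt, then retry once); backoff alone never fixes them.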
21.3 Lab - "A Disciplined LLM Client"
- Build an `LLMClient` abstraction over the `anthropic` and `openai` async SDKs. Methods: `generate`, `stream`, `with_tools`.
- Add token accounting: pre-call estimate, post-call actual, running cost meter.
- Add caching headers (Anthropic prompt caching). Measure latency delta.
- Add a structured-output mode using `instructor` + a Pydantic schema. Test it on a deliberately ambiguous prompt; observe schema enforcement.
- Add timeout, retry-with-backoff, and a circuit breaker (`pybreaker` or hand-rolled).
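A hand-rolled circuit breaker along the lines of the last bullet might look like this. The failure threshold and reset window are illustrative defaults, not prescribed values.

```python
# Sketch: a hand-rolled circuit breaker (the lab's alternative to pybreaker).
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # half-open: allow one trial call through
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

Failing fast while the provider is down protects your own latency budget and stops you from burning retries (and tokens) against an outage.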
21.4 Production Hardening Slice
- Add per-request `trace_id`, `model`, `prompt_tokens`, `completion_tokens`, `cached_tokens`, and `cost_usd` fields to your structured logs. This is the only way you'll keep cost under control in production.
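A minimal sketch of that log record, stdlib only. Field values, the model name, and the cost figure are illustrative; in practice `cost_usd` is derived from the token counts and the model's per-token prices.

```python
# Sketch: the per-request structured log record described above.
# Values are illustrative; emit one JSON line per request.
import json
import uuid

def log_record(model, prompt_tokens, completion_tokens, cached_tokens, cost_usd):
    return {
        "trace_id": str(uuid.uuid4()),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cached_tokens": cached_tokens,
        "cost_usd": round(cost_usd, 6),
    }

line = json.dumps(log_record("example-model", 2000, 500, 1500, 0.0135))
```

With these fields in every log line, cost-per-feature and cache hit rate become simple aggregation queries instead of archaeology.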