
Week 21 - LLM-App Foundations: Prompts, Tokens, Streaming, Cost

21.1 Conceptual Core

  • An LLM call is an autoregressive generation over a token stream, billed per token, latency-bound by output length, and probabilistic in output. All four facts shape the system around it.
  • Tokens, not characters or words. Costs, context windows, and rate limits are all denominated in tokens, so always budget in tokens; tiktoken or the model's own tokenizer is the source of truth (see the sketch after this list).
  • Streaming is product-critical. Time-to-first-token (TTFT) usually matters more than total time. Design APIs streaming-first; convert to batch only if needed.
  • Caching the prompt prefix is free latency. Anthropic and OpenAI both expose prompt caching; structure prompts so the cacheable prefix is large and stable.
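
A minimal token-budgeting sketch, assuming tiktoken for an OpenAI-family model; the per-million-token prices are placeholders, not real rates (Anthropic models need the provider's own token counter instead):

```python
import tiktoken

# Placeholder prices in USD per million tokens: look up your model's real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def budget(prompt: str, max_output_tokens: int, model: str = "gpt-4o") -> dict:
    """Pre-call budget: exact prompt token count, worst-case cost bound."""
    enc = tiktoken.encoding_for_model(model)  # the tokenizer is the source of truth
    prompt_tokens = len(enc.encode(prompt))
    worst_case_usd = (
        prompt_tokens * PRICE_PER_MTOK["input"]
        + max_output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000
    return {"prompt_tokens": prompt_tokens, "worst_case_usd": worst_case_usd}

print(budget("Summarize this ticket in one sentence.", max_output_tokens=200))
```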

21.2 Mechanical Detail

  • SDKs: anthropic, openai, plus litellm as a normalization layer if you need provider portability. Async clients always: sync clients block your event loop and waste capacity.
  • Streaming: SSE on the wire; in code, an async iterator of events. Render incrementally. Handle mid-stream errors and retries (resume is hard; usually you re-issue from scratch). See the streaming sketch after this list.
  • Structured output: JSON mode, tool use / function calling, or constrained decoding (outlines, instructor, lm-format-enforcer). Pydantic models as the schema; see the structured-output sketch after this list.
  • Failure modes: rate limits (429), token limits, content filters, schema-violating outputs, hallucinated tool arguments. Each needs a distinct retry strategy: back off and retry on 429s, shrink the prompt on token limits, re-ask with the validation error on schema violations, and don't retry content filters at all.
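
A streaming sketch against the anthropic async SDK's messages.stream helper; the model ID is a placeholder, and the error policy is the simplest one that matches the bullet above (no resume, re-issue from scratch):

```python
import anthropic

client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def stream_reply(prompt: str, model: str = "claude-sonnet-4-20250514") -> str:
    chunks: list[str] = []
    try:
        async with client.messages.stream(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            async for text in stream.text_stream:  # async iterator of text deltas
                print(text, end="", flush=True)    # render incrementally: TTFT is what the user feels
                chunks.append(text)
    except anthropic.APIError:
        # Mid-stream failure: there is no resume, so surface the partial text
        # or re-issue the whole request from scratch.
        raise
    return "".join(chunks)

# asyncio.run(stream_reply("Explain SSE in two sentences."))
```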

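And a structured-output sketch, assuming instructor's from_openai wrapper and its response_model/max_retries parameters; the Ticket schema is purely illustrative:

```python
import instructor
import openai
from pydantic import BaseModel, Field

class Ticket(BaseModel):
    summary: str = Field(description="One-sentence summary of the request")
    severity: int = Field(ge=1, le=5)

client = instructor.from_openai(openai.AsyncOpenAI())

async def extract_ticket(text: str) -> Ticket:
    # instructor injects the schema into the request, validates the reply
    # against the Pydantic model, and re-asks with the validation error
    # up to max_retries times if the model violates the schema.
    return await client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Ticket,
        max_retries=2,
        messages=[{"role": "user", "content": text}],
    )
```
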
21.3 Lab - "A Disciplined LLM Client"

  1. Build an LLMClient abstraction over anthropic and openai async SDKs. Methods: generate, stream, with_tools.
  2. Add token accounting: pre-call estimate, post-call actual, running cost meter.
  3. Add prompt caching (Anthropic's cache_control breakpoints). Measure the latency and cost delta; see the caching sketch after this list.
  4. Add structured-output mode using instructor + a Pydantic schema. Test on a deliberately ambiguous prompt; observe schema enforcement.
  5. Add timeout, retry-with-backoff, and a circuit breaker (pybreaker or hand-rolled; sketch after this list).
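
For step 3, a caching sketch assuming Anthropic's cache_control request shape and usage field names; the model ID is a placeholder, and the prefix must exceed the model's minimum cacheable length to actually cache:

```python
import anthropic

client = anthropic.AsyncAnthropic()

async def ask(question: str, system_prefix: str):
    resp = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        system=[{
            "type": "text",
            "text": system_prefix,  # large and stable: this is the cacheable prefix
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )
    # First call writes the cache; calls within the TTL read it back cheaply.
    print(resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)
    return resp
```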

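For step 5, a hand-rolled sketch; the threshold, cooldown, and timeout values are arbitrary, and pybreaker gives you the breaker half off the shelf:

```python
import asyncio
import random
import time

import anthropic

class CircuitOpen(Exception):
    pass

class Breaker:
    """Open after `threshold` consecutive failures; probe again after `cooldown`."""
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, 0.0

    def check(self) -> None:
        if self.failures >= self.threshold and time.monotonic() - self.opened_at < self.cooldown:
            raise CircuitOpen("LLM backend circuit is open; failing fast")

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # (re)open the circuit

breaker = Breaker()

async def call_llm(make_call, *, attempts: int = 4, base: float = 1.0):
    """`make_call` is a zero-arg coroutine factory, e.g. lambda: client.messages.create(...)."""
    for attempt in range(attempts):
        breaker.check()  # fail fast while the circuit is open
        try:
            result = await asyncio.wait_for(make_call(), timeout=60)  # hard timeout
            breaker.record(ok=True)
            return result
        except (anthropic.RateLimitError, asyncio.TimeoutError):
            breaker.record(ok=False)
            if attempt == attempts - 1:
                raise
            # Exponential backoff with full jitter to avoid thundering herds.
            await asyncio.sleep(random.uniform(0, base * 2 ** attempt))
```
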
21.4 Production Hardening Slice

  • Add per-request trace_id, model, prompt_tokens, completion_tokens, cached_tokens, cost_usd to your structured logs. This is the only way you'll keep cost under control in production; a minimal sketch follows.
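
A minimal sketch using structlog (one reasonable choice, not prescribed above), assuming an Anthropic-style usage object with input_tokens/output_tokens fields:

```python
import structlog

log = structlog.get_logger()

def log_llm_call(trace_id: str, model: str, usage, cost_usd: float) -> None:
    # One structured line per call; aggregate cost_usd by model and caller downstream.
    log.info(
        "llm_call",
        trace_id=trace_id,
        model=model,
        prompt_tokens=usage.input_tokens,
        completion_tokens=usage.output_tokens,
        cached_tokens=getattr(usage, "cache_read_input_tokens", 0) or 0,
        cost_usd=round(cost_usd, 6),
    )
```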
