
Week 21 - LLM-App Foundations: Prompts, Tokens, Streaming, Cost

21.1 Conceptual Core

  • An LLM call is an autoregressive generation over a token stream, billed per token, latency-bound by output length, and probabilistic in output. All four facts shape the system around it.
  • Tokens, not characters or words. Costs, context windows, and rate limits are all denominated in tokens, so always budget in tokens; tiktoken or the model's own tokenizer is the source of truth (see the sketch after this list).
  • Streaming is product-critical. Time-to-first-token (TTFT) usually matters more than total time. Design APIs streaming-first; convert to batch only if needed.
  • Caching the prompt prefix is free latency. Anthropic and OpenAI both expose prompt caching; structure prompts so the cacheable prefix is large and stable.
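
A minimal token-budgeting sketch, assuming tiktoken for an OpenAI-family model; the per-million-token prices are placeholders, not real rates (Anthropic models need the provider's own token counter instead):

```python
import tiktoken

# Placeholder prices in USD per million tokens: look up your model's real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def budget(prompt: str, max_output_tokens: int, model: str = "gpt-4o") -> dict:
    """Pre-call budget: exact prompt token count, worst-case cost bound."""
    enc = tiktoken.encoding_for_model(model)  # the tokenizer is the source of truth
    prompt_tokens = len(enc.encode(prompt))
    worst_case_usd = (
        prompt_tokens * PRICE_PER_MTOK["input"]
        + max_output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000
    return {"prompt_tokens": prompt_tokens, "worst_case_usd": worst_case_usd}

print(budget("Summarize this ticket in one sentence.", max_output_tokens=200))
```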

21.2 Mechanical Detail

  • SDKs: anthropic, openai, plus litellm as a normalization layer if you need provider portability. Async clients always: sync clients block your event loop and waste capacity.
  • Streaming: SSE on the wire; in code, an async iterator of events. Render incrementally. Handle mid-stream errors and retries (resume is hard; usually you re-issue from scratch). See the streaming sketch after this list.
  • Structured output: JSON mode, tool use / function calling, or constrained decoding (outlines, instructor, lm-format-enforcer). Pydantic models as the schema; see the structured-output sketch after this list.
  • Failure modes: rate limits (429), token limits, content filters, schema-violating outputs, hallucinated tool arguments. Each needs a distinct retry strategy: back off and retry on 429s, shrink the prompt on token limits, re-ask with the validation error on schema violations, and don't retry content filters at all.
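
A streaming sketch against the anthropic async SDK's messages.stream helper; the model ID is a placeholder, and the error policy is the simplest one that matches the bullet above (no resume, re-issue from scratch):

```python
import anthropic

client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def stream_reply(prompt: str, model: str = "claude-sonnet-4-20250514") -> str:
    chunks: list[str] = []
    try:
        async with client.messages.stream(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            async for text in stream.text_stream:  # async iterator of text deltas
                print(text, end="", flush=True)    # render incrementally: TTFT is what the user feels
                chunks.append(text)
    except anthropic.APIError:
        # Mid-stream failure: there is no resume, so surface the partial text
        # or re-issue the whole request from scratch.
        raise
    return "".join(chunks)

# asyncio.run(stream_reply("Explain SSE in two sentences."))
```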

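And a structured-output sketch, assuming instructor's from_openai wrapper and its response_model/max_retries parameters; the Ticket schema is purely illustrative:

```python
import instructor
import openai
from pydantic import BaseModel, Field

class Ticket(BaseModel):
    summary: str = Field(description="One-sentence summary of the request")
    severity: int = Field(ge=1, le=5)

client = instructor.from_openai(openai.AsyncOpenAI())

async def extract_ticket(text: str) -> Ticket:
    # instructor injects the schema into the request, validates the reply
    # against the Pydantic model, and re-asks with the validation error
    # up to max_retries times if the model violates the schema.
    return await client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Ticket,
        max_retries=2,
        messages=[{"role": "user", "content": text}],
    )
```
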
21.3 Lab - "A Disciplined LLM Client"

  1. Build an LLMClient abstraction over anthropic and openai async SDKs. Methods: generate, stream, with_tools.
  2. Add token accounting: pre-call estimate, post-call actual, running cost meter.
  3. Add prompt caching (Anthropic's cache_control breakpoints). Measure the latency and cost delta; see the caching sketch after this list.
  4. Add structured-output mode using instructor + a Pydantic schema. Test on a deliberately ambiguous prompt; observe schema enforcement.
  5. Add timeout, retry-with-backoff, and a circuit breaker (pybreaker or hand-rolled; sketch after this list).
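
For step 3, a caching sketch assuming Anthropic's cache_control request shape and usage field names; the model ID is a placeholder, and the prefix must exceed the model's minimum cacheable length to actually cache:

```python
import anthropic

client = anthropic.AsyncAnthropic()

async def ask(question: str, system_prefix: str):
    resp = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        system=[{
            "type": "text",
            "text": system_prefix,  # large and stable: this is the cacheable prefix
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )
    # First call writes the cache; calls within the TTL read it back cheaply.
    print(resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)
    return resp
```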

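For step 5, a hand-rolled sketch; the threshold, cooldown, and timeout values are arbitrary, and pybreaker gives you the breaker half off the shelf:

```python
import asyncio
import random
import time

import anthropic

class CircuitOpen(Exception):
    pass

class Breaker:
    """Open after `threshold` consecutive failures; probe again after `cooldown`."""
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, 0.0

    def check(self) -> None:
        if self.failures >= self.threshold and time.monotonic() - self.opened_at < self.cooldown:
            raise CircuitOpen("LLM backend circuit is open; failing fast")

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # (re)open the circuit

breaker = Breaker()

async def call_llm(make_call, *, attempts: int = 4, base: float = 1.0):
    """`make_call` is a zero-arg coroutine factory, e.g. lambda: client.messages.create(...)."""
    for attempt in range(attempts):
        breaker.check()  # fail fast while the circuit is open
        try:
            result = await asyncio.wait_for(make_call(), timeout=60)  # hard timeout
            breaker.record(ok=True)
            return result
        except (anthropic.RateLimitError, asyncio.TimeoutError):
            breaker.record(ok=False)
            if attempt == attempts - 1:
                raise
            # Exponential backoff with full jitter to avoid thundering herds.
            await asyncio.sleep(random.uniform(0, base * 2 ** attempt))
```
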
21.4 Production Hardening Slice

  • Add per-request trace_id, model, prompt_tokens, completion_tokens, cached_tokens, cost_usd to your structured logs. This is the only way you'll keep cost under control in production; a minimal sketch follows.
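
A minimal sketch using structlog (one reasonable choice, not prescribed above), assuming an Anthropic-style usage object with input_tokens/output_tokens fields:

```python
import structlog

log = structlog.get_logger()

def log_llm_call(trace_id: str, model: str, usage, cost_usd: float) -> None:
    # One structured line per call; aggregate cost_usd by model and caller downstream.
    log.info(
        "llm_call",
        trace_id=trace_id,
        model=model,
        prompt_tokens=usage.input_tokens,
        completion_tokens=usage.output_tokens,
        cached_tokens=getattr(usage, "cache_read_input_tokens", 0) or 0,
        cost_usd=round(cost_usd, 6),
    )
```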
