09-LLM Application Engineering¶
Why this matters in the journey¶
This sequence is where backend engineering and AI fuse. You take a foundation model and turn it into a system that does something useful. The skills are: prompt design, structured outputs, tool use, streaming, prompt caching, error handling, observability, cost/latency budgeting, and basic eval discipline. Most "AI engineers" today are LLM application engineers.
The rungs¶
Rung 01-Anatomy of a chat completion request¶
- What: A request has: model, messages (system, user, and assistant turns), and parameters (temperature, max tokens, stop sequences). The response has content, a finish reason, and usage (token counts).
- Why it earns its place: Every API call is this shape. Knowing it cold accelerates everything.
- Resource: Anthropic Messages API docs (docs.anthropic.com) and OpenAI Chat Completions docs.
- Done when: You can call both Anthropic and OpenAI from Python and inspect the full response shape.
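To make the anatomy concrete, here is a minimal sketch of the request and response shapes. Field names follow Anthropic's Messages API style; the response is a hand-written sample rather than a live call, and the model id is a placeholder.

```python
# The request shape shared by chat-completion APIs: model, messages, parameters.
request = {
    "model": "claude-sonnet-example",  # placeholder model id
    "max_tokens": 1024,
    "system": "You are a concise assistant.",
    "messages": [
        {"role": "user", "content": "Name one prime number."},
    ],
    "temperature": 0.0,
    "stop_sequences": ["\n\n"],
}

# A hand-written sample response in Anthropic's shape (not a live call).
response = {
    "content": [{"type": "text", "text": "2"}],
    "stop_reason": "end_turn",
    "usage": {"input_tokens": 21, "output_tokens": 1},
}

def summarize(resp):
    """Pull out the three things you always inspect: text, finish reason, usage."""
    text = "".join(b["text"] for b in resp["content"] if b["type"] == "text")
    return text, resp["stop_reason"], resp["usage"]

text, stop_reason, usage = summarize(response)
```

OpenAI's Chat Completions response nests the text differently (`choices[0].message.content`, `finish_reason`), but the same three pieces are always there.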
Rung 02-Prompt engineering fundamentals¶
- What: System prompt, few-shot examples, chain-of-thought, role-playing, output format instructions. Anti-patterns: vague asks, conflicting instructions, no examples.
- Why it earns its place: Most quality wins early in a project come from prompt structure, not model swap.
- Resource: Anthropic's prompt engineering docs. OpenAI cookbook examples. Plus the Prompt Engineering Guide (promptingguide.ai).
- Done when: You can take a vague task and produce a structured prompt with system instructions and 3 few-shot examples that materially improves output.
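One common way to structure few-shot examples is as prior user/assistant turns, so the model sees the exact input/output format you expect. A minimal sketch (the classification task and labels are invented for illustration):

```python
SYSTEM = (
    "You classify customer emails into exactly one label: "
    "billing, technical, or other. Reply with the label only."
)

# Few-shot examples as prior conversation turns.
FEW_SHOT = [
    ("My invoice charged me twice this month.", "billing"),
    ("The app crashes when I upload a PNG.", "technical"),
    ("Do you sell branded T-shirts?", "other"),
]

def build_messages(email: str) -> list[dict]:
    """Turn a vague ask into a structured few-shot prompt."""
    messages = []
    for example_in, example_out in FEW_SHOT:
        messages.append({"role": "user", "content": example_in})
        messages.append({"role": "assistant", "content": example_out})
    messages.append({"role": "user", "content": email})
    return messages

msgs = build_messages("Why was my card declined?")
```

The system string plus these messages go into the request from Rung 01; the few-shot turns do most of the work of pinning down the output format.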
Rung 03-Structured outputs¶
- What: Force the LLM to return JSON matching a schema. Pydantic + Anthropic tool use / OpenAI structured outputs / function calling.
- Why it earns its place: Production LLM systems consume parseable outputs, not free text. This is the most-used technique in AI engineering today.
- Resource: Anthropic structured outputs docs. OpenAI structured outputs docs. The instructor library (search "instructor python library jxnl").
- Done when: You can define a Pydantic model and reliably extract instances of it from an LLM call, with retries on validation failure.
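The retry-on-validation-failure loop is the core of this rung. A stdlib-only sketch with a fake model, so it runs offline: in real code the hand-rolled `validate_invoice` would be a Pydantic model's `model_validate_json`, and `fake_llm` would be an API call that receives the validation error as corrective feedback.

```python
import json

def validate_invoice(raw: str) -> dict:
    """Stand-in for Pydantic validation: parse JSON and check the schema,
    raising ValueError on any mismatch."""
    data = json.loads(raw)
    if not isinstance(data.get("customer"), str):
        raise ValueError("customer must be a string")
    if not isinstance(data.get("total_cents"), int):
        raise ValueError("total_cents must be an integer")
    return data

def extract_with_retries(call_llm, max_attempts=3):
    """Call the model, validate, and on failure re-prompt with the error
    message so the model can correct itself."""
    error = None
    for _ in range(max_attempts):
        raw = call_llm(error)
        try:
            return validate_invoice(raw)
        except (json.JSONDecodeError, ValueError) as exc:
            error = str(exc)
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {error}")

# A fake model that returns a wrong type once, then corrects itself.
attempts = []
def fake_llm(error_feedback):
    attempts.append(error_feedback)
    if error_feedback is None:
        return '{"customer": "Ada", "total_cents": "1999"}'  # string, not int
    return '{"customer": "Ada", "total_cents": 1999}'

invoice = extract_with_retries(fake_llm)
```

Feeding the validation error back into the next attempt is what makes retries converge instead of repeating the same mistake.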
Rung 04-Tool use¶
- What: Give the LLM a list of "tools" (functions with schemas); it decides when to call them; you execute the call and feed the result back; the LLM produces a final answer.
- Why it earns its place: Tool use is the basis of agents, RAG, and almost every interesting LLM application.
- Resource: Anthropic tool use docs (docs.anthropic.com/claude/docs/tool-use). OpenAI function calling docs.
- Done when: You can implement a multi-turn tool-calling loop with at least 2 tools (e.g., a calculator + a web search).
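The loop described above can be sketched offline with a scripted fake model standing in for the API (the message shapes here are simplified, not any provider's exact wire format):

```python
# Tool registry: name -> description plus Python implementation.
TOOLS = {
    "calculator": {
        "description": "Evaluate an arithmetic expression.",
        "run": lambda args: str(eval(args["expression"], {"__builtins__": {}})),
    },
}

def tool_loop(model, user_message, max_turns=5):
    """The canonical loop: send messages, execute any tool call the model
    makes, append the result, repeat until the model answers in text."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = model(messages)
        messages.append(reply)
        if reply["type"] == "text":
            return reply["text"]
        # reply is a tool call: execute it and feed the result back.
        result = TOOLS[reply["name"]]["run"](reply["args"])
        messages.append({"role": "tool_result", "content": result})
    raise RuntimeError("model never produced a final answer")

# A scripted fake model: first requests the calculator, then answers.
def fake_model(messages):
    if messages[-1]["role"] == "user":
        return {"type": "tool_use", "role": "assistant", "name": "calculator",
                "args": {"expression": "17 * 23"}}
    result = messages[-1]["content"]
    return {"type": "text", "role": "assistant", "text": f"17 x 23 = {result}"}

answer = tool_loop(fake_model, "What is 17 times 23?")
```

Note the `max_turns` bound: real tool loops need it so a confused model cannot call tools forever.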
Rung 05-Streaming¶
- What: Receive the response token-by-token as it's generated. Used for chat UIs and to start downstream work earlier.
- Why it earns its place: Responsiveness is the difference between a usable product and a slow one. Streaming is also harder to instrument, which is a bonus reason to learn it.
- Resource: Anthropic streaming docs. OpenAI streaming docs.
- Done when: You can stream a response, accumulate tokens, and handle the "stream ended early" failure mode.
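The accumulate-and-handle-disconnect pattern can be sketched with a generator standing in for the event stream (real SDKs yield typed delta events rather than bare strings, but the control flow is the same):

```python
def fake_stream(chunks, fail_after=None):
    """Simulate a stream of text deltas; optionally drop the connection
    partway through, like a real SSE stream ending early."""
    for i, chunk in enumerate(chunks):
        if fail_after is not None and i == fail_after:
            raise ConnectionError("stream ended early")
        yield chunk

def consume(stream):
    """Accumulate deltas; on disconnect, return the partial text plus a flag
    so the caller can decide whether to retry or surface partial output."""
    parts = []
    try:
        for delta in stream:
            parts.append(delta)
        return "".join(parts), True
    except ConnectionError:
        return "".join(parts), False

full, complete = consume(fake_stream(["Hel", "lo ", "world"]))
partial, complete2 = consume(fake_stream(["Hel", "lo ", "world"], fail_after=2))
```

Returning the partial text rather than discarding it matters: a retry can often continue from what was already received, and a UI can show it while reconnecting.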
Rung 06-Prompt caching¶
- What: Mark long stable prefixes (system prompts, large context, examples) for caching so they're not reprocessed each call. Anthropic, OpenAI, Google all support variations.
- Why it earns its place: Cuts cost by 60–90% and latency by 30–80% on common patterns. A core discipline for any working AI engineer.
- Resource: Anthropic prompt caching docs (docs.anthropic.com/claude/docs/prompt-caching). OpenAI prompt caching announcement.
- Done when: You can structure a long-context call to maximize cache hit rate and verify hits in the API response.
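A sketch of both halves of the rung: marking the stable prefix (shown in Anthropic's `cache_control` style; OpenAI instead caches repeated prefixes automatically), and computing the hit rate from the usage block. The usage dicts below are hand-written samples, not live responses.

```python
LONG_SYSTEM_PROMPT = "You are a support agent. " + "Policy text... " * 200

def build_request(user_message: str) -> dict:
    return {
        "system": [{
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,             # stable prefix: cacheable
            "cache_control": {"type": "ephemeral"}, # Anthropic cache breakpoint
        }],
        "messages": [{"role": "user", "content": user_message}],  # varies per call
    }

def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache, per the usage block."""
    read = usage.get("cache_read_input_tokens", 0)
    total = (read + usage.get("cache_creation_input_tokens", 0)
             + usage["input_tokens"])
    return read / total if total else 0.0

# Sample usage blocks: the first call writes the cache, the second reads it.
first = {"input_tokens": 40, "cache_creation_input_tokens": 1200,
         "cache_read_input_tokens": 0}
second = {"input_tokens": 40, "cache_creation_input_tokens": 0,
          "cache_read_input_tokens": 1200}
rate1 = cache_hit_rate(first)
rate2 = cache_hit_rate(second)
```

The key structural rule: put everything stable (system prompt, examples, large documents) before everything that varies, because caching works on prefixes.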
Rung 07-Cost, latency, and token accounting¶
- What: Tokens in, tokens out, cache reads, cache writes: each is priced differently. p50/p95 latency for streaming vs non-streaming. Caching impact.
- Why it earns its place: Senior AI engineers know the unit economics of every prompt they ship. Junior ones don't.
- Resource: Pricing pages of major providers + tools like LiteLLM that abstract token accounting.
- Done when: For your project, you can answer "what does one user interaction cost in tokens / dollars / p95 ms?"
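The accounting itself is simple arithmetic once you have the usage block. A sketch with illustrative per-million-token prices (these numbers are made up; check your provider's pricing page):

```python
# Hypothetical $ per 1M tokens, for illustration only.
PRICE = {
    "input_tokens": 3.00,
    "output_tokens": 15.00,
    "cache_read_input_tokens": 0.30,      # cache reads: typically ~10x cheaper
    "cache_creation_input_tokens": 3.75,  # cache writes: typically a small premium
}

def interaction_cost(usage: dict) -> float:
    """Dollar cost of one call, computed from its usage block."""
    return sum(
        usage.get(kind, 0) / 1_000_000 * rate
        for kind, rate in PRICE.items()
    )

usage = {"input_tokens": 400, "output_tokens": 800,
         "cache_read_input_tokens": 20_000}
cost = interaction_cost(usage)  # 0.0012 + 0.012 + 0.006 dollars
```

Run this over a day of logged usage blocks and you can answer the "what does one interaction cost" question directly.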
Rung 08-Provider abstraction with LiteLLM¶
- What: A library that gives a unified interface across Anthropic, OpenAI, Google, open-source models, etc.
- Why it earns its place: Provider lock-in is risky. Eval frameworks compare across providers. LiteLLM is the de facto standard.
- Resource: LiteLLM docs (docs.litellm.ai).
- Done when: You can swap model providers in your project with a one-line change.
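The reason the swap is one line is that LiteLLM routes on a `"provider/model"` string. This toy router is not LiteLLM itself, just a stdlib illustration of that routing idea; LiteLLM's real entry point is `litellm.completion`, which accepts the same style of model string and returns an OpenAI-shaped response.

```python
def anthropic_call(model, messages):   # stand-ins for the real SDK calls
    return f"[anthropic:{model}] ok"

def openai_call(model, messages):
    return f"[openai:{model}] ok"

PROVIDERS = {"anthropic": anthropic_call, "openai": openai_call}

def completion(model: str, messages: list[dict]) -> str:
    """Route 'provider/model-name' to the right backend."""
    provider, _, model_name = model.partition("/")
    return PROVIDERS[provider](model_name, messages)

msgs = [{"role": "user", "content": "hi"}]
a = completion("anthropic/claude-model", msgs)  # swapping providers is just
b = completion("openai/gpt-model", msgs)        # a change to this one string
```

Because every call goes through one `completion()` with a normalized response shape, evals and cost tracking can compare providers without code changes.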
Rung 09-Error handling, retries, rate limiting¶
- What: Exponential backoff on 429, 500, network errors. Idempotency keys. Concurrency control with semaphores.
- Why it earns its place: Naive error handling makes evals and agents flaky. Production reliability begins here.
- Resource: Tenacity library docs (tenacity.readthedocs.io). Anthropic and OpenAI both have rate-limit best-practices guides.
- Done when: Your code handles a transient 429 storm gracefully with bounded concurrency.
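A stdlib sketch of the two mechanisms together: exponential backoff with jitter (tenacity wraps this pattern for you in real code), and a semaphore bounding in-flight calls. The flaky function simulates a 429 storm that clears after the first two calls.

```python
import random
import threading
import time

class RateLimitError(Exception):
    """Stand-in for a provider's 429 error."""

def with_backoff(fn, max_attempts=5, base=0.01):
    """Retry on rate limits with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base * (2 ** attempt) * (1 + random.random()))

semaphore = threading.Semaphore(2)  # at most 2 in-flight "API calls"
lock = threading.Lock()
calls = {"count": 0}

def flaky_call():
    """Simulated endpoint: the first two calls hit a 429, then it recovers."""
    with lock:
        calls["count"] += 1
        n = calls["count"]
    if n <= 2:
        raise RateLimitError
    return "ok"

def worker(results, i):
    with semaphore:
        results[i] = with_backoff(flaky_call)

results = {}
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The jitter term spreads retries out so that many clients backing off from the same 429 storm do not all retry in lockstep.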
Rung 10-DSPy (a different paradigm)¶
- What: A library that treats prompts as compilable programs: you write signatures, and DSPy optimizes the prompts.
- Why it earns its place: Even if you don't use DSPy in production, working through its tutorials changes how you think about prompts, toward declarative, testable specifications.
- Resource: DSPy docs (dspy.ai).
- Done when: You've completed at least one DSPy tutorial and can articulate the difference between prompt-as-string and prompt-as-program.
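The string-vs-program distinction can be illustrated without DSPy itself. The `Signature` class below is a toy of my own, not DSPy's API (DSPy's real building blocks are `dspy.Signature` and `dspy.Predict`, and its optimizers rewrite the compiled prompt automatically); the point is that the prompt text becomes an artifact compiled from a declarative spec you can test and optimize.

```python
from dataclasses import dataclass, field

# Prompt-as-string: the instruction lives in a literal you edit by hand.
PROMPT_AS_STRING = "Given a question, answer briefly.\nQuestion: {q}\nAnswer:"

# Prompt-as-program: declare what the task maps between, compile the text.
@dataclass
class Signature:
    instructions: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)

    def compile(self, **values) -> str:
        lines = [self.instructions]
        for name in self.inputs:
            lines.append(f"{name.capitalize()}: {values[name]}")
        for name in self.outputs:
            lines.append(f"{name.capitalize()}:")
        return "\n".join(lines)

qa = Signature("Given a question, answer briefly.", ["question"], ["answer"])
prompt = qa.compile(question="What is a token?")
```

With the spec separated from the text, an optimizer can regenerate the instructions or add examples while your code keeps calling the same signature.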
Rung 11-Evals (preview; full sequence in 12)¶
- What: Even at the application stage you need a small golden set, a metric, and a regression check before changing prompts.
- Why it earns its place: Without evals, "prompt improvements" are folklore.
- Resource: Hamel Husain's eval blog series (hamel.dev). Read the first three posts now.
- Done when: You have 20–50 golden examples and a Python script that scores any prompt change against them.
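The regression check is a few lines once the golden set exists. In this sketch the two "prompt versions" are simulated as keyword classifiers so it runs offline; in real use `predict()` would call the LLM with the prompt under test, and the golden set would be 20–50 examples loaded from a file.

```python
# A tiny golden set: (input, expected) pairs.
GOLDEN = [
    ("My invoice charged me twice.", "billing"),
    ("The app crashes on startup.", "technical"),
    ("Do you sell T-shirts?", "other"),
]

def score(predict, golden) -> float:
    """Exact-match accuracy of a prediction function over the golden set."""
    hits = sum(1 for text, expected in golden if predict(text) == expected)
    return hits / len(golden)

# Stand-ins for two prompt versions under comparison.
def prompt_v1(text):
    return "billing" if "invoice" in text else "other"

def prompt_v2(text):
    if "invoice" in text:
        return "billing"
    if "crashes" in text:
        return "technical"
    return "other"

baseline = score(prompt_v1, GOLDEN)
candidate = score(prompt_v2, GOLDEN)
regressed = candidate < baseline  # gate every prompt change on this check
```

Exact match is the crudest metric; the point at this stage is having any scored baseline at all, so that "the new prompt is better" becomes a number rather than folklore.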
Minimum required to leave this sequence¶
- Make working calls to two LLM providers and compare responses.
- Write a structured-output Pydantic schema and reliably extract it.
- Implement a multi-tool-use loop.
- Stream a response and handle disconnection.
- Set up prompt caching and verify hit rate.
- Cost-account a single user interaction in tokens and dollars.
- Build a 30-example golden set and measure a prompt change against it.
Going further¶
- AI Engineering (Chip Huyen): read cover to cover.
- Hands-On Large Language Models (Alammar & Grootendorst): a practical companion.
- Anthropic "Building Effective Agents" post: the patterns playbook.
- OpenAI Cookbook (github.com/openai/openai-cookbook): code patterns galore.
How this sequence connects to the year¶
- Month 4: This sequence IS month 4.
- Months 5–6: RAG and agents build on rungs 03, 04, 09.
- Q3 onwards: Everything in this sequence is your daily toolkit.