09-LLM Application Engineering

Why this matters in the journey

This sequence is where backend engineering and AI fuse. You take a foundation model and turn it into a system that does something useful. The skills are: prompt design, structured outputs, tool use, streaming, prompt caching, error handling, observability, cost/latency budgeting, and basic eval discipline. Most "AI engineers" today are LLM application engineers.

The rungs

Rung 01-Anatomy of a chat completion request

  • What: A request carries a model, messages (system, user, and assistant turns), and parameters (temperature, max tokens, stop sequences). A response carries content, a finish reason, and usage (token counts). See the sketch below.
  • Why it earns its place: Every API call is this shape. Knowing it cold accelerates everything.
  • Resource: Anthropic Messages API docs (docs.anthropic.com) and OpenAI Chat Completions docs.
  • Done when: You can call both Anthropic and OpenAI from Python and inspect the full response shape.
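
A minimal sketch of both calls, assuming the official anthropic and openai Python SDKs with API keys set in the environment. The model names are examples and will date; substitute current ones.

```python
import anthropic
import openai

a_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
a_resp = a_client.messages.create(
    model="claude-3-5-sonnet-20241022",  # example model name
    max_tokens=256,
    system="You are a concise assistant.",
    messages=[{"role": "user", "content": "Name three uses of a paperclip."}],
)
print(a_resp.content[0].text)  # content (a list of blocks)
print(a_resp.stop_reason)      # finish reason
print(a_resp.usage.input_tokens, a_resp.usage.output_tokens)  # usage

o_client = openai.OpenAI()  # reads OPENAI_API_KEY
o_resp = o_client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    max_tokens=256,
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Name three uses of a paperclip."},
    ],
)
print(o_resp.choices[0].message.content)  # content
print(o_resp.choices[0].finish_reason)    # finish reason
print(o_resp.usage.prompt_tokens, o_resp.usage.completion_tokens)  # usage
```

Note the shape differences: Anthropic takes system as a top-level field and returns a list of content blocks; OpenAI puts system in messages and returns choices.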

Rung 02-Prompt engineering fundamentals

  • What: System prompt, few-shot examples, chain-of-thought, role-playing, output format instructions. Anti-patterns: vague asks, conflicting instructions, no examples. (Sketch below.)
  • Why it earns its place: Most quality wins early in a project come from prompt structure, not from swapping models.
  • Resource: Anthropic's prompt engineering docs. OpenAI cookbook examples. Plus the Prompt Engineering Guide (promptingguide.ai).
  • Done when: You can take a vague task and produce a structured prompt with system instructions and 3 few-shot examples that materially improves output.
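
One way to package that structure in code, using a hypothetical sentiment-labeling task; the wording and few-shot pairs are illustrative, not canonical.

```python
# System instructions: role, task, and output format in one place.
SYSTEM = (
    "You label customer feedback. "
    "Reply with exactly one word: positive, negative, or mixed."
)

# Few-shot examples encoded as prior user/assistant turns.
FEW_SHOT = [
    {"role": "user", "content": "The checkout flow is so fast now!"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "App crashes every time I open settings."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "Love the design, but search is broken."},
    {"role": "assistant", "content": "mixed"},
]

def build_messages(feedback: str) -> list[dict]:
    """Few-shot turns first, then the real input as the final user turn."""
    return FEW_SHOT + [{"role": "user", "content": feedback}]
```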

Rung 03-Structured outputs

  • What: Force the LLM to return JSON matching a schema. Pydantic + Anthropic tool use / OpenAI structured outputs / function calling. (Sketch below.)
  • Why it earns its place: Production LLM systems consume parseable outputs, not free text. This is the most-used technique in AI engineering today.
  • Resource: Anthropic structured outputs docs. OpenAI structured outputs docs. The instructor library (search "instructor python library jxnl").
  • Done when: You can define a Pydantic model and reliably extract instances of it from an LLM call with retries on validation failure.
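
A minimal retry-on-validation-failure sketch, assuming Pydantic v2 and a hypothetical Invoice schema; the instructor library packages this same pattern.

```python
import json

import anthropic
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):  # illustrative schema
    vendor: str
    total_usd: float
    line_items: list[str]

client = anthropic.Anthropic()

def extract_invoice(text: str, max_attempts: int = 3) -> Invoice:
    """Ask for JSON matching the schema; re-prompt on validation failure."""
    prompt = (
        "Extract the invoice as JSON with keys vendor (string), "
        "total_usd (number), line_items (list of strings). "
        "Return only the JSON object.\n\n" + text
    )
    for _ in range(max_attempts):
        resp = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # example model name
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = resp.content[0].text
        try:
            return Invoice.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the validation error back so the retry can self-correct.
            prompt += f"\n\nYour reply failed validation ({err}). Return only valid JSON."
    raise RuntimeError("schema extraction failed after retries")
```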

Rung 04-Tool use

  • What: Give the LLM a list of "tools" (functions with schemas); it decides when to call them; you execute the call and feed the result back; the LLM produces a final answer. The loop is sketched below.
  • Why it earns its place: Tool use is the basis of agents, RAG, and almost every interesting LLM application.
  • Resource: Anthropic tool use docs (docs.anthropic.com/claude/docs/tool-use). OpenAI function calling docs.
  • Done when: You can implement a multi-turn tool-calling loop with at least 2 tools (e.g., a calculator + a web search).
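
A sketch of the loop against Anthropic's tool-use API, with a single calculator tool for brevity; the model name is an example and the eval-based calculator is strictly a placeholder.

```python
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression, e.g. '2 * (3 + 4)'.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    if name == "calculator":
        # Placeholder only: never eval untrusted input in real code.
        return str(eval(args["expression"], {"__builtins__": {}}))
    raise ValueError(f"unknown tool: {name}")

def agent_loop(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        resp = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # example model name
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        if resp.stop_reason != "tool_use":
            return resp.content[0].text  # final answer
        # Record the assistant turn, execute each tool call, feed results back.
        messages.append({"role": "assistant", "content": resp.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": run_tool(b.name, b.input)}
            for b in resp.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})

print(agent_loop("What is 17 * 23, and is that more than 400?"))
```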

Rung 05-Streaming

  • What: Receive the response token-by-token as it's generated. Used for chat UIs and to start downstream work earlier. (Sketch below.)
  • Why it earns its place: Responsiveness is the difference between a usable product and a slow one. Streaming is also harder to instrument, which is a bonus reason to learn it early.
  • Resource: Anthropic streaming docs. OpenAI streaming docs.
  • Done when: You can stream a response, accumulate tokens, and handle the "stream ended early" failure mode.
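
A sketch using the Anthropic SDK's streaming helper; the early-termination handling shown is one reasonable policy, not the only one.

```python
import anthropic

client = anthropic.Anthropic()

def stream_answer(prompt: str) -> str:
    """Stream tokens, accumulate them, and surface early-termination modes."""
    chunks: list[str] = []
    try:
        with client.messages.stream(
            model="claude-3-5-sonnet-20241022",  # example model name
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            for text in stream.text_stream:
                chunks.append(text)
                print(text, end="", flush=True)  # live UI output
            final = stream.get_final_message()
            if final.stop_reason == "max_tokens":
                print("\n[truncated: hit max_tokens]")
    except anthropic.APIConnectionError:
        # Stream dropped mid-generation: retry, or return the partial text.
        print("\n[stream ended early; returning partial output]")
    return "".join(chunks)
```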

Rung 06-Prompt caching

  • What: Mark long stable prefixes (system prompts, large context, examples) for caching so they're not reprocessed on each call. Anthropic, OpenAI, and Google all support variations. (Example below.)
  • Why it earns its place: Cuts cost by 60–90% and latency by 30–80% on common patterns. Treating caching as a default habit, not an afterthought, is core AI-engineering discipline.
  • Resource: Anthropic prompt caching docs (docs.anthropic.com/claude/docs/prompt-caching). OpenAI prompt caching announcement.
  • Done when: You can structure a long-context call to maximize cache hit rate and verify hits in the API response.
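
A sketch against Anthropic's cache_control parameter; the file name is hypothetical, and the provider enforces a minimum cacheable prefix length.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical large, stable prefix; it must exceed the provider's minimum
# cacheable length to be cached at all.
LONG_CONTEXT = open("style_guide.md").read()

def ask(question: str) -> str:
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # example model name
        max_tokens=512,
        system=[{
            "type": "text",
            "text": LONG_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # mark the prefix cacheable
        }],
        messages=[{"role": "user", "content": question}],
    )
    u = resp.usage
    # The first call pays a cache write; later calls should show cache reads.
    print(f"cache_write={u.cache_creation_input_tokens} "
          f"cache_read={u.cache_read_input_tokens}")
    return resp.content[0].text
```

The key design point: keep the cached prefix byte-identical across calls and put everything that varies (the user question) after it.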

Rung 07-Cost, latency, and token accounting

  • What: Tokens in, tokens out, cache reads, cache writes: each is priced differently. p50/p95 latency for streaming vs non-streaming. Caching impact. (Worked example below.)
  • Why it earns its place: Senior AI engineers know the unit economics of every prompt they ship. Junior ones don't.
  • Resource: Pricing pages of major providers + tools like LiteLLM that abstract token accounting.
  • Done when: For your project, you can answer "what does one user interaction cost in tokens / dollars / p95 ms?"
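
The arithmetic is worth keeping in a helper. The prices below are hypothetical placeholders; read current rates off the pricing page.

```python
# Hypothetical USD prices per million tokens; substitute current rates.
PRICE = {"input": 3.00, "output": 15.00, "cache_write": 3.75, "cache_read": 0.30}

def interaction_cost(input_toks: int, output_toks: int,
                     cache_write_toks: int = 0, cache_read_toks: int = 0) -> float:
    """Dollar cost of one interaction, with each token class priced separately."""
    return (
        input_toks * PRICE["input"]
        + output_toks * PRICE["output"]
        + cache_write_toks * PRICE["cache_write"]
        + cache_read_toks * PRICE["cache_read"]
    ) / 1_000_000

# e.g. 2k fresh input, 500 output, and a 10k cached prefix read from cache:
print(f"${interaction_cost(2_000, 500, cache_read_toks=10_000):.5f}")
```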

Rung 08-Provider abstraction with LiteLLM

  • What: A library that gives a unified interface across Anthropic, OpenAI, Google, open-source models, etc. (Sketch below.)
  • Why it earns its place: Provider lock-in is risky. Eval frameworks compare across providers. LiteLLM is the de facto standard.
  • Resource: LiteLLM docs (docs.litellm.ai).
  • Done when: You can swap model providers in your project with a one-line change.
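
A sketch of the swap, assuming litellm is installed; the model strings are examples.

```python
from litellm import completion  # pip install litellm

def ask(model: str, prompt: str) -> str:
    """Identical call shape for every provider; the model string selects it."""
    resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content  # OpenAI-shaped response everywhere

# Swapping providers is a one-string change:
print(ask("gpt-4o-mini", "Summarize the CAP theorem in one sentence."))
print(ask("anthropic/claude-3-5-sonnet-20241022",
          "Summarize the CAP theorem in one sentence."))
```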

Rung 09-Error handling, retries, rate limiting

  • What: Exponential backoff on 429s, 500s, and network errors. Idempotency keys. Concurrency control with semaphores. (Sketch below.)
  • Why it earns its place: Naive error handling makes evals and agents flaky. Production reliability begins here.
  • Resource: Tenacity library docs (tenacity.readthedocs.io). Anthropic and OpenAI both have rate-limit best-practices guides.
  • Done when: Your code handles a transient 429 storm gracefully with bounded concurrency.
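
A sketch combining Tenacity backoff with an asyncio semaphore, assuming the openai async SDK; the retry and concurrency numbers are illustrative.

```python
import asyncio

import openai
from tenacity import (retry, retry_if_exception_type, stop_after_attempt,
                      wait_exponential_jitter)

client = openai.AsyncOpenAI()
SEM = asyncio.Semaphore(5)  # bounded concurrency: at most 5 calls in flight

@retry(
    retry=retry_if_exception_type(
        (openai.RateLimitError, openai.InternalServerError,
         openai.APIConnectionError)
    ),
    wait=wait_exponential_jitter(initial=1, max=30),  # backoff with jitter
    stop=stop_after_attempt(6),
)
async def ask(prompt: str) -> str:
    async with SEM:  # hold a slot only while the request is in flight
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

async def main():
    prompts = [f"Define distributed-systems term #{i} in one line." for i in range(20)]
    print(await asyncio.gather(*(ask(p) for p in prompts)))

asyncio.run(main())
```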

Rung 10-DSPy (a different paradigm)

  • What: A library that treats prompts as compilable programs: you write signatures, and DSPy optimizes the prompts. (Example below.)
  • Why it earns its place: Even if you don't use DSPy in production, going through its tutorials changes how you think about prompts: toward declarative, testable specifications.
  • Resource: DSPy docs (dspy.ai).
  • Done when: You've completed at least one DSPy tutorial and can articulate the difference between prompt-as-string and prompt-as-program.
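
A taste of the paradigm, assuming a recent DSPy release; the signature and task are illustrative.

```python
import dspy

# Model string is an example; DSPy uses LiteLLM-style model naming.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class TicketTriage(dspy.Signature):
    """Classify a support ticket's urgency."""
    ticket: str = dspy.InputField()
    urgency: str = dspy.OutputField(desc="one of: low, medium, high")

triage = dspy.Predict(TicketTriage)
result = triage(ticket="Production database is down; customers see errors.")
print(result.urgency)  # the underlying prompt was generated, not hand-written
```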

Rung 11-Evals (preview; full sequence in 12)

  • What: Even at the application stage you need a small golden set, a metric, and a regression check before changing prompts. (Harness sketch below.)
  • Why it earns its place: Without evals, "prompt improvements" are folklore.
  • Resource: Hamel Husain's eval blog series (hamel.dev). Read the first three posts now.
  • Done when: You have 20–50 golden examples and a Python script that scores any prompt change against them.
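
A minimal scoring harness, assuming a golden.jsonl file of {"input": ..., "expected": ...} records (a hypothetical format) and exact match as the metric.

```python
import json
from typing import Callable

def exact_match(predicted: str, expected: str) -> bool:
    return predicted.strip().lower() == expected.strip().lower()

def score(run_prompt: Callable[[str], str],
          golden_path: str = "golden.jsonl") -> float:
    """Fraction of golden examples a prompt variant gets exactly right."""
    with open(golden_path) as f:
        examples = [json.loads(line) for line in f]
    hits = sum(exact_match(run_prompt(ex["input"]), ex["expected"])
               for ex in examples)
    return hits / len(examples)

# Usage: pass any callable mapping input text to model output, then gate
# prompt changes on the delta (call_llm and the prompts are hypothetical):
# baseline  = score(lambda x: call_llm(PROMPT_V1, x))
# candidate = score(lambda x: call_llm(PROMPT_V2, x))
```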

Minimum required to leave this sequence

  • Make working calls to two LLM providers and compare responses.
  • Write a structured-output Pydantic schema and reliably extract it.
  • Implement a multi-tool-use loop.
  • Stream a response and handle disconnection.
  • Set up prompt caching and verify hit rate.
  • Cost-account a single user interaction in tokens and dollars.
  • Build a 30-example golden set and measure a prompt change against it.

Going further

  • AI Engineering (Chip Huyen): read it cover to cover.
  • Hands-On Large Language Models (Alammar & Grootendorst): a practical companion.
  • Anthropic's "Building Effective Agents" post: the patterns playbook.
  • OpenAI Cookbook (github.com/openai/openai-cookbook): code patterns galore.

How this sequence connects to the year

  • Month 4: This sequence IS month 4.
  • Months 5–6: RAG and agents build on rungs 03, 04, 09.
  • Q3 onwards: Everything in this sequence is your daily toolkit.