Month 4-Week 1: LLM application toolkit + first non-trivial app

Week summary

  • Goal: Set up your LLM application toolkit. Make calls to two providers (Anthropic + one other). Implement structured outputs with Pydantic. Pick and start your Q2 anchor project.
  • Time: ~9–10 h over 3 sessions.
  • Output: New repo (e.g., incident-triage-llm) with provider abstraction, Pydantic schemas, structured-output extraction, retries, and async batching.
  • Sequences relied on: 09-llm-application-engineering rungs 01, 02, 03; 04-python-for-ml rungs 01, 06, 07.

Why this week matters

Q2 begins. You're shifting from building models to building with models-most "AI engineering" jobs in 2026 live here. This week is the toolkit setup that everything else in Q2 sits on. The decisions you make now (anchor project domain, schema design, error handling style) carry through 12 weeks of work.

The anchor project matters. Pick something with real domain meaning to you-your day job, a hobby, an open problem. Generic chatbots make weak portfolio pieces. Specific applied systems make strong ones.

Prerequisites

  • Q1 complete (transformers from scratch, papers readable).
  • API keys: Anthropic (required), one other (OpenAI, Gemini, or use OpenRouter).

Schedule

  • Session A-Tue/Wed evening (~3 h): pick anchor + set up toolkit.
  • Session B-Sat morning (~3.5 h): structured outputs + Pydantic.
  • Session C-Sun afternoon (~2.5 h): async batching + retries + ship v0.

Session A-Anchor project + provider basics

Goal: Anchor project chosen and documented. First successful API calls to two providers. Project skeleton committed.

Part 1-Pick and document the anchor (45 min)

Recommended: an LLM-powered incident triage / RCA system on synthetic or real CI/CD telemetry. This leverages your SRE/observability background, produces a credible bridge story, and the work generalizes to many AI engineering interview problems.

Other strong options:

  • A code-review assistant (over a small codebase you know).
  • A log analysis / anomaly explanation tool.
  • A doc-search Q&A over a large repo's documentation.

Constraints for a good anchor:

  • Has structure (not "freeform chatbot").
  • Lets you measure quality (eval-able).
  • Has data you can synthesize or already have.
  • Personally interesting enough to sustain 12 weeks.

Document in incident-triage-llm/README.md:

  • Problem statement (1 paragraph).
  • Why it matters (1 paragraph).
  • Success metric (1 sentence-e.g., "automatically suggest the probable cause for 70% of incidents in our golden set").
  • What it is NOT (avoid scope creep).

Part 2-Project skeleton (45 min)

mkdir incident-triage-llm && cd incident-triage-llm
uv init
uv add anthropic openai litellm pydantic python-dotenv tenacity pytest
mkdir -p src/triage tests evals
echo "ANTHROPIC_API_KEY=" > .env.example
echo ".env" > .gitignore
git init && git add . && git commit -m "scaffold"

Layout:

incident-triage-llm/
├── src/triage/
│   ├── __init__.py
│   ├── client.py        # provider abstraction
│   ├── prompts.py       # system + few-shot
│   ├── schemas.py       # Pydantic models
│   └── pipeline.py      # main entry
├── tests/
├── evals/
├── pyproject.toml
└── README.md

Part 3-First calls to two providers (90 min)

Anthropic:

import anthropic
from dotenv import load_dotenv

load_dotenv()  # pull ANTHROPIC_API_KEY from .env into the environment
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(resp.content[0].text)  # first content block holds the text reply
print(resp.usage)            # input_tokens / output_tokens

OpenAI (or alternative):

from openai import OpenAI
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(resp.choices[0].message.content)
print(resp.usage)

LiteLLM unified interface:

from litellm import completion

messages = [{"role": "user", "content": "Why is the sky blue?"}]
resp = completion(model="claude-opus-4-7", messages=messages)
resp = completion(model="gpt-4o", messages=messages)  # same response shape!

Inspect the response shapes. Notice token usage, finish reasons, model IDs. Internalize them-they're how you'll bill, alert, and debug.
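
The two SDKs name these fields differently. A minimal normalization sketch (the attribute names are the SDK fields shown above; the function names are just for illustration):

def summarize_anthropic(resp) -> dict:
    # anthropic SDK: usage.input_tokens / usage.output_tokens, stop_reason on the message
    return {
        "model": resp.model,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "finish": resp.stop_reason,  # e.g. "end_turn", "max_tokens"
    }

def summarize_openai(resp) -> dict:
    # openai SDK: usage.prompt_tokens / usage.completion_tokens, finish_reason per choice
    return {
        "model": resp.model,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
        "finish": resp.choices[0].finish_reason,  # e.g. "stop", "length"
    }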

Output of Session A

  • Project skeleton committed.
  • Successful calls to both providers.
  • Documented anchor project in README.

Session B-Structured outputs and Pydantic

Goal: Force the LLM to return structured JSON matching a Pydantic schema. Validate, retry on failure.

Part 1-Design the schema (45 min)

For incident triage:

# src/triage/schemas.py
from pydantic import BaseModel, Field
from enum import Enum

class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

class IncidentReport(BaseModel):
    severity: Severity
    affected_service: str = Field(description="The primary service impacted")
    probable_cause: str = Field(description="Most likely root cause, 1-2 sentences")
    confidence: float = Field(ge=0.0, le=1.0,
                              description="Confidence in probable_cause, 0-1")
    recommended_actions: list[str] = Field(min_length=1, max_length=5,
                                           description="Concrete next steps for on-call")
    requires_human_escalation: bool

Schemas are self-documenting: the field descriptions are embedded in the JSON schema the LLM sees, so be specific.
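
To see exactly what the model will be shown, dump the generated schema with Pydantic v2's model_json_schema():

import json
from src.triage.schemas import IncidentReport

# The descriptions, enum values, and ge/le bounds all appear here -
# this is the contract the LLM is asked to satisfy.
print(json.dumps(IncidentReport.model_json_schema(), indent=2))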

Part 2-Anthropic tool use for structured outputs (90 min)

Anthropic doesn't have a native "JSON mode"-you use tool use to force structured output:

# src/triage/client.py
import anthropic
from .schemas import IncidentReport

EXTRACT_TOOL = {
    "name": "report_incident",
    "description": "Submit a structured incident report",
    "input_schema": IncidentReport.model_json_schema()
}

def triage(incident_description: str) -> IncidentReport:
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        tools=[EXTRACT_TOOL],
        tool_choice={"type": "tool", "name": "report_incident"},
        system="You are a senior on-call SRE. Triage the incident and submit a structured report.",
        messages=[{"role": "user", "content": incident_description}],
    )
    # Find the tool_use block
    for block in resp.content:
        if block.type == "tool_use" and block.name == "report_incident":
            return IncidentReport.model_validate(block.input)
    raise ValueError("No tool_use block returned")
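
A quick smoke test before writing real tests (the incident text is made up; run as a module, e.g. python -m src.triage.client from the repo root, so the relative imports resolve):

if __name__ == "__main__":
    report = triage(
        "14:32 UTC: 5xx spike on checkout-api after deploy of v2.3.4; "
        "DB connections saturated."
    )
    print(report.model_dump_json(indent=2))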

OpenAI structured outputs (alternative):

from openai import OpenAI
from .schemas import IncidentReport

def triage_openai(incident_description: str) -> IncidentReport:
    client = OpenAI()
    resp = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a senior on-call SRE..."},
            {"role": "user", "content": incident_description},
        ],
        response_format=IncidentReport,
    )
    msg = resp.choices[0].message
    if msg.parsed is None:  # the model can refuse; .parsed is then None
        raise ValueError(f"Model refused: {msg.refusal}")
    return msg.parsed

Part 3-Tests (45 min)

# tests/test_triage.py
import pytest
from src.triage.client import triage

@pytest.fixture
def sample_incident():
    return """
    [2026-04-15 14:32 UTC] Sudden spike in 5xx errors on checkout-api,
    p95 latency from 200ms to 4s. Coincides with deploy of v2.3.4 at 14:30.
    Database connections at saturation.
    """

def test_basic_extraction(sample_incident):
    report = triage(sample_incident)
    assert report.severity in {"critical", "high"}
    assert "checkout" in report.affected_service.lower()
    assert len(report.recommended_actions) >= 1

def test_schema_validity(sample_incident):
    report = triage(sample_incident)
    # Pydantic validation already ran; just check we got a valid object
    assert 0.0 <= report.confidence <= 1.0

Run: pytest -v.
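
Note these tests hit the live API and cost real (if small) money. A minimal guard so they skip cleanly when no key is present, using a standard pytest collection hook:

# tests/conftest.py
import os
import pytest

def pytest_collection_modifyitems(config, items):
    # Skip live-API tests when ANTHROPIC_API_KEY isn't set (e.g., CI without secrets)
    if not os.environ.get("ANTHROPIC_API_KEY"):
        skip_live = pytest.mark.skip(reason="ANTHROPIC_API_KEY not set")
        for item in items:
            item.add_marker(skip_live)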

Output of Session B

  • Pydantic schemas in repo.
  • Structured-output extraction working.
  • 2+ tests passing.

Session C-Async, retries, ship v0

Goal: Add concurrency control, retries with exponential backoff, and ship v0 with documented usage.

Part 1-Tenacity retries (45 min)

LLM APIs throw 429 (rate limit), 500 (server), and transient network errors. Always wrap in retries:

# src/triage/client.py
import anthropic
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type((
        anthropic.APIConnectionError,
        anthropic.RateLimitError,
        anthropic.InternalServerError,  # 5xx; don't retry 4xx client errors
    )),
)
def triage(incident_description: str) -> IncidentReport:
    ...
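
While tuning rate limits it helps to see each backoff as it happens. tenacity's before_sleep_log hook logs the exception and wait time before every retry; a sketch of the same decorator with logging added (the logger name is arbitrary):

import logging
from tenacity import retry, stop_after_attempt, wait_exponential, before_sleep_log

logger = logging.getLogger("triage")

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    before_sleep=before_sleep_log(logger, logging.WARNING),  # log before each backoff
)
def triage(incident_description: str) -> IncidentReport:
    ...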

Part 2-Async batching with concurrency limits (60 min)

For evaluating against many incidents, you need parallelism:

# src/triage/batch.py
import asyncio
from anthropic import AsyncAnthropic
from .client import EXTRACT_TOOL
from .schemas import IncidentReport

async def triage_async(client: AsyncAnthropic, description: str, sem: asyncio.Semaphore) -> IncidentReport:
    async with sem:  # cap in-flight requests
        resp = await client.messages.create(
            model="claude-opus-4-7",
            max_tokens=2048,
            tools=[EXTRACT_TOOL],
            tool_choice={"type": "tool", "name": "report_incident"},
            messages=[{"role": "user", "content": description}],
        )
    for block in resp.content:
        if block.type == "tool_use" and block.name == "report_incident":
            return IncidentReport.model_validate(block.input)
    raise ValueError("No tool_use block returned")

async def triage_batch(descriptions: list[str], concurrency: int = 10) -> list[IncidentReport | Exception]:
    client = AsyncAnthropic()
    sem = asyncio.Semaphore(concurrency)
    tasks = [triage_async(client, d, sem) for d in descriptions]
    # return_exceptions=True: failures come back as Exception objects instead of raising
    return await asyncio.gather(*tasks, return_exceptions=True)

Test it. Generate 50 synthetic incident descriptions (a list of strings). Run asyncio.run(triage_batch(descriptions)). Time it. Compare to sequential.

Expected: 50 calls in ~30 seconds with concurrency=10, vs ~5 minutes sequential.
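
A throwaway timing script to verify the speedup (the synthetic descriptions and module path are assumptions matching the layout above):

# evals/time_batch.py
import asyncio
import time

from src.triage.batch import triage_batch

descriptions = [
    f"[synthetic] 5xx spike on service-{i} after deploy; p95 latency degraded 20x."
    for i in range(50)
]

start = time.perf_counter()
results = asyncio.run(triage_batch(descriptions, concurrency=10))
elapsed = time.perf_counter() - start

ok = [r for r in results if not isinstance(r, Exception)]
print(f"{len(ok)}/{len(results)} succeeded in {elapsed:.1f}s")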

Part 3-README + ship v0 (45 min)

Update README:

  • Quickstart (uv sync, set API key, run example).
  • Architecture diagram (simple ASCII).
  • Roadmap pointing to M04-W02 (tool use), W03 (evals), etc.

Push v0 release:

git tag v0.1.0
git push --tags

Make repo public.

Output of Session C

  • Retries integrated.
  • Async batching working at 10× speedup.
  • v0.1.0 tagged and pushed.
  • README polished.

End-of-week artifact

  • Anchor project picked and documented
  • Two providers callable through LiteLLM
  • Pydantic schemas validating LLM outputs
  • Async batching of 50 calls
  • Retries + rate limiting working
  • v0.1.0 tagged, public

End-of-week self-assessment

  • I can call any LLM provider with a structured-output schema.
  • I can defend my anchor-project choice in 30 seconds.
  • My code handles transient failures gracefully.

Common failure modes for this week

  • Picking a chatbot as the anchor. Too generic. Pick something with structure.
  • Skipping schemas. Free-text outputs cannot be evaluated rigorously.
  • No async or retries. Without them, every Q2 week will fight infrastructure.

What's next (preview of M04-W02)

Tool use, streaming, prompt caching. Three primitives that separate demo apps from production-ready ones. Cost optimization begins.
