Week 23 - Agents, Tools, Durable Execution, Cost & Safety
23.1 Conceptual Core
An "agent" is an LLM in a loop over a tool-use protocol with state, exit conditions, and observability. The dangerous failure modes:
- Runaway loops: turn caps, cost caps, time caps. All three.
- Bad tool inputs: validate aggressively at the tool boundary; treat the LLM as untrusted input.
- Silent quality drift: log every step; replay traces in tests.
- Permission escalation: an agent with bash is a remote-code-execution surface. Sandbox.
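A minimal sketch of the "all three" caps on one loop. The names here (Budget, BudgetExceeded, run_agent, and the step callable that returns a done flag plus the turn's cost) are hypothetical, chosen for illustration; the point is only that every turn passes through a single check that enforces turns, cost, and wall time together.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a turn, cost, or wall-time cap is hit."""


@dataclass
class Budget:
    # Default limits are illustrative, not recommendations.
    max_turns: int = 10
    max_cost_usd: float = 0.50
    max_wall_s: float = 120.0
    turns: int = 0
    cost_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        """Call before every turn; raises as soon as any one cap is reached."""
        if self.turns >= self.max_turns:
            raise BudgetExceeded("turn cap")
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded("cost cap")
        if time.monotonic() - self.started >= self.max_wall_s:
            raise BudgetExceeded("time cap")

    def record(self, cost_usd: float) -> None:
        """Call after every turn with that turn's measured cost."""
        self.turns += 1
        self.cost_usd += cost_usd


def run_agent(step, budget: Budget) -> Budget:
    """Drive `step` (hypothetical: one LLM turn -> (done, cost_usd)) under the budget."""
    while True:
        budget.check()
        done, cost_usd = step()
        budget.record(cost_usd)
        if done:
            return budget
```

The wall-time check only runs between turns, so a single hung LLM or tool call still needs its own request timeout.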
Read Anthropic's "Building effective agents" once. The taxonomy (workflows vs. agents; chains, routers, parallelization, orchestrator-workers, evaluator-optimizer) is load-bearing.
23.2 Mechanical Detail
- Frameworks worth knowing (in 2026): pydantic-ai, dspy, instructor, langgraph (for graph-shaped flows), and the "build your own" path. Default to "build your own" until you've felt the pain - most frameworks add accidental complexity.
- Durable execution: temporal, inngest, or a state-machine table in Postgres. Critical when agents take minutes-to-hours and processes can crash.
- Tool definitions: Pydantic schemas → JSON Schema → tool spec. Use pydantic-ai or instructor to generate.
- Sandboxing: e2b, Docker, gVisor, Firecracker. Never exec LLM-generated code on the host.
- Observability for agents: langfuse, arize phoenix, or roll your own with OpenTelemetry spans-per-step.
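The "state-machine table" option for durable execution can be sketched in a few lines. This is an illustration under assumptions, not a reference design: it uses the standard-library sqlite3 module as a stand-in for Postgres so that it runs anywhere, and the table name agent_steps, its columns, and the helpers open_store, save_step, and resume are all hypothetical.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) per-turn state table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS agent_steps (
               run_id  TEXT    NOT NULL,
               turn    INTEGER NOT NULL,
               state   TEXT    NOT NULL,
               payload TEXT    NOT NULL,
               PRIMARY KEY (run_id, turn)
           )"""
    )
    return conn


def save_step(conn, run_id: str, turn: int, state: str, payload: dict) -> None:
    """Commit one turn. Committing per turn means a crash loses at most the turn in flight."""
    with conn:  # sqlite3 connection context manager: commit on success, roll back on error
        conn.execute(
            "INSERT OR REPLACE INTO agent_steps VALUES (?, ?, ?, ?)",
            (run_id, turn, state, json.dumps(payload)),
        )


def resume(conn, run_id: str):
    """Return (next_turn, last_state, last_payload); an unknown run starts at turn 0."""
    row = conn.execute(
        "SELECT turn, state, payload FROM agent_steps "
        "WHERE run_id = ? ORDER BY turn DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    if row is None:
        return 0, "start", {}
    return row[0] + 1, row[1], json.loads(row[2])
```

On startup the worker calls resume(conn, run_id) and continues from the returned turn instead of starting over. Tool calls with side effects still need to be idempotent, because the turn that was in flight at the crash will run again.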
23.3 Lab - "An Agent That Doesn't Burn Money"
- Build a research agent: takes a question, plans, calls web_search and fetch_url tools, synthesizes an answer with citations.
- Add: max-turns=10, max-tokens=200k, max-wall-time=120s, max-cost=$0.50. Verify each cap fires correctly.
- Persist agent state (turn-by-turn) to Postgres. Recover after a kill -9.
- Write replay tests: feed a saved trace to a test, mock the LLM, assert tool calls happen in the right order.
- Add an evaluator-optimizer loop: a critic LLM grades the answer; if score < threshold, revise once.
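One possible shape for the replay test, as a sketch. The trace format, the ReplayLLM class, and this toy run_agent loop are hypothetical stand-ins for whatever your agent actually records; only the tool names web_search and fetch_url come from the lab. The idea is that the mock returns the recorded decisions in order, the tools are stubbed, and the test asserts on the sequence of tool calls.

```python
from typing import Callable

# Hypothetical saved trace: the model's recorded decision for each turn.
SAVED_TRACE = [
    {"tool": "web_search", "args": {"query": "durable execution"}},
    {"tool": "fetch_url", "args": {"url": "https://example.com/a"}},
    {"final": "Answer with citation [1]."},
]


class ReplayLLM:
    """Stands in for the model: returns the recorded decisions in order."""

    def __init__(self, trace):
        self._turns = iter(trace)

    def next_action(self, history):
        return next(self._turns)


def run_agent(llm, tools: dict[str, Callable[..., str]], question: str, max_turns: int = 10):
    """Toy agent loop: ask the LLM for an action, run the tool, repeat until a final answer."""
    history = [("user", question)]
    calls = []
    for _ in range(max_turns):
        action = llm.next_action(history)
        if "final" in action:
            return action["final"], calls
        name = action["tool"]
        result = tools[name](**action["args"])
        calls.append(name)
        history.append((name, result))
    raise RuntimeError("turn cap reached without a final answer")


def test_replay_tool_order():
    # Stub tools: no network access in tests.
    tools = {
        "web_search": lambda query: "https://example.com/a",
        "fetch_url": lambda url: "page text",
    }
    answer, calls = run_agent(ReplayLLM(SAVED_TRACE), tools, "What is durable execution?")
    assert calls == ["web_search", "fetch_url"]
    assert "[1]" in answer
```

Because the LLM is mocked, the test is deterministic and costs nothing to run, so it can sit in CI and catch changes to the loop or tool wiring.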
23.4 Production Hardening Slice
- Add a "kill switch": a feature flag that immediately disables agent execution. Verify it works via an end-to-end test.
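A sketch of a kill switch that is checked before every step, so flipping the flag stops a run that is already in progress and not just new runs. The environment variable name AGENT_KILL_SWITCH, the AgentDisabled exception, and the helper names are hypothetical; a real deployment would more likely read the flag from a feature-flag service or a database row.

```python
import os
from typing import Callable, Iterable, Mapping


class AgentDisabled(RuntimeError):
    """Raised when the kill switch is on."""


def kill_switch_on(env: Mapping[str, str] = os.environ) -> bool:
    # AGENT_KILL_SWITCH is a hypothetical flag name used for illustration.
    return env.get("AGENT_KILL_SWITCH", "0").strip().lower() in {"1", "true", "on"}


def run_agent(
    steps: Iterable[Callable[[], str]],
    is_killed: Callable[[], bool] = kill_switch_on,
) -> list[str]:
    """Run steps in order, re-reading the flag before each one."""
    completed = []
    for step in steps:
        if is_killed():
            raise AgentDisabled(f"kill switch on after {len(completed)} step(s)")
        completed.append(step())
    return completed
```

An end-to-end test can flip the flag midway through a run and assert that no later step executes:

```python
flags = {"AGENT_KILL_SWITCH": "0"}
executed = []


def flip():
    flags["AGENT_KILL_SWITCH"] = "1"
    executed.append("flip")
    return "flipped"


steps = [lambda: executed.append("a") or "a", flip, lambda: executed.append("c") or "c"]
try:
    run_agent(steps, is_killed=lambda: kill_switch_on(flags))
except AgentDisabled:
    pass
assert executed == ["a", "flip"]  # the third step never ran
```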