# 11 - Agents

## Why this matters in the journey
"Agent" is overloaded-it covers everything from a simple tool-use loop to multi-agent research systems. The 2026 frontier is making agents reliable on complex tasks: SWE-bench, GAIA, real customer workflows. Your distributed-systems background is unusually well-matched to agent engineering-agents fail in distributed-systems-shaped ways (timeouts, partial failure, retries, idempotency, consistency).
## The rungs

### Rung 01 - Tool-use loop (the agent baseline)
- What: Model decides → calls tool → reads result → decides again, until "done." This is the simplest possible agent.
- Why it earns its place: 80% of "agents" in production are this. Master it before reaching for frameworks.
- Resource: Anthropic tool use docs (a complete loop example is given). Plus the Anthropic "Building Effective Agents" post.
- Done when: You can implement a tool-use loop from scratch (no framework) with at least 3 tools.
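The loop above can be sketched without any framework. This is a minimal illustration with the model stubbed out: in a real agent, `model` would be an LLM API call that returns either a final answer or a tool request, and the control flow shown here is the interesting part. All names (`run_agent`, `stub_model`) are hypothetical.

```python
def run_agent(model, tools, task, max_steps=10):
    """Loop: model decides -> tool runs -> result fed back -> repeat."""
    history = [("task", task)]
    for _ in range(max_steps):
        decision = model(history)  # {"tool": name, "args": {...}} or {"answer": ...}
        if "answer" in decision:
            return decision["answer"]
        result = tools[decision["tool"]](**decision["args"])
        history.append(("tool_result", decision["tool"], result))
    raise RuntimeError("step budget exhausted")

# Stub model: requests one addition, then answers with the tool result.
def stub_model(history):
    if history[-1][0] == "task":
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": history[-1][2]}

tools = {"add": lambda a, b: a + b}
print(run_agent(stub_model, tools, "What is 2+3?"))  # 5
```

Everything else in this sequence (ReAct, planning, reflection) is a refinement of this loop.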
### Rung 02 - ReAct: reasoning + acting
- What: Interleave "thought" and "action" steps. The reasoning is generated by the model and feeds the next action.
- Why it earns its place: ReAct is the canonical pattern that started the modern agent era. Every framework is a variation.
- Resource: ReAct paper (arxiv.org/abs/2210.03629). Plus a from-scratch implementation.
- Done when: You've implemented a ReAct agent and observed how its reasoning trace differs from a plain tool-use loop.
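The ReAct shape can be sketched as a loop over a text transcript: the model emits a Thought plus either an `Action: tool[arg]` line or a final `Answer:` line, and observations are appended back in. The model here is a stub; the parsing format and names are illustrative, not the paper's exact prompt.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")

def react_loop(model, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)          # model sees the whole trace so far
        transcript += step + "\n"
        if "Answer:" in step:
            return step.split("Answer:", 1)[1].strip()
        m = ACTION_RE.search(step)
        if m:                             # run the requested tool, feed back
            obs = tools[m.group(1)](m.group(2))
            transcript += f"Observation: {obs}\n"
    raise RuntimeError("no answer within step budget")

# Stub model: looks something up, then answers with the observation.
def stub(transcript):
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: lookup[capital of France]"
    obs = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Thought: I have the fact.\nAnswer: {obs}"

tools = {"lookup": lambda q: "Paris"}
print(react_loop(stub, tools, "What is the capital of France?"))  # Paris
```

The difference from the plain tool-use loop is that the reasoning ("Thought") is part of the transcript, so you can inspect why each action was taken.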
### Rung 03 - Planning patterns
- What: Plan-and-execute (plan first, then execute steps). ReWOO (plan with placeholders, fill in later). Hierarchical task decomposition.
- Why it earns its place: Pure ReAct fails on long-horizon tasks; planning patterns address this by separating the plan from its execution.
- Resource: Plan-and-Execute paper (search "plan and execute langchain"). ReWOO (arxiv.org/abs/2305.18323).
- Done when: You can articulate when to choose plan-and-execute over ReAct.
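The structural difference from ReAct can be shown in a few lines: the planner is called once up front to produce a step list, then each step is executed in order. Both functions are stubs standing in for model calls; names and the toy data are illustrative.

```python
def plan_and_execute(planner, executor, task):
    plan = planner(task)                         # one planning call -> list of steps
    results = []
    for step in plan:
        results.append(executor(step, results))  # each step sees prior results
    return results[-1]

planner = lambda task: ["find population of A", "find population of B", "compare"]

def executor(step, prior):
    data = {"find population of A": 5, "find population of B": 8}
    if step in data:
        return data[step]
    return max(prior)  # the "compare" step

print(plan_and_execute(planner, executor, "Which city is bigger?"))  # 8
```

In ReAct the model re-decides after every observation; here the plan is fixed up front, which is cheaper and more predictable but less adaptive when a step fails.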
### Rung 04 - Reflection and self-correction
- What: After an action, the agent critiques its own output and revises. The Reflexion paper formalizes this.
- Why it earns its place: Many quality wins on agent tasks come from a critique step, not better tools.
- Resource: Reflexion paper (arxiv.org/abs/2303.11366). Plus Self-Refine (arxiv.org/abs/2303.17651).
- Done when: You've added a reflection step to your agent and measured the quality delta with an eval.
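The generate-critique-revise shape (common to Reflexion and Self-Refine) is a small loop. All three roles are stubbed here; in practice each would be a model call, and the stop condition would be the critic declaring itself satisfied.

```python
def reflect(generate, critique, revise, task, max_rounds=3):
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:          # critic is satisfied -> stop early
            break
        draft = revise(task, draft, feedback)
    return draft

# Toy roles: the generator makes a typo, the critic catches it.
generate = lambda task: "drft"
critique = lambda task, d: "typo: drft -> draft" if "drft" in d else None
revise   = lambda task, d, fb: d.replace("drft", "draft")

print(reflect(generate, critique, revise, "write the word 'draft'"))  # draft
```

The "measured with an eval" part of this rung matters: run the same task set with and without the critique step and compare scores, since reflection roughly doubles cost per task.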
### Rung 05 - State management
- What: Agents have memory: short-term (within conversation), long-term (persistent across sessions), tool-result memory. State machines for control flow.
- Why it earns its place: State management is where naive agents break. Distributed-systems instincts transfer.
- Resource: LangGraph docs (search "langgraph"). Even if you don't use LangGraph, its state-machine model is a useful framing.
- Done when: You can sketch your agent as a state machine and identify where state is persisted.
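Sketching the agent as a state machine can be done directly in code: each state names a handler, handlers return the next state, and a shared context dict is the single place state lives (and the natural point to persist it between steps). This is a hypothetical minimal framing, not LangGraph's API.

```python
def run_machine(handlers, ctx, start="plan", max_steps=20):
    state = start
    while state != "done" and max_steps > 0:
        state = handlers[state](ctx)   # persist ctx here for crash recovery
        max_steps -= 1
    return ctx

# Toy plan -> act -> verify machine; dict.update returns None, so
# `... or "next"` is a compact way to mutate ctx and name the next state.
handlers = {
    "plan":   lambda ctx: ctx.update(plan=["step1"]) or "act",
    "act":    lambda ctx: ctx.update(result="ok") or "verify",
    "verify": lambda ctx: "done" if ctx["result"] == "ok" else "act",
}
ctx = run_machine(handlers, {})
print(ctx["result"])  # ok
```

The distributed-systems payoff: once state transitions are explicit and `ctx` is persisted, an interrupted agent run can be resumed instead of restarted.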
### Rung 06 - Tool design
- What: Tools are APIs the model uses. Good tools have: clear names, focused scope, structured inputs, structured outputs, error messages the model can act on.
- Why it earns its place: Bad tools sink agents. This is an underrated craft.
- Resource: Anthropic's tool design guide. Plus reading the tool definitions in popular agent frameworks.
- Done when: You can critique a poorly designed tool and produce a redesigned version.
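A before/after pair makes the criteria concrete. The schema shape below mirrors common function-calling formats (a name, a description, a JSON-Schema `parameters` object); the specific tool and fields are invented for illustration.

```python
bad_tool = {
    "name": "do_stuff",                       # vague name
    "description": "does things with data",   # gives the model no guidance
    "parameters": {
        "type": "object",
        "properties": {"input": {"type": "string"}},  # untyped grab-bag input
    },
}

good_tool = {
    "name": "search_orders",                  # clear verb + object
    "description": (
        "Search customer orders by status and date range. "
        "Returns at most `limit` orders, newest first. "
        "On invalid input, the error message lists the valid status values."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "status": {"type": "string", "enum": ["open", "shipped", "cancelled"]},
            "after":  {"type": "string", "format": "date"},
            "limit":  {"type": "integer", "minimum": 1, "maximum": 50},
        },
        "required": ["status"],
    },
}
```

Note that the good version constrains inputs with enums and bounds so the model cannot produce most invalid calls, and its description promises actionable error messages, which is what lets the model recover without a human.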
### Rung 07 - Multi-agent systems
- What: Multiple specialized agents coordinated by a router or supervisor. Examples: AutoGen, CrewAI, OpenAI Swarm.
- Why it earns its place: A 2024–2026 trend. Sometimes useful, often over-engineered; knowing which is which is where judgment shows.
- Resource: AutoGen docs (microsoft.github.io/autogen). CrewAI docs. Plus the OpenAI Swarm cookbook.
- Done when: You've built a 2-agent system (e.g., researcher + writer) and can articulate when this beats a single agent.
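The researcher + writer example reduces to a supervisor that routes work between two specialized roles. Both agents are stubs standing in for model-plus-tools loops; the structure, not the stubs, is the point.

```python
def researcher(task):
    # Would run its own tool-use loop (search, read) in practice.
    return f"notes on {task}"

def writer(notes):
    # Would draft and polish from the researcher's notes in practice.
    return f"Report based on: {notes}"

def supervisor(task):
    notes = researcher(task)   # route step 1 to the researcher
    return writer(notes)       # route step 2 to the writer

print(supervisor("LLM agents"))  # Report based on: notes on LLM agents
```

When critiquing your own system, ask whether the supervisor adds anything a single agent with both tool sets would not: the honest answer is often no, which is the "often over-engineered" half of this rung.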
### Rung 08 - Agent benchmarks
- What: SWE-bench (real GitHub issues), GAIA (general assistant), WebArena (web navigation), τ-bench (customer service), AgentBench.
- Why it earns its place: Benchmarks ground hand-wavy "agent capability" claims. Submitting to one is a strong public signal.
- Resource: SWE-bench paper + leaderboard (swebench.com). GAIA paper (arxiv.org/abs/2311.12983).
- Done when: You've evaluated an agent on at least one public benchmark, even if the score is low.
### Rung 09 - Agent evaluation rigor
- What: Trajectory evaluation (was each step correct?), outcome evaluation (was the final answer correct?), tool-call accuracy, cost per task.
- Why it earns its place: Most agent demos are cherry-picked. Real eval requires rigor.
- Resource: Hamel Husain's posts on agent evals. Plus the Inspect AI docs.
- Done when: You have an eval that scores both trajectory and outcome on a real task set.
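Scoring a single run on both axes can look like this. The function assumes a labeled task set supplies `expected_steps` and `expected_answer`; the step-matching rule (positional equality) is deliberately simple and would be replaced by fuzzier matching in a real eval.

```python
def score_run(trajectory, answer, expected_steps, expected_answer):
    # Trajectory: fraction of expected steps matched in order.
    step_hits = sum(1 for s, e in zip(trajectory, expected_steps) if s == e)
    return {
        "trajectory_score": step_hits / max(len(expected_steps), 1),
        "outcome_correct": answer == expected_answer,
        "num_steps": len(trajectory),   # crude proxy for cost per task
    }

scores = score_run(
    trajectory=["search", "read", "summarize"],
    answer="42",
    expected_steps=["search", "summarize"],
    expected_answer="42",
)
print(scores)  # trajectory_score 0.5, outcome_correct True, num_steps 3
```

The example shows why both axes matter: this run got the right answer (outcome correct) via a partially wrong path (trajectory 0.5), which a pure outcome eval would never surface.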
### Rung 10 - Failure modes and robustness
- What: Loops, hallucinated tool calls, runaway costs, prompt injection via tool outputs, infinite recursion.
- Why it earns its place: Production agents need guardrails. Distributed-systems thinking (timeouts, circuit breakers, budgets) directly applies.
- Resource: Simon Willison's prompt injection writing (simonwillison.net). Plus your own observability instincts applied.
- Done when: You've added: per-task budget cap, max-step cap, tool-call timeout, prompt-injection mitigation.
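The four guardrails from the "Done when" line can live in one wrapper around every tool call. This is a sketch: the thresholds are illustrative, the budget counts calls rather than dollars, the timeout is checked after the fact (real code would enforce it with a subprocess or async cancellation), and the injection check is a crude string match standing in for real mitigation.

```python
import time

def guarded_call(tool, args, budget, timeout_s=5.0, max_output=4000):
    if budget["spent"] >= budget["cap"]:
        raise RuntimeError("budget cap hit")           # per-task budget cap
    start = time.monotonic()
    result = tool(**args)
    if time.monotonic() - start > timeout_s:
        raise TimeoutError("tool call too slow")       # tool-call timeout
    budget["spent"] += 1
    text = str(result)[:max_output]                    # cap what reaches the prompt
    if "ignore previous instructions" in text.lower():
        raise ValueError("possible injection in tool output")  # crude check
    return text

budget = {"spent": 0, "cap": 3}
print(guarded_call(lambda x: x, {"x": "ok"}, budget))  # ok
```

The max-step cap lives in the agent loop itself (as in the Rung 01 sketch); together these are the circuit-breaker and budget patterns you already know, applied to an agent.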
### Rung 11 - Agentic systems in production
- What: Async execution, observability per step, human-in-the-loop checkpoints, audit logs.
- Why it earns its place: This is where you bring your backend skills home. It's the bridge from prototype to product.
- Resource: Langfuse / LangSmith agent tracing docs. Plus OpenTelemetry GenAI semantic conventions.
- Done when: Your agent has full step-by-step traces, an audit log, and a kill switch.
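A minimal step tracer shows the shape of the "Done when" requirements: every tool call gets a structured audit record, and a kill-switch flag halts the run between steps. Field names here are illustrative, not any tracing product's schema; real systems would emit these as OpenTelemetry spans.

```python
import time
import uuid

class Tracer:
    def __init__(self):
        self.records = []        # the audit log
        self.killed = False      # the kill switch

    def step(self, run_id, tool, args, result):
        if self.killed:
            raise RuntimeError("kill switch engaged")
        self.records.append({
            "run_id": run_id,
            "step_id": str(uuid.uuid4()),
            "ts": time.time(),
            "tool": tool,
            "args": args,
            "result": str(result)[:500],   # truncate for the audit log
        })
        return result

tracer = Tracer()
tracer.step("demo-run", "add", {"a": 1, "b": 2}, 3)
print(tracer.records[0]["tool"])  # add
```

Because every step funnels through one method, the kill switch takes effect at the next step boundary, which is exactly the human-in-the-loop checkpoint this rung asks for.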
## Minimum required to leave this sequence
- Tool-use loop from scratch with 3 tools.
- ReAct agent implemented from scratch.
- Reflection step measured against a no-reflection baseline.
- One multi-agent system built and critiqued.
- Evaluated an agent on a public benchmark or 50+ task set.
- Agent has step traces, budget caps, and timeouts.
## Going further
- Anthropic "Building Effective Agents": re-read every quarter.
- Lilian Weng, "LLM Powered Autonomous Agents" (lilianweng.github.io): foundational survey.
- Designing agentic AI systems: emerging books in 2025/2026; check current titles.
## How this sequence connects to the year
- Month 6: Rungs 01–04 are the build for month 6's agentic anchor.
- Q3 Track B: This sequence is your specialty if you pick agents.
- Q4: Robustness (rung 10) is what makes a capstone agent presentable.