
11-Agents

Why this matters in the journey

"Agent" is overloaded-it covers everything from a simple tool-use loop to multi-agent research systems. The 2026 frontier is making agents reliable on complex tasks: SWE-bench, GAIA, real customer workflows. Your distributed-systems background is unusually well-matched to agent engineering-agents fail in distributed-systems-shaped ways (timeouts, partial failure, retries, idempotency, consistency).

The rungs

Rung 01-Tool-use loop (the agent baseline)

  • What: Model decides → calls tool → reads result → decides again, until "done." This is the simplest possible agent; a minimal sketch follows this list.
  • Why it earns its place: 80% of "agents" in production are this. Master it before reaching for frameworks.
  • Resource: Anthropic tool use docs (a complete loop example is given). Plus the Anthropic "Building Effective Agents" post.
  • Done when: You can implement a tool-use loop from scratch (no framework) with at least 3 tools.
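
A minimal sketch of the loop, assuming a scripted `call_model` stub in place of a real chat API (the Anthropic docs show the actual wire format, with tool-use content blocks and tool-result messages). The shape is the point: dispatch the requested tool, feed the result back, and cap the number of steps.

```python
import json

# --- tools ----------------------------------------------------------------
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})  # stub tool

TOOLS = {"get_weather": get_weather}

# --- model stub -------------------------------------------------------------
# Hypothetical stand-in for a real chat API: the first call requests a tool,
# the second answers. Replace with your provider's client.
_SCRIPT = [
    {"tool": "get_weather", "args": {"city": "Madrid"}},
    {"final": "It's 21°C in Madrid."},
]

def call_model(messages: list[dict]) -> dict:
    return _SCRIPT[sum(m["role"] == "assistant" for m in messages)]

# --- the loop ---------------------------------------------------------------
def run(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "final" in reply:                      # model decided it's done
            return reply["final"]
        result = TOOLS[reply["tool"]](**reply["args"])    # execute the tool
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")   # never loop forever

print(run("What's the weather in Madrid?"))
```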

Rung 02-ReAct: reasoning + acting

  • What: Interleave "thought" and "action" steps. The model generates the reasoning, which feeds the next action (sketched after this list).
  • Why it earns its place: ReAct is the canonical pattern that started the modern agent era. Every framework is a variation.
  • Resource: ReAct paper (arxiv.org/abs/2210.03629). Plus a from-scratch implementation.
  • Done when: You've implemented a ReAct agent and observed how its reasoning trace differs from a plain tool-use loop.
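
A from-scratch sketch, again assuming a scripted stand-in model. What distinguishes ReAct from the plain loop above is the Thought/Action/Observation text protocol: the model's own reasoning is appended to the transcript and conditions the next action.

```python
import re

def search(query: str) -> str:
    return "Paris is the capital of France."      # stub tool

TOOLS = {"search": search}

# Hypothetical scripted model emitting the ReAct text protocol.
_TURNS = iter([
    "Thought: I should look this up.\nAction: search[capital of France]",
    "Thought: I have what I need.\nFinal Answer: Paris",
])

def call_model(transcript: str) -> str:
    return next(_TURNS)

def react(question: str, max_steps: int = 8) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        turn = call_model(transcript)
        transcript += "\n" + turn
        if "Final Answer:" in turn:
            return turn.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", turn)    # parse Action: tool[input]
        observation = TOOLS[match.group(1)](match.group(2))  # act on it
        transcript += f"\nObservation: {observation}"        # feed the result back
    raise RuntimeError("no answer within step budget")

print(react("What is the capital of France?"))
```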

Rung 03-Planning patterns

  • What: Plan-and-execute (plan first, then execute steps). ReWOO (plan with placeholders, fill in later). Hierarchical task decomposition. A plan-and-execute sketch follows this list.
  • Why it earns its place: Pure ReAct fails on long-horizon tasks; planning patterns address this by separating the plan from step-by-step execution.
  • Resource: Plan-and-Execute paper (search "plan and execute langchain"). ReWOO (arxiv.org/abs/2305.18323).
  • Done when: You can articulate when to choose plan-and-execute over ReAct.
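
A plan-and-execute skeleton, with `plan` and `execute_step` as hypothetical stubs for two differently prompted calls to the same model. The contrast with ReAct is that the step list is fixed up front rather than re-decided after every observation.

```python
# Plan-and-execute in miniature: one planner call up front, then each step
# executed on its own, with earlier results available as context.

def plan(task: str) -> list[str]:
    # Real version: one LLM call that returns a numbered step list.
    return ["find population of France", "find population of Spain", "compare"]

def execute_step(step: str, context: dict[str, str]) -> str:
    # Real version: a small tool-use loop scoped to this single step.
    return f"result of: {step}"

def plan_and_execute(task: str) -> dict[str, str]:
    context: dict[str, str] = {}
    for i, step in enumerate(plan(task)):                   # plan once, up front
        context[f"step_{i}"] = execute_step(step, context)  # results accumulate
    return context

print(plan_and_execute("Which has the larger population, France or Spain?"))
```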

Rung 04-Reflection and self-correction

  • What: After an action, the agent critiques its own output and revises. The Reflexion paper formalizes this (sketched after this list).
  • Why it earns its place: Many quality wins on agent tasks come from a critique step, not better tools.
  • Resource: Reflexion paper (arxiv.org/abs/2303.11366). Plus Self-Refine (arxiv.org/abs/2303.17651).
  • Done when: You've added a reflection step to your agent and measured the quality delta with an eval.
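
A single reflect-and-revise pass in miniature; `generate` and `critique` are hypothetical stubs for a generator prompt and a critic prompt. The eval from the Done-when criterion is what tells you whether the extra calls pay for themselves.

```python
# One reflect-and-revise pass. The control flow is the point: critique the
# draft, revise against the feedback, and cap the number of rounds.

def generate(task: str, feedback: str | None = None) -> str:
    draft = f"draft answer for {task!r}"
    return draft + (f" [revised for: {feedback}]" if feedback else "")

def critique(task: str, draft: str) -> str | None:
    # Real version: a critic prompt returning None when the draft passes.
    return None if "revised" in draft else "no concrete example given"

def reflect(task: str, max_rounds: int = 2) -> str:
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:                 # critic satisfied: stop early
            return draft
        draft = generate(task, feedback)     # revise against the critique
    return draft                             # rounds capped; return best effort

print(reflect("explain idempotent retries"))
```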

Rung 05-State management

  • What: Agents have memory: short-term (within a conversation), long-term (persistent across sessions), and tool-result memory. State machines for control flow (sketched after this list).
  • Why it earns its place: State management is where naive agents break. Distributed-systems instincts transfer.
  • Resource: LangGraph docs (search "langgraph"). Even if you don't use LangGraph, its state-machine model is a useful framing.
  • Done when: You can sketch your agent as a state machine and identify where state is persisted.
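
A sketch of the state-machine framing: each node is a function that reads and writes one shared state dict and returns the name of the next node. The node names are illustrative, and this mirrors LangGraph's framing rather than its API.

```python
from typing import Callable

State = dict  # the one object you persist between steps (and across sessions)

def plan_node(state: State) -> str:
    state["steps"], state["i"] = ["step A", "step B"], 0
    return "act"

def act_node(state: State) -> str:
    state[f"result_{state['i']}"] = f"did {state['steps'][state['i']]}"
    state["i"] += 1
    return "act" if state["i"] < len(state["steps"]) else "finish"

NODES: dict[str, Callable[[State], str]] = {"plan": plan_node, "act": act_node}

def run(state: State) -> State:
    node = "plan"
    while node != "finish":
        node = NODES[node](state)  # every transition is explicit and loggable
        # checkpoint `state` here (DB, file) so a crash can resume mid-run
    return state

print(run({}))
```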

Rung 06-Tool design

  • What: Tools are APIs the model uses. Good tools have clear names, focused scope, structured inputs, structured outputs, and error messages the model can act on. A before/after example follows this list.
  • Why it earns its place: Bad tools sink agents. This is an underrated craft.
  • Resource: Anthropic's tool design guide. Plus reading the tool definitions in popular agent frameworks.
  • Done when: You can critique a poorly designed tool and produce a redesigned version.
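
A before/after pair in the JSON-Schema shape most providers accept for tool definitions; the tool itself is hypothetical. Notice how the good version fixes each flaw named above: name, scope, structured input, and actionable errors.

```python
bad_tool = {
    "name": "do_stuff",                      # vague name, unbounded scope
    "description": "does things with orders",
    "parameters": {"type": "object", "properties": {"data": {"type": "string"}}},
}

good_tool = {
    "name": "lookup_order_status",           # one focused capability
    "description": (
        "Look up the shipping status of a single order. "
        "Returns status, carrier, and ETA. "
        "Errors: ORDER_NOT_FOUND if the ID does not exist."  # actionable error
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "e.g. 'ORD-12345'"},
        },
        "required": ["order_id"],             # the model can't omit the key input
    },
}
```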

Rung 07-Multi-agent systems

  • What: Multiple specialized agents coordinated by a router or supervisor (sketched after this list). Examples: AutoGen, CrewAI, OpenAI Swarm.
  • Why it earns its place: A 2024–2026 trend. Sometimes useful, often over-engineered; knowing which is which is the judgment call.
  • Resource: AutoGen docs (microsoft.github.io/autogen). CrewAI docs. Plus the OpenAI Swarm cookbook.
  • Done when: You've built a 2-agent system (e.g., researcher + writer) and can articulate when this beats a single agent.
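
A two-agent sketch matching the researcher + writer example, with a supervisor that picks the next agent from the shared state. All three roles are hypothetical stubs for separately prompted model calls; in a real system the supervisor is itself an LLM call.

```python
def researcher(state: dict) -> dict:
    state["notes"] = "three sourced facts about the topic"   # stub agent
    return state

def writer(state: dict) -> dict:
    state["draft"] = f"article based on: {state['notes']}"   # stub agent
    return state

def supervisor(state: dict) -> str:
    # Real version: an LLM call choosing the next agent from the state.
    if "notes" not in state:
        return "researcher"
    if "draft" not in state:
        return "writer"
    return "DONE"

AGENTS = {"researcher": researcher, "writer": writer}

def run(topic: str) -> dict:
    state = {"topic": topic}
    while (choice := supervisor(state)) != "DONE":
        state = AGENTS[choice](state)   # route one turn to one agent
    return state

print(run("agent reliability"))
```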

Rung 08-Agent benchmarks

  • What: SWE-bench (real GitHub issues), GAIA (general assistant), WebArena (web navigation), τ-bench (customer service), AgentBench.
  • Why it earns its place: Benchmarks ground hand-wavy "agent capability" claims. Submitting to one is a strong public signal.
  • Resource: SWE-bench paper + leaderboard (swebench.com). GAIA paper (arxiv.org/abs/2311.12983).
  • Done when: You've evaluated an agent on at least one public benchmark, even if the score is low.

Rung 09-Agent evaluation rigor

  • What: Trajectory evaluation (was each step correct?), outcome evaluation (was the final answer correct?), tool-call accuracy, and cost per task. A scoring sketch follows this list.
  • Why it earns its place: Most agent demos are cherry-picked. Real eval requires rigor.
  • Resource: Hamel Husain's posts on agent evals. Plus the Inspect AI docs.
  • Done when: You have an eval that scores both trajectory and outcome on a real task set.
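
A minimal scorer covering both dimensions for one task; the trace format (a list of step dicts with a `tool` key) is an assumption for illustration, not a standard.

```python
def score(trace: list[dict], expected_calls: list[str], final: str, gold: str) -> dict:
    calls = [step["tool"] for step in trace if "tool" in step]
    step_acc = sum(a == b for a, b in zip(calls, expected_calls)) / max(len(expected_calls), 1)
    return {
        "trajectory_acc": step_acc,                        # were the steps right?
        "outcome_correct": final.strip() == gold.strip(),  # was the answer right?
        "n_steps": len(trace),                             # proxy for cost per task
    }

trace = [{"tool": "search"}, {"tool": "calculator"}]
print(score(trace, ["search", "calculator"], "42", "42"))
```

Aggregating both numbers over a task set exposes the failure the Why bullet names: agents that reach the right answer through wrong or wasteful steps, and vice versa.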

Rung 10-Failure modes and robustness

  • What: Loops, hallucinated tool calls, runaway costs, prompt injection via tool outputs, infinite recursion. A guardrail sketch follows this list.
  • Why it earns its place: Production agents need guardrails. Distributed-systems thinking (timeouts, circuit breakers, budgets) directly applies.
  • Resource: Simon Willison's prompt injection writing (simonwillison.net). Plus your own observability instincts, applied to agents.
  • Done when: You've added a per-task budget cap, a max-step cap, tool-call timeouts, and a prompt-injection mitigation.
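
A sketch of the first three guardrails from the Done-when list: a per-task budget over steps and dollars, plus a hard timeout on every tool call. The per-call cost is an illustrative flat rate; prompt-injection mitigation is prompt and data-flow work and is not shown here.

```python
import concurrent.futures

_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

class Budget:
    def __init__(self, max_steps: int = 20, max_cost_usd: float = 1.00):
        self.steps, self.cost = 0, 0.0
        self.max_steps, self.max_cost_usd = max_steps, max_cost_usd

    def charge(self, cost_usd: float) -> None:
        self.steps += 1
        self.cost += cost_usd
        if self.steps > self.max_steps or self.cost > self.max_cost_usd:
            raise RuntimeError("budget exceeded: halting run")  # circuit breaker

def call_tool_guarded(fn, kwargs: dict, budget: Budget, timeout_s: float = 10.0):
    budget.charge(cost_usd=0.01)            # count every call against the budget
    future = _POOL.submit(fn, **kwargs)
    # result() raises TimeoutError on a hung tool; the worker thread itself
    # keeps running, so use a subprocess where hard cancellation matters.
    return future.result(timeout=timeout_s)

budget = Budget(max_steps=3)
print(call_tool_guarded(lambda x: x * 2, {"x": 21}, budget))
```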

Rung 11-Agentic systems in production

  • What: Async execution, observability per step, human-in-the-loop checkpoints, audit logs. A tracing sketch follows this list.
  • Why it earns its place: This is where you bring your backend skills home. It's the bridge from prototype to product.
  • Resource: Langfuse / LangSmith agent tracing docs. Plus OpenTelemetry GenAI semantic conventions.
  • Done when: Your agent has full step-by-step traces, an audit log, and a kill switch.
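
A minimal step tracer: every model and tool step becomes a structured, append-only audit record keyed by run ID. The record shape is an assumption; in production you would emit these as OpenTelemetry spans per the GenAI semantic conventions, and hang the kill switch off the same run ID.

```python
import json, time, uuid

class Trace:
    def __init__(self, task: str, path: str = "audit.log"):
        self.run_id = str(uuid.uuid4())   # one ID ties every step to this run
        self.path = path
        self.log("run_start", task=task)

    def log(self, event: str, **fields) -> None:
        record = {"run_id": self.run_id, "ts": time.time(), "event": event, **fields}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")   # append-only audit log

trace = Trace("summarize ticket #123")
trace.log("tool_call", tool="search", args={"q": "ticket 123"})
trace.log("tool_result", tool="search", ok=True)
trace.log("run_end", outcome="success")
```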

Minimum required to leave this sequence

  • Tool-use loop from scratch with 3 tools.
  • ReAct agent implemented from scratch.
  • Reflection step measured against a no-reflection baseline.
  • One multi-agent system built and critiqued.
  • Evaluated an agent on a public benchmark or 50+ task set.
  • Agent has step traces, budget caps, and timeouts.

Going further

  • Anthropic "Building Effective Agents"-re-read every quarter.
  • Lilian Weng "LLM Powered Autonomous Agents" (lilianweng.github.io)-foundational survey.
  • Designing Agentic AI Systems-emerging books in 2025/2026; check current titles.

How this sequence connects to the year

  • Month 6: Rungs 01–04 are the build for month 6's agentic anchor.
  • Q3 Track B: This sequence is your specialty if you pick agents.
  • Q4: Robustness (rung 10) is what makes a capstone agent presentable.
