
11-Agents

Why this matters in the journey

"Agent" is overloaded-it covers everything from a simple tool-use loop to multi-agent research systems. The 2026 frontier is making agents reliable on complex tasks: SWE-bench, GAIA, real customer workflows. Your distributed-systems background is unusually well-matched to agent engineering-agents fail in distributed-systems-shaped ways (timeouts, partial failure, retries, idempotency, consistency).

The rungs

Rung 01-Tool-use loop (the agent baseline)

  • What: Model decides → calls tool → reads result → decides again, until "done." This is the simplest possible agent; a minimal sketch follows this list.
  • Why it earns its place: 80% of "agents" in production are this. Master it before reaching for frameworks.
  • Resource: Anthropic tool use docs (a complete loop example is given). Plus the Anthropic "Building Effective Agents" post.
  • Done when: You can implement a tool-use loop from scratch (no framework) with at least 3 tools.
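
A minimal sketch of the loop, assuming a scripted `call_model` stub in place of a real chat API (the Anthropic docs show the actual wire format, with tool-use content blocks and tool-result messages). The shape is the point: dispatch the requested tool, feed the result back, and cap the number of steps.

```python
import json

# --- tools ----------------------------------------------------------------
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})  # stub tool

TOOLS = {"get_weather": get_weather}

# --- model stub -------------------------------------------------------------
# Hypothetical stand-in for a real chat API: the first call requests a tool,
# the second answers. Replace with your provider's client.
_SCRIPT = [
    {"tool": "get_weather", "args": {"city": "Madrid"}},
    {"final": "It's 21°C in Madrid."},
]

def call_model(messages: list[dict]) -> dict:
    return _SCRIPT[sum(m["role"] == "assistant" for m in messages)]

# --- the loop ---------------------------------------------------------------
def run(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "final" in reply:                      # model decided it's done
            return reply["final"]
        result = TOOLS[reply["tool"]](**reply["args"])    # execute the tool
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")   # never loop forever

print(run("What's the weather in Madrid?"))
```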

Rung 02-ReAct: reasoning + acting

  • What: Interleave "thought" and "action" steps. The model generates the reasoning, which feeds the next action (sketched after this list).
  • Why it earns its place: ReAct is the canonical pattern that started the modern agent era. Every framework is a variation.
  • Resource: ReAct paper (arxiv.org/abs/2210.03629). Plus a from-scratch implementation.
  • Done when: You've implemented a ReAct agent and observed how its reasoning trace differs from a plain tool-use loop.
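
A from-scratch sketch, again assuming a scripted stand-in model. What distinguishes ReAct from the plain loop above is the Thought/Action/Observation text protocol: the model's own reasoning is appended to the transcript and conditions the next action.

```python
import re

def search(query: str) -> str:
    return "Paris is the capital of France."      # stub tool

TOOLS = {"search": search}

# Hypothetical scripted model emitting the ReAct text protocol.
_TURNS = iter([
    "Thought: I should look this up.\nAction: search[capital of France]",
    "Thought: I have what I need.\nFinal Answer: Paris",
])

def call_model(transcript: str) -> str:
    return next(_TURNS)

def react(question: str, max_steps: int = 8) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        turn = call_model(transcript)
        transcript += "\n" + turn
        if "Final Answer:" in turn:
            return turn.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", turn)    # parse Action: tool[input]
        observation = TOOLS[match.group(1)](match.group(2))  # act on it
        transcript += f"\nObservation: {observation}"        # feed the result back
    raise RuntimeError("no answer within step budget")

print(react("What is the capital of France?"))
```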

Rung 03-Planning patterns

  • What: Plan-and-execute (plan first, then execute steps). ReWOO (plan with placeholders, fill in later). Hierarchical task decomposition. A plan-and-execute sketch follows this list.
  • Why it earns its place: Pure ReAct fails on long-horizon tasks; planning patterns address this by separating the plan from step-by-step execution.
  • Resource: Plan-and-Execute paper (search "plan and execute langchain"). ReWOO (arxiv.org/abs/2305.18323).
  • Done when: You can articulate when to choose plan-and-execute over ReAct.
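
A plan-and-execute skeleton, with `plan` and `execute_step` as hypothetical stubs for two differently prompted calls to the same model. The contrast with ReAct is that the step list is fixed up front rather than re-decided after every observation.

```python
# Plan-and-execute in miniature: one planner call up front, then each step
# executed on its own, with earlier results available as context.

def plan(task: str) -> list[str]:
    # Real version: one LLM call that returns a numbered step list.
    return ["find population of France", "find population of Spain", "compare"]

def execute_step(step: str, context: dict[str, str]) -> str:
    # Real version: a small tool-use loop scoped to this single step.
    return f"result of: {step}"

def plan_and_execute(task: str) -> dict[str, str]:
    context: dict[str, str] = {}
    for i, step in enumerate(plan(task)):                   # plan once, up front
        context[f"step_{i}"] = execute_step(step, context)  # results accumulate
    return context

print(plan_and_execute("Which has the larger population, France or Spain?"))
```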

Rung 04-Reflection and self-correction

  • What: After an action, the agent critiques its own output and revises. The Reflexion paper formalizes this (sketched after this list).
  • Why it earns its place: Many quality wins on agent tasks come from a critique step, not better tools.
  • Resource: Reflexion paper (arxiv.org/abs/2303.11366). Plus Self-Refine (arxiv.org/abs/2303.17651).
  • Done when: You've added a reflection step to your agent and measured the quality delta with an eval.
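
A single reflect-and-revise pass in miniature; `generate` and `critique` are hypothetical stubs for a generator prompt and a critic prompt. The eval from the Done-when criterion is what tells you whether the extra calls pay for themselves.

```python
# One reflect-and-revise pass. The control flow is the point: critique the
# draft, revise against the feedback, and cap the number of rounds.

def generate(task: str, feedback: str | None = None) -> str:
    draft = f"draft answer for {task!r}"
    return draft + (f" [revised for: {feedback}]" if feedback else "")

def critique(task: str, draft: str) -> str | None:
    # Real version: a critic prompt returning None when the draft passes.
    return None if "revised" in draft else "no concrete example given"

def reflect(task: str, max_rounds: int = 2) -> str:
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:                 # critic satisfied: stop early
            return draft
        draft = generate(task, feedback)     # revise against the critique
    return draft                             # rounds capped; return best effort

print(reflect("explain idempotent retries"))
```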

Rung 05-State management

  • What: Agents have memory: short-term (within a conversation), long-term (persistent across sessions), and tool-result memory. State machines for control flow (sketched after this list).
  • Why it earns its place: State management is where naive agents break. Distributed-systems instincts transfer.
  • Resource: LangGraph docs (search "langgraph"). Even if you don't use LangGraph, its state-machine model is a useful framing.
  • Done when: You can sketch your agent as a state machine and identify where state is persisted.
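
A sketch of the state-machine framing: each node is a function that reads and writes one shared state dict and returns the name of the next node. The node names are illustrative, and this mirrors LangGraph's framing rather than its API.

```python
from typing import Callable

State = dict  # the one object you persist between steps (and across sessions)

def plan_node(state: State) -> str:
    state["steps"], state["i"] = ["step A", "step B"], 0
    return "act"

def act_node(state: State) -> str:
    state[f"result_{state['i']}"] = f"did {state['steps'][state['i']]}"
    state["i"] += 1
    return "act" if state["i"] < len(state["steps"]) else "finish"

NODES: dict[str, Callable[[State], str]] = {"plan": plan_node, "act": act_node}

def run(state: State) -> State:
    node = "plan"
    while node != "finish":
        node = NODES[node](state)  # every transition is explicit and loggable
        # checkpoint `state` here (DB, file) so a crash can resume mid-run
    return state

print(run({}))
```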

Rung 06-Tool design

  • What: Tools are APIs the model uses. Good tools have clear names, focused scope, structured inputs, structured outputs, and error messages the model can act on. A before/after example follows this list.
  • Why it earns its place: Bad tools sink agents. This is an underrated craft.
  • Resource: Anthropic's tool design guide. Plus reading the tool definitions in popular agent frameworks.
  • Done when: You can critique a poorly designed tool and produce a redesigned version.
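
A before/after pair in the JSON-Schema shape most providers accept for tool definitions; the tool itself is hypothetical. Notice how the good version fixes each flaw named above: name, scope, structured input, and actionable errors.

```python
bad_tool = {
    "name": "do_stuff",                      # vague name, unbounded scope
    "description": "does things with orders",
    "parameters": {"type": "object", "properties": {"data": {"type": "string"}}},
}

good_tool = {
    "name": "lookup_order_status",           # one focused capability
    "description": (
        "Look up the shipping status of a single order. "
        "Returns status, carrier, and ETA. "
        "Errors: ORDER_NOT_FOUND if the ID does not exist."  # actionable error
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "e.g. 'ORD-12345'"},
        },
        "required": ["order_id"],             # the model can't omit the key input
    },
}
```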

Rung 07-Multi-agent systems

  • What: Multiple specialized agents coordinated by a router or supervisor (sketched after this list). Examples: AutoGen, CrewAI, OpenAI Swarm.
  • Why it earns its place: A 2024–2026 trend. Sometimes useful, often over-engineered; knowing which is which is the judgment call.
  • Resource: AutoGen docs (microsoft.github.io/autogen). CrewAI docs. Plus the OpenAI Swarm cookbook.
  • Done when: You've built a 2-agent system (e.g., researcher + writer) and can articulate when this beats a single agent.
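
A two-agent sketch matching the researcher + writer example, with a supervisor that picks the next agent from the shared state. All three roles are hypothetical stubs for separately prompted model calls; in a real system the supervisor is itself an LLM call.

```python
def researcher(state: dict) -> dict:
    state["notes"] = "three sourced facts about the topic"   # stub agent
    return state

def writer(state: dict) -> dict:
    state["draft"] = f"article based on: {state['notes']}"   # stub agent
    return state

def supervisor(state: dict) -> str:
    # Real version: an LLM call choosing the next agent from the state.
    if "notes" not in state:
        return "researcher"
    if "draft" not in state:
        return "writer"
    return "DONE"

AGENTS = {"researcher": researcher, "writer": writer}

def run(topic: str) -> dict:
    state = {"topic": topic}
    while (choice := supervisor(state)) != "DONE":
        state = AGENTS[choice](state)   # route one turn to one agent
    return state

print(run("agent reliability"))
```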

Rung 08-Agent benchmarks

  • What: SWE-bench (real GitHub issues), GAIA (general assistant), WebArena (web navigation), τ-bench (customer service), AgentBench.
  • Why it earns its place: Benchmarks ground hand-wavy "agent capability" claims. Submitting to one is a strong public signal.
  • Resource: SWE-bench paper + leaderboard (swebench.com). GAIA paper (arxiv.org/abs/2311.12983).
  • Done when: You've evaluated an agent on at least one public benchmark, even if the score is low.

Rung 09-Agent evaluation rigor

  • What: Trajectory evaluation (was each step correct?), outcome evaluation (was the final answer correct?), tool-call accuracy, and cost per task. A scoring sketch follows this list.
  • Why it earns its place: Most agent demos are cherry-picked. Real eval requires rigor.
  • Resource: Hamel Husain's posts on agent evals. Plus the Inspect AI docs.
  • Done when: You have an eval that scores both trajectory and outcome on a real task set.
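
A minimal scorer covering both dimensions for one task; the trace format (a list of step dicts with a `tool` key) is an assumption for illustration, not a standard.

```python
def score(trace: list[dict], expected_calls: list[str], final: str, gold: str) -> dict:
    calls = [step["tool"] for step in trace if "tool" in step]
    step_acc = sum(a == b for a, b in zip(calls, expected_calls)) / max(len(expected_calls), 1)
    return {
        "trajectory_acc": step_acc,                        # were the steps right?
        "outcome_correct": final.strip() == gold.strip(),  # was the answer right?
        "n_steps": len(trace),                             # proxy for cost per task
    }

trace = [{"tool": "search"}, {"tool": "calculator"}]
print(score(trace, ["search", "calculator"], "42", "42"))
```

Aggregating both numbers over a task set exposes the failure the Why bullet names: agents that reach the right answer through wrong or wasteful steps, and vice versa.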

Rung 10-Failure modes and robustness

  • What: Loops, hallucinated tool calls, runaway costs, prompt injection via tool outputs, infinite recursion. A guardrail sketch follows this list.
  • Why it earns its place: Production agents need guardrails. Distributed-systems thinking (timeouts, circuit breakers, budgets) directly applies.
  • Resource: Simon Willison's prompt injection writing (simonwillison.net). Plus your own observability instincts, applied to agents.
  • Done when: You've added a per-task budget cap, a max-step cap, tool-call timeouts, and a prompt-injection mitigation.
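
A sketch of the first three guardrails from the Done-when list: a per-task budget over steps and dollars, plus a hard timeout on every tool call. The per-call cost is an illustrative flat rate; prompt-injection mitigation is prompt and data-flow work and is not shown here.

```python
import concurrent.futures

_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)

class Budget:
    def __init__(self, max_steps: int = 20, max_cost_usd: float = 1.00):
        self.steps, self.cost = 0, 0.0
        self.max_steps, self.max_cost_usd = max_steps, max_cost_usd

    def charge(self, cost_usd: float) -> None:
        self.steps += 1
        self.cost += cost_usd
        if self.steps > self.max_steps or self.cost > self.max_cost_usd:
            raise RuntimeError("budget exceeded: halting run")  # circuit breaker

def call_tool_guarded(fn, kwargs: dict, budget: Budget, timeout_s: float = 10.0):
    budget.charge(cost_usd=0.01)            # count every call against the budget
    future = _POOL.submit(fn, **kwargs)
    # result() raises TimeoutError on a hung tool; the worker thread itself
    # keeps running, so use a subprocess where hard cancellation matters.
    return future.result(timeout=timeout_s)

budget = Budget(max_steps=3)
print(call_tool_guarded(lambda x: x * 2, {"x": 21}, budget))
```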

Rung 11-Agentic systems in production

  • What: Async execution, observability per step, human-in-the-loop checkpoints, audit logs. A tracing sketch follows this list.
  • Why it earns its place: This is where you bring your backend skills home. It's the bridge from prototype to product.
  • Resource: Langfuse / LangSmith agent tracing docs. Plus OpenTelemetry GenAI semantic conventions.
  • Done when: Your agent has full step-by-step traces, an audit log, and a kill switch.
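
A minimal step tracer: every model and tool step becomes a structured, append-only audit record keyed by run ID. The record shape is an assumption; in production you would emit these as OpenTelemetry spans per the GenAI semantic conventions, and hang the kill switch off the same run ID.

```python
import json, time, uuid

class Trace:
    def __init__(self, task: str, path: str = "audit.log"):
        self.run_id = str(uuid.uuid4())   # one ID ties every step to this run
        self.path = path
        self.log("run_start", task=task)

    def log(self, event: str, **fields) -> None:
        record = {"run_id": self.run_id, "ts": time.time(), "event": event, **fields}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")   # append-only audit log

trace = Trace("summarize ticket #123")
trace.log("tool_call", tool="search", args={"q": "ticket 123"})
trace.log("tool_result", tool="search", ok=True)
trace.log("run_end", outcome="success")
```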

Minimum required to leave this sequence

  • Tool-use loop from scratch with 3 tools.
  • ReAct agent implemented from scratch.
  • Reflection step measured against a no-reflection baseline.
  • One multi-agent system built and critiqued.
  • Evaluated an agent on a public benchmark or 50+ task set.
  • Agent has step traces, budget caps, and timeouts.

Going further

  • Anthropic "Building Effective Agents"-re-read every quarter.
  • Lilian Weng "LLM Powered Autonomous Agents" (lilianweng.github.io)-foundational survey.
  • Designing Agentic AI Systems-emerging books in 2025/2026; check current titles.

How this sequence connects to the year

  • Month 6: Rungs 01–04 are the build for month 6's agentic anchor.
  • Q3 Track B: This sequence is your specialty if you pick agents.
  • Q4: Robustness (rung 10) is what makes a capstone agent presentable.
