Month 6, Week 1: Agent foundations - tool-use loop and ReAct¶
Week summary¶
- Goal: Build agents on top of your project. Implement a from-scratch tool-use loop with 5+ tools and budget caps. Implement ReAct on top. Compare to your simpler RAG-only system from M05.
- Time: ~9 h over 3 sessions.
- Output: Multi-step agent with 5+ tools; ReAct version; honest comparison vs RAG-only on the 30-query eval.
- Sequences relied on: 11-agents rungs 01, 02.
Why this week matters¶
"Agents" is overloaded-it covers everything from a 5-line tool-use loop to research-grade multi-agent systems. The 2026 frontier is making agents reliable on complex tasks-SWE-bench, GAIA, real customer workflows. Your distributed-systems background is unusually well-matched: agents fail in distributed-systems-shaped ways (timeouts, partial failure, state, retries, idempotency). This week begins the agent arc that culminates in your Q3 specialty decision.
Equally important: this week teaches you to be honest about whether the agent helps. Many teams over-engineer agentic systems where a simpler pipeline would do. Measuring against the simpler baseline is the discipline that wins.
Prerequisites¶
- M05 complete.
- Tool-use mechanics from M04-W02.
- RAG pipeline working from M05.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): foundations + read + design
- Session B-Sat morning (~3.5 h): ReAct implementation
- Session C-Sun afternoon (~2.5 h): RAG vs agent comparison + write up
Session A-Foundations: read + design¶
Goal: Internalize agent patterns. Design 5+ tools for your project. Begin the loop.
Part 1-Read deeply (75 min)¶
Anthropic Building Effective Agents: anthropic.com/engineering/building-effective-agents. Read twice.
Distinguish:
- Workflows (predefined steps) vs agents (the model decides the flow).
- Augmented LLM (a single call with tools) vs agent (a loop).
- Routing, chaining, parallelization, and orchestrator-workers patterns.
ReAct paper: arxiv.org/abs/2210.03629. Read sections 1–3. Key insight: interleaving "thought" and "action" steps materially improves reasoning on multi-step tasks.
Part 2-Tool inventory + design (60 min)¶
Your M04-W02 loop had 3 tools. Scale to 5+ for the agent:
For incident triage:
1. query_metrics(service, metric, time_range_minutes) - existing.
2. get_recent_deploys(service, since_minutes) - existing.
3. query_logs(service, query, limit) - existing.
4. get_dependency_graph(service) - what services this one depends on and what depends on it.
5. get_runbook(failure_type) - fetch a known runbook.
6. check_alerts(service, time_range_minutes) - recent alerts on or related to the service.
Tool design principles (from Anthropic's guide):
- Clear and focused: one tool, one concern.
- Structured I/O: parse-able outputs.
- Helpful errors: the model can recover from "service not found."
- Description quality: the description is the prompt to the model, so be specific.
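As a concrete illustration, here is a sketch of one tool definition in the Anthropic Messages API tool schema. The description wording, returned fields, and default values are assumptions for your incident-triage setup, not prescribed by the guide; the point is the specificity.
# Illustrative tool definition (Anthropic Messages API tool schema).
# The description is what the model actually reads, so it states units,
# defaults, return shape, and the error it can hit.
GET_RECENT_DEPLOYS_TOOL = {
    "name": "get_recent_deploys",
    "description": (
        "Return deploys for a service within the last `since_minutes` minutes, "
        "newest first. Each entry includes commit SHA, author, timestamp, and "
        "rollout status. Returns an empty list if there were no deploys. "
        "Errors with 'service not found' if the service name is unknown."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "service": {"type": "string", "description": "Service name, e.g. 'checkout-api'."},
            "since_minutes": {"type": "integer", "description": "Look-back window in minutes.", "default": 60},
        },
        "required": ["service"],
    },
}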
Part 3-Loop scale-up (45 min)¶
Modify your M04-W02 loop:
- max_steps = 12 (instead of 8).
- Per-task budget cap (USD)-fail fast if cost runs away.
- Per-tool timeout-don't let a slow tool block forever.
- State accumulation: keep tool results addressable by step index for debugging.
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception): ...
class TimeoutExceeded(Exception): ...

@dataclass
class AgentState:
    messages: list[dict]
    tool_calls: list[dict]
    cost_so_far: float = 0.0
    started_at: float = field(default_factory=time.time)

def run_agent(state: AgentState, max_steps=12, budget_usd=0.50, step_timeout=30):
    for step in range(max_steps):
        if state.cost_so_far > budget_usd:
            raise BudgetExceeded()
        if time.time() - state.started_at > 300:  # hard wall-clock cap for the whole task
            raise TimeoutExceeded()
        ...  # LLM call + tool dispatch (each tool call guarded by step_timeout), as in M04-W02
Output of Session A¶
- Tool inventory + descriptions documented.
- Loop with budget + timeout caps.
Session B-ReAct implementation¶
Goal: Implement ReAct (interleaved thought + action). Compare to vanilla tool-use.
Part 1-ReAct prompt design (45 min)¶
ReAct asks the model to produce explicit reasoning between actions. Key prompt structure:
REACT_SYSTEM = """You are a senior on-call SRE solving an incident. You can use tools to investigate.
For each step, output:
1. **Thought**: what do I know? what do I need to find out next? what's my hypothesis?
2. **Action**: which tool to use, with what arguments-or "Final Answer".
3. After tools return, update Thought before next Action.
Continue until you can give a Final Answer with high confidence. Don't fabricate; if you can't find evidence, say so.
"""
Some implementations enforce thought-before-action via prompting; others use a stricter scaffold. Start with prompting.
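If you later want the stricter scaffold, one hedged sketch (assuming the Anthropic SDK's content-block response shape, where resp.content is a list of blocks typed "text" or "tool_use") is to reject assistant turns that jump straight to a tool call without a visible thought:
def violates_react_scaffold(content_blocks) -> bool:
    """True if the turn contains tool calls but no Thought text before them."""
    saw_thought = False
    for block in content_blocks:
        if block.type == "text" and block.text.strip():
            saw_thought = True
        if block.type == "tool_use" and not saw_thought:
            return True
    return False

# In the loop: if violates_react_scaffold(resp.content), append a user message
# reminding the model to state its Thought first, then retry the step.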
Part 2-Implement (105 min)¶
import anthropic

client = anthropic.Anthropic()  # same client as the M04-W02 loop

def run_react(initial: str, max_steps=12) -> AgentState:
    state = AgentState(messages=[{"role": "user", "content": initial}], tool_calls=[])
    for step in range(max_steps):
        if state.cost_so_far > 0.50:  # budget cap, as in Session A
            break
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=2048,
            tools=TOOLS,
            system=REACT_SYSTEM,
            messages=state.messages,
        )
        state.cost_so_far += cost_of(resp.usage)  # pricing helper (sketched below)
        state.messages.append({"role": "assistant", "content": resp.content})
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return state  # no tool call means the model gave its Final Answer
        results = []
        for tu in tool_uses:
            try:
                with timeout(30):  # per-tool timeout helper (sketched below)
                    result = TOOL_REGISTRY[tu.name](**tu.input)
                state.tool_calls.append({"step": step, "tool": tu.name, "input": tu.input,
                                         "output": str(result)[:1000]})
                results.append({"type": "tool_result", "tool_use_id": tu.id, "content": str(result)})
            except Exception as e:
                results.append({"type": "tool_result", "tool_use_id": tu.id,
                                "content": f"ERROR: {e}", "is_error": True})
        state.messages.append({"role": "user", "content": results})
    return state
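The loop references helpers it doesn't define: TOOLS (the tool schema list), TOOL_REGISTRY (tool name → Python function), cost_of, and timeout. A minimal sketch of the last two follows; the per-million-token prices are placeholders you must replace with your model's actual rates, and the signal-based timeout is Unix-only and main-thread-only.
import signal
from contextlib import contextmanager

# Placeholder prices (USD per million tokens); substitute your model's real rates.
INPUT_PRICE_PER_MTOK = 15.0
OUTPUT_PRICE_PER_MTOK = 75.0

def cost_of(usage) -> float:
    """Rough cost of one API call from the response's usage block."""
    return (usage.input_tokens * INPUT_PRICE_PER_MTOK
            + usage.output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

@contextmanager
def timeout(seconds: int):
    """Per-tool timeout via SIGALRM; swap in concurrent.futures with a timeout
    if your tools run off the main thread or on non-Unix platforms."""
    def _raise(signum, frame):
        raise TimeoutError(f"tool call exceeded {seconds}s")
    old = signal.signal(signal.SIGALRM, _raise)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)

# TOOL_REGISTRY maps tool names to the Python functions that implement them, e.g.
# TOOL_REGISTRY = {"query_metrics": query_metrics, "get_recent_deploys": get_recent_deploys, ...}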
Part 3-Run + observe (30 min)¶
Run on 5 incidents. Print the trajectories. Notice:
- Does the model produce useful reasoning between actions?
- Does it call tools sequentially, or in parallel where appropriate?
- Are there steps that look wasteful?
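A small helper makes eyeballing trajectories much easier; this sketch only prints what AgentState already captures:
def print_trajectory(state: AgentState) -> None:
    """Dump the tool-call trajectory in step order for manual inspection."""
    for call in state.tool_calls:
        print(f"step {call['step']:>2} | {call['tool']}({call['input']})")
        print(f"        -> {call['output'][:200]}")
    print(f"total: {len(state.tool_calls)} tool calls, ${state.cost_so_far:.3f}")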
Output of Session B¶
- ReAct loop implemented.
- 5 sample trajectories captured.
Session C-RAG-only vs agent comparison + write up¶
Goal: Run both on the same 30 incidents. Compare with proper metrics. Write honestly.
Part 1-Define agent metrics (45 min)¶
Agents need richer eval than single-call systems:
- Outcome accuracy: did the final answer match expected fields? (Heuristic + judge from M04.)
- Trajectory accuracy: were the tool calls reasonable? (Manual or LLM judge per step.)
- Step count: how many steps to reach answer?
- Total cost USD per task.
- Total wall-clock latency.
For trajectory eval, sample 5 from your 30 and label by hand: each tool call ✓ if reasonable, ✗ if wasteful or wrong.
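One hedged way to turn those hand labels into a number: record a per-call verdict per case and report the fraction of reasonable calls. The label format here is an assumption, not a standard.
# Hand labels for sampled cases: True = reasonable call, False = wasteful or wrong.
# labels = {"INC-042": [True, True, False, True], ...}

def trajectory_accuracy(labels: dict[str, list[bool]]) -> float:
    """Fraction of tool calls judged reasonable across the hand-labeled sample."""
    verdicts = [v for calls in labels.values() for v in calls]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0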
Part 2-Run both, capture metrics (90 min)¶
results_rag = []
results_agent = []

for case in cases:
    # RAG-only (single call with retrieved context)
    rag_out, ctx = rag_answer(case["input"])
    results_rag.append({"id": case["id"], "answer": rag_out, "context": ctx,
                        "cost": ..., "latency_ms": ...})

    # Agent
    agent_state = run_react(case["input"])
    results_agent.append({"id": case["id"], "trajectory": agent_state.tool_calls,
                          "answer": agent_state.messages[-1], "cost": agent_state.cost_so_far,
                          "latency_ms": ..., "n_steps": len(agent_state.tool_calls)})

# Score both with your M04 heuristic + judge
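A short aggregation pass over those result lists produces the comparison table below. The "passed" field name is an assumption; use whatever your M04 heuristic scorer actually emits.
from statistics import mean

def summarize(results: list[dict]) -> dict:
    """Aggregate per-case results into one row of the comparison table."""
    return {
        "pass_rate": mean(r["passed"] for r in results),        # assumed field from the M04 scorer
        "mean_cost_usd": mean(r["cost"] for r in results),
        "mean_latency_s": mean(r["latency_ms"] for r in results) / 1000,
        "mean_steps": mean(r.get("n_steps", 1) for r in results),
    }

print(summarize(results_rag))
print(summarize(results_agent))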
Aggregate:

| Metric | RAG-only | Agent |
|---|---|---|
| Pass rate (heuristic) | 0.78 | 0.82 |
| Faithfulness (judge) | 4.1 | 4.3 |
| Mean cost (USD) | $0.018 | $0.087 |
| Mean latency | 1.2 s | 8.4 s |
| Mean steps | 1 | 4.7 |
Part 3-Honest write-up (30 min)¶
In your repo, write agent_vs_rag.md:
The agent gained ~4 percentage points on outcome accuracy and 0.2 points on faithfulness. Cost is ~5× higher and latency ~7× higher. Verdict: for incidents where retrieval suffices, RAG-only wins on every dimension except quality. For ambiguous incidents requiring multi-source synthesis, the agent earns its cost. Use a router: simple incidents → RAG; complex → agent.
This kind of honest tradeoff analysis is what senior engineers produce. It's also a great post topic.
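The router verdict is easy to prototype. A hedged sketch: a crude keyword heuristic (or a cheap classifier call) picks the path; the complexity signals here are assumptions you would tune against your own incidents.
def looks_complex(incident: str) -> bool:
    """Crude complexity heuristic; replace with a small classifier call if needed."""
    signals = ["multiple services", "intermittent", "no obvious deploy", "cascading"]
    return any(s in incident.lower() for s in signals)

def answer_incident(incident: str):
    """Route simple incidents to RAG-only, complex ones to the ReAct agent."""
    if looks_complex(incident):
        state = run_react(incident)
        return state.messages[-1]      # the agent's final answer turn
    answer, _ctx = rag_answer(incident)
    return answer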
Output of Session C¶
- 30-query comparison RAG-only vs ReAct agent.
- Honest tradeoff write-up.
End-of-week artifact¶
- 5+ tools defined and implemented
- Tool-use loop with budget + timeout caps
- ReAct agent working with sample trajectories
- 30-query comparison RAG-only vs agent with all metrics
- Tradeoff analysis committed
End-of-week self-assessment¶
- I can write a tool-use loop from a blank file.
- I can articulate when an agent earns its cost vs when RAG suffices.
- My agent has guardrails (budget, steps, timeout)-not unbounded.
Common failure modes for this week¶
- No budget cap. Agents can blow $10 on a single task. Always cap.
- Treating "agent built" as "agent better." Compare honestly to the simpler baseline.
- Vague tool descriptions. Tools are prompts to the model; specific descriptions improve everything.
What's next (preview of M06-W02)¶
Reflection (self-critique), state management, Langfuse / LangSmith observability. Production-grade agent infrastructure.