Month 6, Week 1: Agent foundations - tool-use loop and ReAct¶
Week summary¶
- Goal: Build agents on top of your project. Implement a from-scratch tool-use loop with 5+ tools and budget caps. Implement ReAct on top. Compare to your simpler RAG-only system from M05.
- Time: ~9 h over 3 sessions.
- Output: Multi-step agent with 5+ tools; ReAct version; honest comparison vs RAG-only on the 30-query eval.
- Sequences relied on: 11-agents rungs 01, 02.
Why this week matters¶
"Agents" is overloaded-it covers everything from a 5-line tool-use loop to research-grade multi-agent systems. The 2026 frontier is making agents reliable on complex tasks-SWE-bench, GAIA, real customer workflows. Your distributed-systems background is unusually well-matched: agents fail in distributed-systems-shaped ways (timeouts, partial failure, state, retries, idempotency). This week begins the agent arc that culminates in your Q3 specialty decision.
Equally important: this week teaches you to be honest about whether the agent helps. Many teams over-engineer agentic systems where a simpler pipeline would do. Measuring against the simpler baseline is the discipline that wins.
Prerequisites¶
- M05 complete.
- Tool-use mechanics from M04-W02.
- RAG pipeline working from M05.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): foundations + read + design
- Session B-Sat morning (~3.5 h): ReAct implementation
- Session C-Sun afternoon (~2.5 h): RAG vs agent comparison + write up
Session A-Foundations: read + design¶
Goal: Internalize agent patterns. Design 5+ tools for your project. Begin the loop.
Part 1-Read deeply (75 min)¶
Anthropic Building Effective Agents: anthropic.com/engineering/building-effective-agents. Read twice.
Distinguish:
- Workflows (predefined steps) vs agents (the model decides the flow).
- Augmented LLM (a single call with tools) vs agent (a loop).
- Routing, chaining, parallelization, and orchestrator-workers patterns.
ReAct paper: arxiv.org/abs/2210.03629. Read sections 1–3. Key insight: interleaving "thought" and "action" steps materially improves reasoning on multi-step tasks.
Part 2-Tool inventory + design (60 min)¶
Your M04-W02 loop had 3 tools. Scale to 5+ for the agent:
For incident triage:
1. query_metrics(service, metric, time_range_minutes) - existing.
2. get_recent_deploys(service, since_minutes) - existing.
3. query_logs(service, query, limit) - existing.
4. get_dependency_graph(service) - what services this one depends on and what depends on it.
5. get_runbook(failure_type) - fetch a known runbook.
6. check_alerts(service, time_range_minutes) - recent alerts on or related to the service.
Tool design principles (from Anthropic's guide):
- Clear and focused: one tool, one concern.
- Structured I/O: parse-able outputs.
- Helpful errors: the model can recover from "service not found."
- Description quality: the description is the prompt to the model, so be specific.
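As a concrete illustration, here is a sketch of one tool definition in the Anthropic Messages API tool schema. The description wording, returned fields, and default values are assumptions for your incident-triage setup, not prescribed by the guide; the point is the specificity.
# Illustrative tool definition (Anthropic Messages API tool schema).
# The description is what the model actually reads, so it states units,
# defaults, return shape, and the error it can hit.
GET_RECENT_DEPLOYS_TOOL = {
    "name": "get_recent_deploys",
    "description": (
        "Return deploys for a service within the last `since_minutes` minutes, "
        "newest first. Each entry includes commit SHA, author, timestamp, and "
        "rollout status. Returns an empty list if there were no deploys. "
        "Errors with 'service not found' if the service name is unknown."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "service": {"type": "string", "description": "Service name, e.g. 'checkout-api'."},
            "since_minutes": {"type": "integer", "description": "Look-back window in minutes.", "default": 60},
        },
        "required": ["service"],
    },
}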
Part 3-Loop scale-up (45 min)¶
Modify your M04-W02 loop:
- max_steps = 12 (instead of 8).
- Per-task budget cap (USD)-fail fast if cost runs away.
- Per-tool timeout-don't let a slow tool block forever.
- State accumulation: keep tool results addressable by step index for debugging.
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception): ...
class TimeoutExceeded(Exception): ...

@dataclass
class AgentState:
    messages: list[dict]
    tool_calls: list[dict]
    cost_so_far: float = 0.0
    started_at: float = field(default_factory=time.time)

def run_agent(state: AgentState, max_steps=12, budget_usd=0.50, step_timeout=30):
    for step in range(max_steps):
        if state.cost_so_far > budget_usd:
            raise BudgetExceeded()
        if time.time() - state.started_at > 300:  # hard wall-clock cap for the whole task
            raise TimeoutExceeded()
        ...  # LLM call + tool dispatch (each tool call guarded by step_timeout), as in M04-W02
Output of Session A¶
- Tool inventory + descriptions documented.
- Loop with budget + timeout caps.
Session B-ReAct implementation¶
Goal: Implement ReAct (interleaved thought + action). Compare to vanilla tool-use.
Part 1-ReAct prompt design (45 min)¶
ReAct asks the model to produce explicit reasoning between actions. Key prompt structure:
REACT_SYSTEM = """You are a senior on-call SRE solving an incident. You can use tools to investigate.
For each step, output:
1. **Thought**: what do I know? what do I need to find out next? what's my hypothesis?
2. **Action**: which tool to use, with what arguments-or "Final Answer".
3. After tools return, update Thought before next Action.
Continue until you can give a Final Answer with high confidence. Don't fabricate; if you can't find evidence, say so.
"""
Some implementations enforce thought-before-action via prompting; others use a stricter scaffold. Start with prompting.
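If you later want the stricter scaffold, one hedged sketch (assuming the Anthropic SDK's content-block response shape, where resp.content is a list of blocks typed "text" or "tool_use") is to reject assistant turns that jump straight to a tool call without a visible thought:
def violates_react_scaffold(content_blocks) -> bool:
    """True if the turn contains tool calls but no Thought text before them."""
    saw_thought = False
    for block in content_blocks:
        if block.type == "text" and block.text.strip():
            saw_thought = True
        if block.type == "tool_use" and not saw_thought:
            return True
    return False

# In the loop: if violates_react_scaffold(resp.content), append a user message
# reminding the model to state its Thought first, then retry the step.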
Part 2-Implement (105 min)¶
import anthropic

client = anthropic.Anthropic()  # same client as the M04-W02 loop

def run_react(initial: str, max_steps=12) -> AgentState:
    state = AgentState(messages=[{"role": "user", "content": initial}], tool_calls=[])
    for step in range(max_steps):
        if state.cost_so_far > 0.50:  # budget cap, as in Session A
            break
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=2048,
            tools=TOOLS,
            system=REACT_SYSTEM,
            messages=state.messages,
        )
        state.cost_so_far += cost_of(resp.usage)  # pricing helper (sketched below)
        state.messages.append({"role": "assistant", "content": resp.content})
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return state  # no tool call means the model gave its Final Answer
        results = []
        for tu in tool_uses:
            try:
                with timeout(30):  # per-tool timeout helper (sketched below)
                    result = TOOL_REGISTRY[tu.name](**tu.input)
                state.tool_calls.append({"step": step, "tool": tu.name, "input": tu.input,
                                         "output": str(result)[:1000]})
                results.append({"type": "tool_result", "tool_use_id": tu.id, "content": str(result)})
            except Exception as e:
                results.append({"type": "tool_result", "tool_use_id": tu.id,
                                "content": f"ERROR: {e}", "is_error": True})
        state.messages.append({"role": "user", "content": results})
    return state
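The loop references helpers it doesn't define: TOOLS (the tool schema list), TOOL_REGISTRY (tool name → Python function), cost_of, and timeout. A minimal sketch of the last two follows; the per-million-token prices are placeholders you must replace with your model's actual rates, and the signal-based timeout is Unix-only and main-thread-only.
import signal
from contextlib import contextmanager

# Placeholder prices (USD per million tokens); substitute your model's real rates.
INPUT_PRICE_PER_MTOK = 15.0
OUTPUT_PRICE_PER_MTOK = 75.0

def cost_of(usage) -> float:
    """Rough cost of one API call from the response's usage block."""
    return (usage.input_tokens * INPUT_PRICE_PER_MTOK
            + usage.output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

@contextmanager
def timeout(seconds: int):
    """Per-tool timeout via SIGALRM; swap in concurrent.futures with a timeout
    if your tools run off the main thread or on non-Unix platforms."""
    def _raise(signum, frame):
        raise TimeoutError(f"tool call exceeded {seconds}s")
    old = signal.signal(signal.SIGALRM, _raise)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)

# TOOL_REGISTRY maps tool names to the Python functions that implement them, e.g.
# TOOL_REGISTRY = {"query_metrics": query_metrics, "get_recent_deploys": get_recent_deploys, ...}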
Part 3-Run + observe (30 min)¶
Run on 5 incidents. Print the trajectories. Notice:
- Does the model produce useful reasoning between actions?
- Does it call tools sequentially, or in parallel where appropriate?
- Are there steps that look wasteful?
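A small helper makes eyeballing trajectories much easier; this sketch only prints what AgentState already captures:
def print_trajectory(state: AgentState) -> None:
    """Dump the tool-call trajectory in step order for manual inspection."""
    for call in state.tool_calls:
        print(f"step {call['step']:>2} | {call['tool']}({call['input']})")
        print(f"        -> {call['output'][:200]}")
    print(f"total: {len(state.tool_calls)} tool calls, ${state.cost_so_far:.3f}")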
Output of Session B¶
- ReAct loop implemented.
- 5 sample trajectories captured.
Session C-RAG-only vs agent comparison + write up¶
Goal: Run both on the same 30 incidents. Compare with proper metrics. Write honestly.
Part 1-Define agent metrics (45 min)¶
Agents need richer eval than single-call systems:
- Outcome accuracy: did the final answer match expected fields? (Heuristic + judge from M04.)
- Trajectory accuracy: were the tool calls reasonable? (Manual or LLM judge per step.)
- Step count: how many steps to reach answer?
- Total cost USD per task.
- Total wall-clock latency.
For trajectory eval, sample 5 from your 30 and label by hand: each tool call ✓ if reasonable, ✗ if wasteful or wrong.
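One hedged way to turn those hand labels into a number: record a per-call verdict per case and report the fraction of reasonable calls. The label format here is an assumption, not a standard.
# Hand labels for sampled cases: True = reasonable call, False = wasteful or wrong.
# labels = {"INC-042": [True, True, False, True], ...}

def trajectory_accuracy(labels: dict[str, list[bool]]) -> float:
    """Fraction of tool calls judged reasonable across the hand-labeled sample."""
    verdicts = [v for calls in labels.values() for v in calls]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0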
Part 2-Run both, capture metrics (90 min)¶
results_rag = []
results_agent = []

for case in cases:
    # RAG-only (single call with retrieved context)
    rag_out, ctx = rag_answer(case["input"])
    results_rag.append({"id": case["id"], "answer": rag_out, "context": ctx,
                        "cost": ..., "latency_ms": ...})

    # Agent
    agent_state = run_react(case["input"])
    results_agent.append({"id": case["id"], "trajectory": agent_state.tool_calls,
                          "answer": agent_state.messages[-1], "cost": agent_state.cost_so_far,
                          "latency_ms": ..., "n_steps": len(agent_state.tool_calls)})

# Score both with your M04 heuristic + judge
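A short aggregation pass over those result lists produces the comparison table below. The "passed" field name is an assumption; use whatever your M04 heuristic scorer actually emits.
from statistics import mean

def summarize(results: list[dict]) -> dict:
    """Aggregate per-case results into one row of the comparison table."""
    return {
        "pass_rate": mean(r["passed"] for r in results),        # assumed field from the M04 scorer
        "mean_cost_usd": mean(r["cost"] for r in results),
        "mean_latency_s": mean(r["latency_ms"] for r in results) / 1000,
        "mean_steps": mean(r.get("n_steps", 1) for r in results),
    }

print(summarize(results_rag))
print(summarize(results_agent))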
Aggregate:

| Metric | RAG-only | Agent |
|---|---|---|
| Pass rate (heuristic) | 0.78 | 0.82 |
| Faithfulness (judge) | 4.1 | 4.3 |
| Mean cost (USD) | $0.018 | $0.087 |
| Mean latency | 1.2 s | 8.4 s |
| Mean steps | 1 | 4.7 |
Part 3-Honest write-up (30 min)¶
In your repo, write agent_vs_rag.md:
The agent gained ~4 percentage points on outcome accuracy and 0.2 points on faithfulness. Cost is ~5× higher and latency ~7× higher. Verdict: for incidents where retrieval suffices, RAG-only wins on every dimension except quality. For ambiguous incidents requiring multi-source synthesis, the agent earns its cost. Use a router: simple incidents → RAG; complex → agent.
This kind of honest tradeoff analysis is what senior engineers produce. It's also a great post topic.
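The router verdict is easy to prototype. A hedged sketch: a crude keyword heuristic (or a cheap classifier call) picks the path; the complexity signals here are assumptions you would tune against your own incidents.
def looks_complex(incident: str) -> bool:
    """Crude complexity heuristic; replace with a small classifier call if needed."""
    signals = ["multiple services", "intermittent", "no obvious deploy", "cascading"]
    return any(s in incident.lower() for s in signals)

def answer_incident(incident: str):
    """Route simple incidents to RAG-only, complex ones to the ReAct agent."""
    if looks_complex(incident):
        state = run_react(incident)
        return state.messages[-1]      # the agent's final answer turn
    answer, _ctx = rag_answer(incident)
    return answer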
Output of Session C¶
- 30-query comparison RAG-only vs ReAct agent.
- Honest tradeoff write-up.
End-of-week artifact¶
- 5+ tools defined and implemented
- Tool-use loop with budget + timeout caps
- ReAct agent working with sample trajectories
- 30-query comparison RAG-only vs agent with all metrics
- Tradeoff analysis committed
End-of-week self-assessment¶
- I can write a tool-use loop from a blank file.
- I can articulate when an agent earns its cost vs when RAG suffices.
- My agent has guardrails (budget, steps, timeout)-not unbounded.
Common failure modes for this week¶
- No budget cap. Agents can blow $10 on a single task. Always cap.
- Treating "agent built" as "agent better." Compare honestly to the simpler baseline.
- Vague tool descriptions. Tools are prompts to the model; specific descriptions improve everything.
What's next (preview of M06-W02)¶
Reflection (self-critique), state management, Langfuse / LangSmith observability. Production-grade agent infrastructure.