Month 6-Week 2: Reflection, state management, observability¶
Week summary¶
- Goal: Add a self-reflection step. Externalize agent state with checkpoint/resume. Wire LLM observability with Langfuse or LangSmith. End-to-end traces visible for any agent run.
- Time: ~9 h over 3 sessions.
- Output: Agent with measured-effect reflection step; serialized state; traces in observability dashboard.
- Sequences relied on: 11-agents rungs 04, 05, 11; 13-llm-observability rungs 02, 06.
Why this week matters¶
Three rungs, each worth a week elsewhere:
1. Reflection: sometimes adds 5+ points of accuracy; sometimes just doubles cost. Measure.
2. State management: naive in-memory state is what breaks in production. Externalized state is what makes agents debuggable, replayable, resumable.
3. Observability: your existing strength applied to LLM workloads. This is your bridge sequence; it's what you'll write the most career-leveraged blog post about (M06-W04).
Prerequisites¶
- M06-W01 complete (ReAct agent + comparison).
- Pydantic fluency from M04.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): reflection
- Session B-Sat morning (~3.5 h): state + checkpoint
- Session C-Sun afternoon (~2.5 h): observability + ship
Session A-Reflection: Reflexion + Self-Refine¶
Goal: Read both papers. Add a critique-revise step. Measure effect.
Part 1-Read (60 min)¶
Reflexion (arxiv.org/abs/2303.11366): agents reflect on failed trajectories, store reflections in episodic memory, and reuse them in future attempts.
Self-Refine (arxiv.org/abs/2303.17651): produce → critique → revise → produce, iteratively. No external memory needed.
For your project, Self-Refine is simpler and more useful. Reflexion's value compounds across many attempts; you have one-shot incidents.
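For orientation only, the Reflexion loop is roughly this shape (a schematic, not part of this project; attempt, evaluate, and reflect_on_failure are hypothetical helpers):

def reflexion_loop(task: str, max_attempts: int = 3) -> str:
    reflections: list[str] = []                # episodic memory, carried across attempts
    for _ in range(max_attempts):
        answer = attempt(task, reflections)    # hypothetical: one full trajectory, conditioned on past reflections
        ok, feedback = evaluate(answer)        # hypothetical: environment signal or judge
        if ok:
            return answer
        reflections.append(reflect_on_failure(answer, feedback))  # verbal self-reflection stored for the next try
    return answer

The loop only pays off when the same task (or task family) is attempted repeatedly, which is why Self-Refine fits one-shot incident triage better.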
Part 2-Implement Self-Refine (90 min)¶
CRITIQUE_PROMPT = """Critique this incident triage report. List specific issues:
- Are claims supported by the evidence gathered?
- Are recommended actions concrete and prioritized?
- Is severity reasonable?
- What's missing?
Output strict JSON:
{"issues": ["...", "..."], "verdict": "good|needs_revision"}
Original input: <<<INPUT>>>
Report: <<<REPORT>>>
"""
REVISE_PROMPT = """The original triage produced this report:
<<<ORIGINAL>>>
A critique noted these issues:
<<<ISSUES>>>
Produce an improved report addressing the issues. If the original is good as-is,
return it unchanged."""
import json

def run_react_with_reflection(initial: str) -> AgentState:
    # Phase 1: original triage (run_react, extract_answer, client, MODEL come from the M06-W01 agent)
    state = run_react(initial)
    original_answer = extract_answer(state)  # the report text produced by the first pass
    # Phase 2: critique, using CRITIQUE_PROMPT with the placeholders filled in
    critique_prompt = (CRITIQUE_PROMPT.replace("<<<INPUT>>>", initial)
                                      .replace("<<<REPORT>>>", original_answer))
    critique = client.messages.create(model=MODEL, max_tokens=1024,
                                      messages=[{"role": "user", "content": critique_prompt}])
    parsed = json.loads(critique.content[0].text)
    if parsed["verdict"] == "good":
        return state  # no need to revise
    # Phase 3: revise, feeding the critique's issue list back in via REVISE_PROMPT
    revise_prompt = (REVISE_PROMPT.replace("<<<ORIGINAL>>>", original_answer)
                                  .replace("<<<ISSUES>>>", "\n".join(parsed["issues"])))
    revised = client.messages.create(model=MODEL, max_tokens=2048,
                                     messages=[{"role": "user", "content": revise_prompt}])
    state.messages.append({"role": "assistant", "content": revised.content})
    return state
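The sketch above trusts json.loads plus a dict lookup. Since Pydantic is already in the stack (M04), a slightly sturdier variant validates the critique before acting on its verdict. The Critique model and parse_critique helper below are illustrative, not part of the prompt spec:

from typing import Literal
from pydantic import BaseModel, ValidationError

class Critique(BaseModel):
    issues: list[str]
    verdict: Literal["good", "needs_revision"]

def parse_critique(raw: str) -> Critique:
    # If the model returns malformed JSON, default to "needs_revision" so a bad
    # critique never silently skips the revise step.
    try:
        return Critique.model_validate_json(raw)
    except ValidationError:
        return Critique(issues=["critique output was not valid JSON"], verdict="needs_revision")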
Part 3-Measure effect (30 min)¶
Run with-reflection on 30 incidents. Compare vs without:

| Metric | No reflection | With reflection | Δ |
|---|---|---|---|
| Pass rate | 0.82 | 0.86 | +0.04 |
| Faithfulness (judge) | 4.3 | 4.4 | +0.1 |
| Cost per task | $0.087 | $0.158 | +$0.071 |
| Latency | 8.4 s | 14.2 s | +5.8 s |
Honest takeaway: reflection added ~4 percentage points of pass rate at ~2× the cost. Worth it for high-stakes incidents; not for volume.
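A minimal sketch of how a table like this can be produced, assuming the golden incidents from M06-W01 plus hypothetical passes() and cost_of() helpers for grading and cost accounting:

import statistics

def compare(incidents: list[str]) -> None:
    # Run both variants over the same incidents and aggregate pass rate and cost.
    rows = []
    for inc in incidents:
        base = run_react(inc)
        refl = run_react_with_reflection(inc)
        rows.append({
            "base_pass": passes(base, inc),   # hypothetical judge/grader
            "refl_pass": passes(refl, inc),
            "base_cost": cost_of(base),       # hypothetical cost-accounting helper
            "refl_cost": cost_of(refl),
        })
    print("pass rate (no reflection):  ", statistics.mean(r["base_pass"] for r in rows))
    print("pass rate (with reflection):", statistics.mean(r["refl_pass"] for r in rows))
    print("cost/task (no reflection):  ", statistics.mean(r["base_cost"] for r in rows))
    print("cost/task (with reflection):", statistics.mean(r["refl_cost"] for r in rows))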
Output of Session A¶
- Self-Refine implemented.
- Measured comparison.
Session B-State management with checkpoint/resume¶
Goal: Externalize agent state. Implement save → reload → continue. Useful for debugging and for production resumability.
Part 1-Define a serializable state (60 min)¶
from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class ToolCallRecord(BaseModel):
    step: int
    tool: str
    input_args: dict
    output: str
    started_at: datetime
    completed_at: datetime
    error: str | None = None

class MessageRecord(BaseModel):
    role: str
    content: str | list  # supports complex Anthropic content (text + tool_use blocks)
    timestamp: datetime

class AgentRun(BaseModel):
    run_id: str
    incident_id: str
    messages: list[MessageRecord]
    tool_calls: list[ToolCallRecord]
    cost_so_far_usd: float = 0.0
    started_at: datetime
    completed_at: datetime | None = None
    status: Literal["running", "completed", "failed", "budget_exceeded", "max_steps_exceeded"]
    final_answer: dict | None = None  # IncidentReport JSON
Save to disk (or SQLite) after every step:

from pathlib import Path

def save_state(state: AgentRun, dir="runs"):
    path = f"{dir}/{state.run_id}.json"
    Path(path).write_text(state.model_dump_json(indent=2))

def load_state(run_id: str, dir="runs") -> AgentRun:
    return AgentRun.model_validate_json(Path(f"{dir}/{run_id}.json").read_text())
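Hooked into the agent loop, checkpointing after every step means a crash at step N loses nothing before it. A sketch, assuming create_state builds an AgentRun and execute_step (a hypothetical name) is the existing LLM-call-plus-tools body of the loop:

def run_react_checkpointed(initial: str) -> AgentRun:
    state = create_state(initial)   # assumed to return an AgentRun with status="running"
    save_state(state)               # checkpoint 0: before the first model call
    for step in range(max_steps):   # max_steps as in the existing loop
        execute_step(state, step)   # hypothetical: one LLM call + tool calls, recorded on state
        save_state(state)           # checkpoint after every step
        if state.status != "running":
            break
    return state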
Part 2-Resume (60 min)¶
def resume_run(run_id: str) -> AgentRun:
    state = load_state(run_id)
    if state.status != "running":
        raise ValueError(f"Cannot resume run in status {state.status}")
    # Continue the loop from where we left off
    return continue_react(state)
Test it: start a run, kill it midway, resume from disk, and let it complete. The saved JSON, keyed by run_id, is now your full audit log of the run.
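Before the kill-and-resume test proper, a quick serialization round-trip check catches schema problems early (pytest-style sketch; the field values are made up):

from datetime import datetime, timezone

def test_state_round_trip(tmp_path):
    # Build a minimal in-flight run, write it, read it back.
    run = AgentRun(
        run_id="run-001",
        incident_id="inc-042",
        messages=[MessageRecord(role="user", content="db-primary: CPU at 98%",
                                timestamp=datetime.now(timezone.utc))],
        tool_calls=[],
        started_at=datetime.now(timezone.utc),
        status="running",
    )
    save_state(run, dir=str(tmp_path))
    loaded = load_state("run-001", dir=str(tmp_path))
    assert loaded.run_id == run.run_id
    assert loaded.messages[0].content == run.messages[0].content
    assert loaded.status == "running"  # only "running" runs are resumable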
Part 3-Why this matters (30 min)¶
Write a 200-word note: "How externalized state makes agents debuggable."
Likely points:
- Replay failed runs without re-paying API costs.
- Debugging a step's reasoning is just reading the JSON.
- Resumability lets long agent runs survive crashes.
- Audit log for sensitive deployments (compliance).
Output of Session B¶
- AgentRun Pydantic model.
- save_state / load_state / resume_run.
- Resumability test passing.
Session C-Observability with Langfuse or LangSmith¶
Goal: Wire traces. Every agent run produces a parent trace with each step as a child span. Inspect failed runs in the dashboard.
Part 1-Pick + setup (45 min)¶
- Langfuse: open source, self-hostable, MIT-licensed core. Strong tracing primitives.
- LangSmith: managed, by LangChain. Strong UI, more out of the box.
For learning + portability, Langfuse wins. For minimum setup, LangSmith does.
# Langfuse self-hosted (note: Langfuse also needs a Postgres database, so the
# docker compose setup from the langfuse/langfuse repo is the easier path)
docker run --rm -p 3000:3000 langfuse/langfuse
# Or use cloud: langfuse.com (free tier)
from langfuse import Langfuse
lf = Langfuse(public_key="...", secret_key="...", host="http://localhost:3000")
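If you'd rather not hard-code keys, the client can also be configured from the environment (these are the standard Langfuse variable names; values go in your shell or .env):

# Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST are set in the environment.
lf = Langfuse()  # no arguments needed; credentials come from the environment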
Part 2-Trace agent runs (75 min)¶
from langfuse import observe  # v3 SDK; on the v2 SDK this is `from langfuse.decorators import observe`

@observe(name="incident_triage_agent")
def run_react_traced(initial: str) -> AgentRun:
    # Span API below follows the Langfuse v3 Python SDK (start_as_current_span /
    # start_as_current_generation as context managers); the v2 SDK uses lf.span(...).end() instead.
    state = create_state(initial)
    for step in range(max_steps):
        with lf.start_as_current_span(name=f"step_{step}"):
            # The LLM call goes in a generation so token usage (and cost) show up per call.
            with lf.start_as_current_generation(name="llm_call",
                                                input={"messages": state.messages[-3:]}) as gen:
                resp = client.messages.create(...)  # same call as the untraced loop
                gen.update(output={"content": resp.content},
                           usage_details={"input": resp.usage.input_tokens,
                                          "output": resp.usage.output_tokens})
            # Tool calls also get their own child spans.
            for tu in tool_uses:  # tool_uses extracted from resp.content, as in the untraced loop
                with lf.start_as_current_span(name=f"tool_{tu.name}", input=tu.input) as tspan:
                    result = TOOL_REGISTRY[tu.name](**tu.input)
                    tspan.update(output=result)
    return state
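One operational detail: the SDK sends trace events asynchronously in the background, so a short-lived script can exit before they are delivered. Flushing before exit avoids an empty dashboard (the incident string below is just an example):

if __name__ == "__main__":
    run_react_traced("db-primary: CPU at 98%, p99 latency spiking, alerts firing")  # sample incident text
    lf.flush()  # block until queued trace events have been sent to Langfuse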
Part 3-Inspect a failed run (45 min)¶
Run on 5 incidents. Open the Langfuse dashboard. Pick a failed (or just multi-step) run. Walk through the trace:
- Each step's input + output visible.
- Each tool call's args + result visible.
- Token usage + cost per call.
- Total latency.
Could you debug from this alone? That's the test of good observability.
Push v0.6.0. Update README with screenshots from Langfuse.
Output of Session C¶
- Langfuse running.
- Traces wired into agent runs.
- Screenshot of a trace in README.
End-of-week artifact¶
- Reflection step measured against no-reflection baseline
- Externalized agent state (Pydantic) with save/load/resume
- Langfuse tracing wired into agent runs
- Trace screenshots in README
End-of-week self-assessment¶
- I can argue for or against reflection on a given workload with data.
- I can resume a killed agent run from saved state.
- I can debug an agent run from its trace alone.
Common failure modes for this week¶
- Adding reflection without measuring. Cost doubles for nothing measurable.
- In-memory state in "production" code. Restart kills everything.
- Traces too coarse. If you can't see tool args + outputs, the trace is decoration.
What's next (preview of M06-W03)¶
Adopt a real eval harness (Inspect AI). Migrate your golden set + metrics. Set up regression detection for prompt changes. Online eval prep.