
Month 6-Week 2: Reflection, state management, observability

Week summary

  • Goal: Add a self-reflection step. Externalize agent state with checkpoint/resume. Wire LLM observability with Langfuse or LangSmith. End-to-end traces visible for any agent run.
  • Time: ~9 h over 3 sessions.
  • Output: Agent with measured-effect reflection step; serialized state; traces in observability dashboard.
  • Sequences relied on: 11-agents rungs 04, 05, 11; 13-llm-observability rungs 02, 06.

Why this week matters

Three rungs, each worth a week elsewhere:

  1. Reflection: sometimes adds 5+ points of accuracy; sometimes just doubles cost. Measure.
  2. State management: naive in-memory state is what breaks in production. Externalized state is what makes agents debuggable, replayable, and resumable.
  3. Observability: your existing strength applied to LLM workloads. This is your bridge sequence; it's what you'll write the most career-leveraged blog post about (M06-W04).

Prerequisites

  • M06-W01 complete (ReAct agent + comparison).
  • Pydantic fluency from M04.

Session plan

  • Session A-Tue/Wed evening (~3 h): reflection
  • Session B-Sat morning (~3.5 h): state + checkpoint
  • Session C-Sun afternoon (~2.5 h): observability + ship

Session A-Reflection: Reflexion + Self-Refine

Goal: Read both papers. Add a critique-revise step. Measure effect.

Part 1-Read (60 min)

Reflexion (arxiv.org/abs/2303.11366): agents reflect on failed trajectories, store reflections in episodic memory, and reuse them in future attempts.

Self-Refine (arxiv.org/abs/2303.17651): produce → critique → revise → produce, iteratively. No external memory needed.

For your project, Self-Refine is simpler and more useful. Reflexion's value compounds across many attempts; you have one-shot incidents.

Part 2-Implement Self-Refine (90 min)

CRITIQUE_PROMPT = """Critique this incident triage report. List specific issues:
- Are claims supported by the evidence gathered?
- Are recommended actions concrete and prioritized?
- Is severity reasonable?
- What's missing?

Output strict JSON:
{"issues": ["...", "..."], "verdict": "good|needs_revision"}

Original input: <<<INPUT>>>
Report: <<<REPORT>>>
"""

REVISE_PROMPT = """The original triage produced this report:
<<<ORIGINAL>>>

A critique noted these issues:
<<<ISSUES>>>

Produce an improved report addressing the issues. If the original is good as-is,
return it unchanged."""

import json

def run_react_with_reflection(initial: str) -> AgentState:
    # Phase 1: original triage (the W01 ReAct loop)
    state = run_react(initial)
    original_answer = extract_answer(state)

    # Phase 2: critique the report against the original input
    critique_prompt = (CRITIQUE_PROMPT
                       .replace("<<<INPUT>>>", initial)
                       .replace("<<<REPORT>>>", original_answer))
    critique = client.messages.create(  # same model/client as the main loop
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": critique_prompt}],
    )
    parsed = json.loads(critique.content[0].text)

    if parsed["verdict"] == "good":
        return state  # no need to revise

    # Phase 3: revise, feeding the critique's issues back in
    revise_prompt = (REVISE_PROMPT
                     .replace("<<<ORIGINAL>>>", original_answer)
                     .replace("<<<ISSUES>>>", "\n".join(f"- {i}" for i in parsed["issues"])))
    revised = client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{"role": "user", "content": revise_prompt}],
    )
    state.messages.append({"role": "assistant", "content": revised.content[0].text})
    return state

Part 3-Measure effect (30 min)

Run with-reflection on 30 incidents. Compare vs. without:

| Metric | No reflection | With reflection | Δ |
|---|---|---|---|
| Pass rate | 0.82 | 0.86 | +0.04 |
| Faithfulness (judge) | 4.3 | 4.4 | +0.1 |
| Cost per task | $0.087 | $0.158 | +$0.071 |
| Latency | 8.4 s | 14.2 s | +5.8 s |

Honest takeaway: reflection bought ~4 percentage points of pass rate at roughly 2× the cost and latency. Worth it for high-stakes incidents; not for volume.
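The comparison harness can stay tiny. A sketch, assuming the golden-incident list and pass/fail grader from M06-W01 (golden_incidents, grade, and cost_of are placeholder names for whatever you built there):

def compare_reflection(golden_incidents: list[dict]) -> dict:
    rows = []
    for inc in golden_incidents:                       # ~30 incidents
        base = run_react(inc["input"])
        refl = run_react_with_reflection(inc["input"])
        rows.append({
            "base_pass": grade(inc, extract_answer(base)),   # judge/rubric from W01
            "refl_pass": grade(inc, extract_answer(refl)),
            "base_cost": cost_of(base),
            "refl_cost": cost_of(refl),
        })
    n = len(rows)
    return {
        "pass_rate": (sum(r["base_pass"] for r in rows) / n,
                      sum(r["refl_pass"] for r in rows) / n),
        "cost_per_task": (sum(r["base_cost"] for r in rows) / n,
                          sum(r["refl_cost"] for r in rows) / n),
    }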

Output of Session A

  • Self-Refine implemented.
  • Measured comparison.

Session B-State management with checkpoint/resume

Goal: Externalize agent state. Implement save → reload → continue. Useful for debugging and for production resumability.

Part 1-Define a serializable state (60 min)

from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class ToolCallRecord(BaseModel):
    step: int
    tool: str
    input_args: dict
    output: str
    started_at: datetime
    completed_at: datetime
    error: str | None = None

class MessageRecord(BaseModel):
    role: str
    content: str | list  # supports complex Anthropic content
    timestamp: datetime

class AgentRun(BaseModel):
    run_id: str
    incident_id: str
    messages: list[MessageRecord]
    tool_calls: list[ToolCallRecord]
    cost_so_far_usd: float = 0.0
    started_at: datetime
    completed_at: datetime | None = None
    status: Literal["running", "completed", "failed", "budget_exceeded", "max_steps_exceeded"]
    final_answer: dict | None = None  # IncidentReport JSON

Save to disk (or sqlite) after every step:

from pathlib import Path

def save_state(state: AgentRun, dir: str = "runs") -> None:
    Path(dir).mkdir(exist_ok=True)
    Path(dir, f"{state.run_id}.json").write_text(state.model_dump_json(indent=2))

def load_state(run_id: str, dir: str = "runs") -> AgentRun:
    return AgentRun.model_validate_json(Path(dir, f"{run_id}.json").read_text())
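Where the call goes: at the end of every loop iteration, so a crash at step N loses at most that step. A sketch against the W01 loop (record_message, run_tools, and is_final are placeholder names for whatever your loop already does):

# Checkpoint inside the ReAct loop from M06-W01 (helper names are illustrative)
for step in range(max_steps):
    resp = client.messages.create(...)   # one reasoning step
    record_message(state, resp)          # append MessageRecord(s) to state.messages
    run_tools(state, resp)               # execute tools, append ToolCallRecord(s)
    save_state(state)                    # checkpoint after every step
    if is_final(resp):
        state.status = "completed"
        state.completed_at = datetime.now()
        save_state(state)
        break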

Part 2-Resume (60 min)

def resume_run(run_id: str) -> AgentRun:
    state = load_state(run_id)
    if state.status != "running":
        raise ValueError(f"Cannot resume run in status {state.status}")
    # Continue the loop from where we left off
    return continue_react(state)
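continue_react is the piece left to write: rebuild the provider-format message list from state.messages, prime the step counter from the checkpoint, and re-enter the same loop. A sketch under those assumptions (max_steps and is_final are the same placeholders as above):

def continue_react(state: AgentRun) -> AgentRun:
    # Rehydrate the provider-format history from the checkpoint
    messages = [{"role": m.role, "content": m.content} for m in state.messages]
    steps_done = len(state.tool_calls)

    for step in range(steps_done, max_steps):
        resp = client.messages.create(...)   # same call as the fresh loop, passing `messages`
        # ... append MessageRecord/ToolCallRecord to state, execute tools, save_state(state) ...
        if is_final(resp):
            state.status = "completed"
            break
    else:
        state.status = "max_steps_exceeded"

    state.completed_at = datetime.now()
    save_state(state)
    return state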

Test it: start a run, kill it midway, resume from disk, and let it complete. The run's JSON file, keyed by run_id, is now your full audit log.
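A quick REPL check of the whole path, assuming you interrupted a run with Ctrl-C and copied its run_id (the id below is a placeholder):

state = load_state("run_abc123")          # placeholder id of the interrupted run
assert state.status == "running"          # the crash left it mid-flight
resumed = resume_run(state.run_id)
assert resumed.status == "completed"
assert resumed.final_answer is not None   # full report, despite the crash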

Part 3-Why this matters (30 min)

Write a 200-word note: "How externalized state makes agents debuggable."

Likely points:

  • Replay failed runs without re-paying API costs.
  • Debugging a step's reasoning is just reading the JSON.
  • Resumability lets long agent runs survive crashes.
  • An audit log for sensitive deployments (compliance).

Output of Session B

  • AgentRun Pydantic model.
  • save_state / load_state / resume_run.
  • Resumability test passing.

Session C-Observability with Langfuse or LangSmith

Goal: Wire traces. Every agent run produces a parent trace with each step as a child span. Inspect failed runs in the dashboard.

Part 1-Pick + setup (45 min)

Langfuse: open source (MIT-licensed core), self-hostable. Strong tracing primitives. LangSmith: managed, by LangChain. Strong UI, more out of the box.

For learning + portability, Langfuse wins. For minimum setup, LangSmith does.

# Langfuse self-hosted: clone the repo and run its docker compose stack (needs Postgres)
#   git clone https://github.com/langfuse/langfuse.git && cd langfuse && docker compose up
# Or use the cloud free tier at langfuse.com
from langfuse import Langfuse

lf = Langfuse(public_key="...", secret_key="...", host="http://localhost:3000")

Part 2-Trace agent runs (75 min)

from langfuse.decorators import observe

# Span API details differ between Langfuse SDK versions; this sketch assumes a
# context-manager-style span with an update() method. Check the docs for the version you install.
@observe(name="incident_triage_agent")
def run_react_traced(initial: str) -> AgentRun:
    state = create_state(initial)
    for step in range(max_steps):
        # One child span per reasoning step
        with lf.span(name=f"step_{step}", input={"messages": state.messages[-3:]}) as span:
            resp = client.messages.create(...)
            span.update(output={"content": resp.content},
                        usage={"input": resp.usage.input_tokens, "output": resp.usage.output_tokens})
        # Each tool call gets its own child span
        tool_uses = [block for block in resp.content if block.type == "tool_use"]
        for tu in tool_uses:
            with lf.span(name=f"tool_{tu.name}", input=tu.input) as tspan:
                result = TOOL_REGISTRY[tu.name](**tu.input)
                tspan.update(output=result)
        if not tool_uses:
            break  # no tool calls means the model produced its final answer
    return state

Part 3-Inspect a failed run (45 min)

Run on 5 incidents. Open the Langfuse dashboard. Pick a failed (or just multi-step) run. Walk through the trace:

  • Each step's input + output visible.
  • Each tool call's args + result visible.
  • Token usage + cost per call.
  • Total latency.
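Generating those runs is a short driver; lf.flush() sends any buffered spans before you open the dashboard (load_golden_incidents is a placeholder for however you load your golden set):

incidents = load_golden_incidents()[:5]   # any 5 incidents from your golden set
for inc in incidents:
    run_react_traced(inc["input"])
lf.flush()                                # push buffered events to Langfuse before inspecting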

Could you debug from this alone? That's the test of good observability.

Push v0.6.0. Update README with screenshots from Langfuse.

Output of Session C

  • Langfuse running.
  • Traces wired into agent runs.
  • Screenshot of a trace in README.

End-of-week artifact

  • Reflection step measured against no-reflection baseline
  • Externalized agent state (Pydantic) with save/load/resume
  • Langfuse tracing wired into agent runs
  • Trace screenshots in README

End-of-week self-assessment

  • I can argue for or against reflection on a given workload with data.
  • I can resume a killed agent run from saved state.
  • I can debug an agent run from its trace alone.

Common failure modes for this week

  • Adding reflection without measuring. Cost doubles for nothing measurable.
  • In-memory state in "production" code. Restart kills everything.
  • Traces too coarse. If you can't see tool args + outputs, the trace is decoration.

What's next (preview of M06-W03)

Adopt a real eval harness (Inspect AI). Migrate your golden set + metrics. Set up regression detection for prompt changes. Online eval prep.
