
Month 6-Week 2: Reflection, state management, observability

Week summary

  • Goal: Add a self-reflection step. Externalize agent state with checkpoint/resume. Wire LLM observability with Langfuse or LangSmith. End-to-end traces visible for any agent run.
  • Time: ~9 h over 3 sessions.
  • Output: Agent with measured-effect reflection step; serialized state; traces in observability dashboard.
  • Sequences relied on: 11-agents rungs 04, 05, 11; 13-llm-observability rungs 02, 06.

Why this week matters

Three rungs, each worth a week elsewhere:

  1. Reflection: sometimes adds 5+ points of accuracy; sometimes just doubles cost. Measure.
  2. State management: naive in-memory state is what breaks in production. Externalized state is what makes agents debuggable, replayable, and resumable.
  3. Observability: your existing strength applied to LLM workloads. This is your bridge sequence; it's what you'll write the most career-leveraged blog post about (M06-W04).

Prerequisites

  • M06-W01 complete (ReAct agent + comparison).
  • Pydantic fluency from M04.

Session plan

  • Session A-Tue/Wed evening (~3 h): reflection
  • Session B-Sat morning (~3.5 h): state + checkpoint
  • Session C-Sun afternoon (~2.5 h): observability + ship

Session A-Reflection: Reflexion + Self-Refine

Goal: Read both papers. Add a critique-revise step. Measure effect.

Part 1-Read (60 min)

Reflexion (arxiv.org/abs/2303.11366): agents reflect on failed trajectories, store reflections in episodic memory, and reuse them in future attempts.

Self-Refine (arxiv.org/abs/2303.17651): produce → critique → revise → produce, iteratively. No external memory needed.

For your project, Self-Refine is simpler and more useful. Reflexion's value compounds across many attempts; you have one-shot incidents.

Part 2-Implement Self-Refine (90 min)

CRITIQUE_PROMPT = """Critique this incident triage report. List specific issues:
- Are claims supported by the evidence gathered?
- Are recommended actions concrete and prioritized?
- Is severity reasonable?
- What's missing?

Output strict JSON:
{"issues": ["...", "..."], "verdict": "good|needs_revision"}

Original input: <<<INPUT>>>
Report: <<<REPORT>>>
"""

REVISE_PROMPT = """The original triage produced this report:
<<<ORIGINAL>>>

A critique noted these issues:
<<<ISSUES>>>

Produce an improved report addressing the issues. If the original is good as-is,
return it unchanged."""

import json

def run_react_with_reflection(initial: str) -> AgentState:
    # Phase 1: original triage (the W01 ReAct loop)
    state = run_react(initial)
    original_answer = extract_answer(state)

    # Phase 2: critique the report against the original input
    critique_prompt = (CRITIQUE_PROMPT
                       .replace("<<<INPUT>>>", initial)
                       .replace("<<<REPORT>>>", original_answer))
    critique = client.messages.create(  # same model/client as the main loop
        model=MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": critique_prompt}],
    )
    parsed = json.loads(critique.content[0].text)

    if parsed["verdict"] == "good":
        return state  # no need to revise

    # Phase 3: revise, feeding the critique's issues back in
    revise_prompt = (REVISE_PROMPT
                     .replace("<<<ORIGINAL>>>", original_answer)
                     .replace("<<<ISSUES>>>", "\n".join(f"- {i}" for i in parsed["issues"])))
    revised = client.messages.create(
        model=MODEL, max_tokens=2048,
        messages=[{"role": "user", "content": revise_prompt}],
    )
    state.messages.append({"role": "assistant", "content": revised.content[0].text})
    return state

Part 3-Measure effect (30 min)

Run with-reflection on 30 incidents. Compare vs. without:

| Metric | No reflection | With reflection | Δ |
|---|---|---|---|
| Pass rate | 0.82 | 0.86 | +0.04 |
| Faithfulness (judge) | 4.3 | 4.4 | +0.1 |
| Cost per task | $0.087 | $0.158 | +$0.071 |
| Latency | 8.4 s | 14.2 s | +5.8 s |

Honest takeaway: reflection bought ~4 percentage points of pass rate at roughly 2× the cost and latency. Worth it for high-stakes incidents; not for volume.
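The comparison harness can stay tiny. A sketch, assuming the golden-incident list and pass/fail grader from M06-W01 (golden_incidents, grade, and cost_of are placeholder names for whatever you built there):

def compare_reflection(golden_incidents: list[dict]) -> dict:
    rows = []
    for inc in golden_incidents:                       # ~30 incidents
        base = run_react(inc["input"])
        refl = run_react_with_reflection(inc["input"])
        rows.append({
            "base_pass": grade(inc, extract_answer(base)),   # judge/rubric from W01
            "refl_pass": grade(inc, extract_answer(refl)),
            "base_cost": cost_of(base),
            "refl_cost": cost_of(refl),
        })
    n = len(rows)
    return {
        "pass_rate": (sum(r["base_pass"] for r in rows) / n,
                      sum(r["refl_pass"] for r in rows) / n),
        "cost_per_task": (sum(r["base_cost"] for r in rows) / n,
                          sum(r["refl_cost"] for r in rows) / n),
    }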

Output of Session A

  • Self-Refine implemented.
  • Measured comparison.

Session B-State management with checkpoint/resume

Goal: Externalize agent state. Implement save → reload → continue. Useful for debugging and for production resumability.

Part 1-Define a serializable state (60 min)

from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class ToolCallRecord(BaseModel):
    step: int
    tool: str
    input_args: dict
    output: str
    started_at: datetime
    completed_at: datetime
    error: str | None = None

class MessageRecord(BaseModel):
    role: str
    content: str | list  # supports complex Anthropic content
    timestamp: datetime

class AgentRun(BaseModel):
    run_id: str
    incident_id: str
    messages: list[MessageRecord]
    tool_calls: list[ToolCallRecord]
    cost_so_far_usd: float = 0.0
    started_at: datetime
    completed_at: datetime | None = None
    status: Literal["running", "completed", "failed", "budget_exceeded", "max_steps_exceeded"]
    final_answer: dict | None = None  # IncidentReport JSON

Save to disk (or sqlite) after every step:

from pathlib import Path

def save_state(state: AgentRun, dir: str = "runs") -> None:
    Path(dir).mkdir(exist_ok=True)
    Path(dir, f"{state.run_id}.json").write_text(state.model_dump_json(indent=2))

def load_state(run_id: str, dir: str = "runs") -> AgentRun:
    return AgentRun.model_validate_json(Path(dir, f"{run_id}.json").read_text())
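Where the call goes: at the end of every loop iteration, so a crash at step N loses at most that step. A sketch against the W01 loop (record_message, run_tools, and is_final are placeholder names for whatever your loop already does):

# Checkpoint inside the ReAct loop from M06-W01 (helper names are illustrative)
for step in range(max_steps):
    resp = client.messages.create(...)   # one reasoning step
    record_message(state, resp)          # append MessageRecord(s) to state.messages
    run_tools(state, resp)               # execute tools, append ToolCallRecord(s)
    save_state(state)                    # checkpoint after every step
    if is_final(resp):
        state.status = "completed"
        state.completed_at = datetime.now()
        save_state(state)
        break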

Part 2-Resume (60 min)

def resume_run(run_id: str) -> AgentRun:
    state = load_state(run_id)
    if state.status != "running":
        raise ValueError(f"Cannot resume run in status {state.status}")
    # Continue the loop from where we left off
    return continue_react(state)
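continue_react is the piece left to write: rebuild the provider-format message list from state.messages, prime the step counter from the checkpoint, and re-enter the same loop. A sketch under those assumptions (max_steps and is_final are the same placeholders as above):

def continue_react(state: AgentRun) -> AgentRun:
    # Rehydrate the provider-format history from the checkpoint
    messages = [{"role": m.role, "content": m.content} for m in state.messages]
    steps_done = len(state.tool_calls)

    for step in range(steps_done, max_steps):
        resp = client.messages.create(...)   # same call as the fresh loop, passing `messages`
        # ... append MessageRecord/ToolCallRecord to state, execute tools, save_state(state) ...
        if is_final(resp):
            state.status = "completed"
            break
    else:
        state.status = "max_steps_exceeded"

    state.completed_at = datetime.now()
    save_state(state)
    return state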

Test it: start a run, kill it midway, resume from disk, and let it complete. The run's JSON file, keyed by run_id, is now your full audit log.
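A quick REPL check of the whole path, assuming you interrupted a run with Ctrl-C and copied its run_id (the id below is a placeholder):

state = load_state("run_abc123")          # placeholder id of the interrupted run
assert state.status == "running"          # the crash left it mid-flight
resumed = resume_run(state.run_id)
assert resumed.status == "completed"
assert resumed.final_answer is not None   # full report, despite the crash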

Part 3-Why this matters (30 min)

Write a 200-word note: "How externalized state makes agents debuggable."

Likely points:

  • Replay failed runs without re-paying API costs.
  • Debugging a step's reasoning is just reading the JSON.
  • Resumability lets long agent runs survive crashes.
  • An audit log for sensitive deployments (compliance).

Output of Session B

  • AgentRun Pydantic model.
  • save_state / load_state / resume_run.
  • Resumability test passing.

Session C-Observability with Langfuse or LangSmith

Goal: Wire traces. Every agent run produces a parent trace with each step as a child span. Inspect failed runs in the dashboard.

Part 1-Pick + setup (45 min)

Langfuse: open source (MIT-licensed core), self-hostable. Strong tracing primitives. LangSmith: managed, by LangChain. Strong UI, more out of the box.

For learning + portability, Langfuse wins. For minimum setup, LangSmith does.

# Langfuse self-hosted: clone the repo and run its docker compose stack (needs Postgres)
#   git clone https://github.com/langfuse/langfuse.git && cd langfuse && docker compose up
# Or use the cloud free tier at langfuse.com
from langfuse import Langfuse

lf = Langfuse(public_key="...", secret_key="...", host="http://localhost:3000")

Part 2-Trace agent runs (75 min)

from langfuse.decorators import observe

# Span API details differ between Langfuse SDK versions; this sketch assumes a
# context-manager-style span with an update() method. Check the docs for the version you install.
@observe(name="incident_triage_agent")
def run_react_traced(initial: str) -> AgentRun:
    state = create_state(initial)
    for step in range(max_steps):
        # One child span per reasoning step
        with lf.span(name=f"step_{step}", input={"messages": state.messages[-3:]}) as span:
            resp = client.messages.create(...)
            span.update(output={"content": resp.content},
                        usage={"input": resp.usage.input_tokens, "output": resp.usage.output_tokens})
        # Each tool call gets its own child span
        tool_uses = [block for block in resp.content if block.type == "tool_use"]
        for tu in tool_uses:
            with lf.span(name=f"tool_{tu.name}", input=tu.input) as tspan:
                result = TOOL_REGISTRY[tu.name](**tu.input)
                tspan.update(output=result)
        if not tool_uses:
            break  # no tool calls means the model produced its final answer
    return state

Part 3-Inspect a failed run (45 min)

Run on 5 incidents. Open the Langfuse dashboard. Pick a failed (or just multi-step) run. Walk through the trace:

  • Each step's input + output visible.
  • Each tool call's args + result visible.
  • Token usage + cost per call.
  • Total latency.
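Generating those runs is a short driver; lf.flush() sends any buffered spans before you open the dashboard (load_golden_incidents is a placeholder for however you load your golden set):

incidents = load_golden_incidents()[:5]   # any 5 incidents from your golden set
for inc in incidents:
    run_react_traced(inc["input"])
lf.flush()                                # push buffered events to Langfuse before inspecting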

Could you debug from this alone? That's the test of good observability.

Push v0.6.0. Update README with screenshots from Langfuse.

Output of Session C

  • Langfuse running.
  • Traces wired into agent runs.
  • Screenshot of a trace in README.

End-of-week artifact

  • Reflection step measured against no-reflection baseline
  • Externalized agent state (Pydantic) with save/load/resume
  • Langfuse tracing wired into agent runs
  • Trace screenshots in README

End-of-week self-assessment

  • I can argue for or against reflection on a given workload with data.
  • I can resume a killed agent run from saved state.
  • I can debug an agent run from its trace alone.

Common failure modes for this week

  • Adding reflection without measuring. Cost doubles for nothing measurable.
  • In-memory state in "production" code. Restart kills everything.
  • Traces too coarse. If you can't see tool args + outputs, the trace is decoration.

What's next (preview of M06-W03)

Adopt a real eval harness (Inspect AI). Migrate your golden set + metrics. Set up regression detection for prompt changes. Online eval prep.
