
Workshop - Multi-agent supervisor pattern

Difficulty: Deep | Time: 90 min
Needs: Python 3.11+, httpx, Anthropic API key (~$0.50 in tokens), the agent kernel from Workshop 4

Before you start:

Launch in Killercoda: a free browser-based environment, no install required to follow along.

Companion to AI Systems -> Month 05 -> Week 17-18: Serving Systems, and the seventh in the AI implementations workshop series. By now you can build a single-agent loop (Workshop 4) and you can give it tools (Workshops 1-3). This workshop is about what happens when one agent isn't enough - when the task naturally decomposes into specialist sub-tasks that benefit from independent context, parallel execution, or different prompting. By the end you'll have built a supervisor that dispatches to three specialists (researcher, writer, reviewer), watched it fan work out in parallel on a real task, and learned when this pattern earns its complexity and when it's overkill.

~90 minutes. Needs: Python 3.11+, Anthropic API key (~$0.50 in tokens for the workshop), the agent kernel from Workshop 4 as a starting point. No GPU.

What you'll build, and the idea it makes concrete

You'll build a research-and-write system: the user asks "write a 500-word brief on X" and the system runs three specialists in coordination. A researcher gathers facts via a search tool. A writer drafts the brief from those facts. A reviewer critiques the draft against quality criteria and the source facts. A supervisor decides who runs when, in what order, and when the work is done. You'll see the orchestration graph fire in real time, measure the cost and latency overhead vs. a single-agent baseline, and stress-test it with the three classic multi-agent failure modes.

The idea this makes concrete:

Multi-agent is a context-isolation pattern, not a capability pattern. A single Claude with the right prompts can do research, writing, and review - it doesn't need three agents to do those three things. What three agents buy you is separate contexts: the researcher's prompt is optimized for thorough search and doesn't have to apologize for being verbose; the writer's context contains only the gathered facts, not the search noise; the reviewer's context contains only the draft and the rubric, not the writer's thought process. Each agent thinks better in a smaller, focused context than one agent thinks in a bigger, mixed one. The cost is coordination overhead; the benefit is sharper sub-task performance. Pick the pattern when the sub-tasks have meaningfully different optimal prompts; skip it when they don't.

A second idea, equally important:

The supervisor is the load-bearing piece, not the specialists. Anyone can write a "researcher prompt." The hard part is the supervisor: deciding when work is done, how to handle a failed sub-agent, when to retry vs. accept partial results, how to merge conflicting outputs, when to escalate to the user. Most multi-agent failures in production are supervisor failures - infinite delegation loops, lost handoffs, the supervisor losing track of which sub-agent did what. Build the supervisor first; the specialists are easier than they look.

Step 0: the architecture you're about to assemble

                       +-----------+
                       |   USER    |
                       |  prompt   |
                       +-----+-----+
                             v
              +--------------+-------------+
              |       SUPERVISOR           |
              |  - parse task into stages  |
              |  - dispatch to specialists |
              |  - merge results           |
              |  - decide when done        |
              +--------------+-------------+
                  |          |          |
            (async fan-out: handoffs)
                  v          v          v
        +-----------+ +-----------+ +-----------+
        | RESEARCHER| |  WRITER   | |  REVIEWER |
        | (search   | | (draft    | | (critique |
        |  tools)   | |  tool)    | |  + rubric)|
        +-----------+ +-----------+ +-----------+
              |             |             |
           findings      draft        critique
              |             |             |
              +------> SUPERVISOR <-------+
                   (loop until done)
                          v
                   +-------------+
                   |   USER      |
                   | final brief |
                   +-------------+

A few things this diagram makes explicit:

  • In its general form (Step 5), the supervisor has its own LLM call. It is not deterministic routing code; it is a model reading the current state and deciding the next step. Deterministic routing (Workshop 4's loop, and the starting point in Step 3) is fine when the steps are fixed; supervisor LLM routing earns its keep when the steps depend on the data.
  • Specialists do not talk to each other. Every message passes through the supervisor. This is the hub-and-spoke topology. It prevents N² communication paths from sprawling into a chaotic graph and keeps the supervisor as the single source of truth for "what's been done."
  • Each specialist has its own conversation history. The researcher's messages and tool results never enter the writer's context. The writer sees only the supervisor's structured handoff. This is the context-isolation idea in code.
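
A minimal sketch of the last two points, with a stub in place of the model call. `StubSpecialist` and `hub` are hypothetical names, not part of the workshop code; the only thing the sketch demonstrates is that each specialist keeps its own history and receives nothing but the supervisor's handoff string.

```python
class StubSpecialist:
    """Stand-in for an Agent: records what it was shown, returns a canned reply."""

    def __init__(self, name, reply):
        self.name = name
        self.reply = reply
        self.history = []  # private to this specialist

    def run(self, task: str) -> str:
        self.history.append({"role": "user", "content": task})
        self.history.append({"role": "assistant", "content": self.reply})
        return self.reply


def hub(topic, researcher_agent, writer_agent):
    """Hub-and-spoke: the supervisor is the only path between specialists."""
    facts = researcher_agent.run(f"Research the topic: {topic}")
    # Structured handoff: the writer gets the facts, not the researcher's history.
    return writer_agent.run(f"Topic: {topic}\n\nFacts:\n{facts}\n\nWrite the brief.")


stub_researcher = StubSpecialist("researcher", "- fact A [source 1]")
stub_writer = StubSpecialist("writer", "Draft citing fact A [source 1].")
draft = hub("example topic", stub_researcher, stub_writer)

writer_saw = stub_writer.history[0]["content"]
print("Research the topic" in writer_saw)  # False: the researcher's task never reaches the writer
print("fact A" in writer_saw)              # True: only the handoff payload does
```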

Step 1: start with the kernel from Workshop 4

The same agent() function from Workshop 4 (raw httpx, while loop) is the kernel for each specialist here. Wrap it in a small class so you can give each specialist its own system prompt and tool set:

import httpx, json, os

API_URL = "https://api.anthropic.com/v1/messages"
HEADERS = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}

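class Agent:
    # The parameter is named `system` because the specialists in Step 2 pass system=...
    def __init__(self, name, system, tools, model="claude-sonnet-4-6"):
        self.name = name
        self.system = system         # system prompt for this specialist
        # Split each tool into its callable ("fn") and the schema the API sees,
        # without mutating the dicts the caller passed in.
        self.tool_fns = {t["name"]: t["fn"] for t in tools}  # registry
        self.tools = [{k: v for k, v in t.items() if k != "fn"} for t in tools]
        self.model = model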

    def run(self, task: str, max_turns: int = 8) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(max_turns):
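            payload = {
                "model": self.model,
                "max_tokens": 2048,
                "system": self.system,
                "messages": messages,
            }
            if self.tools:                # omit the key for tool-less specialists
                payload["tools"] = self.tools
            http_resp = httpx.post(API_URL, headers=HEADERS, timeout=60, json=payload)
            http_resp.raise_for_status()  # fail on 4xx/5xx here, not with a KeyError below
            resp = http_resp.json()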
            messages.append({"role": "assistant", "content": resp["content"]})
            if resp["stop_reason"] != "tool_use":
                return "".join(b["text"] for b in resp["content"] if b["type"] == "text")
            results = []
            for b in resp["content"]:
                if b["type"] == "tool_use":
                    out = self.tool_fns[b["name"]](**b["input"])
                    results.append({"type": "tool_result", "tool_use_id": b["id"], "content": str(out)})
            messages.append({"role": "user", "content": results})
        return "(max turns reached)"

This is roughly 30 lines and identical in shape to Workshop 4. The specialists below are just instances of this class with different system prompts and tools.
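
One gap worth closing before building on the kernel: a tool that raises, or a tool name the model invents, will crash `run()`. A hedged sketch of a safer dispatch step follows. `run_tool_calls` is a hypothetical helper, not something from Workshop 4; it reports failures back to the model as `tool_result` blocks with `is_error` set, so the model can recover instead of the loop dying.

```python
def run_tool_calls(content_blocks, tool_fns):
    """Execute every tool_use block; never raise, always answer each call."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue
        result = {"type": "tool_result", "tool_use_id": block["id"]}
        fn = tool_fns.get(block["name"])
        if fn is None:
            result.update(content=f"unknown tool: {block['name']}", is_error=True)
        else:
            try:
                result["content"] = str(fn(**block["input"]))
            except Exception as exc:  # surface the failure to the model
                result.update(content=f"{type(exc).__name__}: {exc}", is_error=True)
        results.append(result)
    return results


# Usage with hand-written blocks shaped like an API response's "content" list.
blocks = [
    {"type": "text", "text": "Let me search."},
    {"type": "tool_use", "id": "t1", "name": "echo", "input": {"query": "mcp"}},
    {"type": "tool_use", "id": "t2", "name": "missing", "input": {}},
]
tool_results = run_tool_calls(blocks, {"echo": lambda query: query.upper()})
print(tool_results)
```

In the kernel, the inner `for b in resp["content"]` loop could be replaced by one call to this helper.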

Step 2: build the three specialists

Each specialist gets a tight system prompt focused on one job. This is the prompt-isolation payoff - the researcher's prompt says nothing about writing style; the writer's prompt says nothing about searching.

# --- A pretend search tool. Plug in a real one for production. -------------
def web_search(query: str) -> list[dict]:
    """Pretend to search. Returns 3 results with title + snippet."""
    return [
        {"title": "...", "snippet": "...", "url": "..."},
        # ...
    ]

researcher = Agent(
    name="researcher",
    system=(
        "You are a research specialist. Given a topic, use the search tool "
        "to gather 5-8 factual snippets with citations. Return a bullet list "
        "of facts with [source] markers. Do not write prose; do not "
        "summarize. Just facts and citations."
    ),
    tools=[{
        "name": "web_search",
        "fn": web_search,
        "description": "Search the web for a query. Returns title, snippet, url.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }],
)

writer = Agent(
    name="writer",
    system=(
        "You are a writing specialist. Given a topic and a list of facts with "
        "citations, write a 500-word brief in clear, neutral prose. Cite "
        "every factual claim with [source]. Do not invent facts. If a fact "
        "is missing, say so explicitly rather than guessing."
    ),
    tools=[],
)

reviewer = Agent(
    name="reviewer",
    system=(
        "You are an editorial reviewer. Given a brief and the source facts it "
        "was drawn from, evaluate three criteria: (1) factual accuracy - every "
        "claim in the brief is supported by the facts; (2) coverage - the brief "
        "uses the important facts; (3) clarity - the brief reads cleanly. "
        "Reply with PASS or REVISE plus a 2-3 sentence justification. If "
        "REVISE, name the specific issues."
    ),
    tools=[],
)

Three specialists, three system prompts, two of them with no tools. The prompts are shorter than they would be in a single-agent design because each one has a narrower job.

Step 3: build the supervisor

The supervisor decides what to do next based on the state of the work. It can be deterministic (Workshop 4-style hardcoded sequence) or LLM-driven. Start deterministic; this is a research-write-review pipeline whose order is fixed.

def supervisor(topic: str) -> dict:
    """Run the research-write-review pipeline. Returns the final brief plus
    metadata about each stage."""
    state = {"topic": topic, "stages": []}

    # Stage 1: research
    facts = researcher.run(f"Research the topic: {topic}")
    state["facts"] = facts
    state["stages"].append({"agent": "researcher", "output_len": len(facts)})

    # Stage 2: write
    draft = writer.run(
        f"Topic: {topic}\n\nFacts to draw from:\n{facts}\n\nWrite the brief."
    )
    state["draft"] = draft
    state["stages"].append({"agent": "writer", "output_len": len(draft)})

    # Stage 3: review (and possibly loop back to writer)
    for revision in range(2):
        verdict = reviewer.run(
            f"Facts:\n{facts}\n\nDraft:\n{draft}\n\nEvaluate."
        )
        state["stages"].append({"agent": "reviewer", "verdict": verdict[:200]})
        if verdict.strip().upper().startswith("PASS"):
            state["final"] = draft
            return state
        # Reviewer said REVISE - send the draft + verdict back to the writer.
        draft = writer.run(
            f"Topic: {topic}\n\nFacts:\n{facts}\n\nPrevious draft:\n{draft}\n\n"
            f"Editorial feedback:\n{verdict}\n\nProduce a revised brief."
        )
        state["stages"].append({"agent": "writer (revision)", "output_len": len(draft)})

    state["final"] = draft
    return state
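
The `startswith("PASS")` check assumes the reviewer leads with the bare keyword. Models sometimes prepend markdown or a label, so a slightly more forgiving parser may be worth having. This is a sketch: `parse_verdict` is a hypothetical helper, it takes the first keyword it finds, and treating anything unrecognized as REVISE is a design assumption (fail closed).

```python
import re

def parse_verdict(text: str) -> str:
    """Return "PASS" or "REVISE" from the reviewer's reply.

    Looks for the first standalone PASS/REVISE keyword; anything else
    counts as REVISE so an unparseable reply never ships a draft.
    """
    match = re.search(r"\b(PASS|REVISE)\b", text.upper())
    return match.group(1) if match else "REVISE"


print(parse_verdict("PASS. All claims are supported."))         # PASS
print(parse_verdict("**Verdict: REVISE** - missing citation"))  # REVISE
print(parse_verdict("Looks fine to me"))                        # REVISE
```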

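Run it on a real topic. The listing above records each stage in state["stages"] but does not print anything; with a print added as each stage completes, the console output traces every stage: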

$ python supervisor.py "the history of MCP (Model Context Protocol)"
[researcher] running...    -> 1240 chars of facts
[writer]     running...    -> 2870 chars of draft
[reviewer]   running...    -> REVISE: missing citation for the 2024 launch date
[writer]     running...    -> 2950 chars of revised draft
[reviewer]   running...    -> PASS
final brief (2950 chars):
...

The graph fired. Three specialists, one supervisor, one revision loop, real output. Total cost on Claude Sonnet: about $0.30 for a single brief. Total wall-clock: about 25-40 seconds depending on the revision loop.
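
To check the cost figure on your own runs, you can sum the `usage` block that each Messages API response carries (`input_tokens`, `output_tokens`). The sketch below assumes you collect those blocks yourself, since the kernel in Step 1 discards them. The per-million-token prices are placeholder parameters, not quoted rates; substitute the current prices for your model.

```python
def estimate_cost(usages, usd_per_mtok_in, usd_per_mtok_out):
    """Sum token usage across calls and convert to dollars."""
    tokens_in = sum(u["input_tokens"] for u in usages)
    tokens_out = sum(u["output_tokens"] for u in usages)
    cost = tokens_in / 1e6 * usd_per_mtok_in + tokens_out / 1e6 * usd_per_mtok_out
    return {"input_tokens": tokens_in, "output_tokens": tokens_out, "usd": round(cost, 4)}


# Made-up usage numbers and prices, for illustration only.
usages = [
    {"input_tokens": 1200, "output_tokens": 400},
    {"input_tokens": 2000, "output_tokens": 900},
]
report = estimate_cost(usages, usd_per_mtok_in=1.0, usd_per_mtok_out=2.0)
print(report)
```

Running the same estimator over a single-agent baseline gives the cost half of the overhead comparison this workshop asks for.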

Step 4: parallelize the fan-out

The pipeline above is sequential. For tasks where multiple specialists can work independently on different subtasks, you fan them out in parallel.

Example: "compare the trade-offs of pgvector, Qdrant, and Pinecone." The supervisor dispatches three parallel research tasks (one per database), waits for all three, then hands the merged findings to the writer.

import concurrent.futures

def supervisor_parallel(topic: str, sub_topics: list[str]) -> str:
    if not sub_topics:
        # ThreadPoolExecutor rejects max_workers=0, so fail with a clearer message.
        raise ValueError("sub_topics must not be empty")

    # Fan-out: research each sub-topic in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(sub_topics)) as pool:
        # pool.map yields results in input order, so the merged text is stable
        # from run to run regardless of which call finishes first.
        findings = list(pool.map(lambda st: researcher.run(f"Research: {st}"), sub_topics))

    # Fan-in: write a comparison brief from all findings.
    merged = "\n\n".join(f"## {st}\n{f}" for st, f in zip(sub_topics, findings))
    return writer.run(f"Topic: {topic}\n\nFindings:\n{merged}\n\nWrite the brief.")

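Three parallel research calls = three concurrent API calls = roughly 1/3 the wall-clock time of doing them serially, rate limits permitting, because the total is bounded by the slowest call rather than the sum. For a real task with 10 sub-topics at about 6 seconds each, this is the difference between a 60-second response and a 6-second response.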
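
A quick way to see the fan-out effect without spending tokens is to time it with a stub. In the sketch below, `slow_research` is a hypothetical stand-in that sleeps instead of calling the API. `fan_out` also catches per-task exceptions so one failed sub-topic does not discard the others; that is an addition to the listing above, not part of it.

```python
import concurrent.futures
import time

def slow_research(sub_topic: str) -> str:
    """Stand-in for researcher.run: blocks like a network call would."""
    if sub_topic == "broken":
        raise RuntimeError("search backend down")
    time.sleep(0.2)
    return f"facts about {sub_topic}"

def fan_out(fn, items):
    """Run fn over items in threads; return {item: result or error string}."""
    if not items:
        return {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(items)) as pool:
        futures = {item: pool.submit(fn, item) for item in items}
    # Leaving the with-block waits for every task, so result() returns at once.
    out = {}
    for item, fut in futures.items():  # input order, not completion order
        try:
            out[item] = fut.result()
        except Exception as exc:
            out[item] = f"ERROR: {exc}"
    return out

start = time.perf_counter()
findings = fan_out(slow_research, ["pgvector", "Qdrant", "Pinecone", "broken"])
elapsed = time.perf_counter() - start
print(findings)
print(f"{elapsed:.2f}s elapsed; three 0.2s tasks run serially would take about 0.6s")
```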

Always parallelize independent specialist calls. The pattern is consistent enough that the production agent SDKs (Anthropic Agent SDK, OpenAI Agents SDK) provide ways to run independent agents concurrently.

Step 5: LLM-driven supervision (when the steps aren't fixed)

The deterministic supervisor above works when the pipeline shape is known up front (research → write → review). For open-ended tasks ("help me figure out what's wrong with my data pipeline"), you don't know which specialists to call in what order until you've started. The supervisor itself becomes an agent.

Pattern: define each specialist as a "handoff tool" the supervisor can call:

HANDOFF_TOOLS = [
    {
        "name": "handoff_to_researcher",
        "description": "Hand off a research question to the researcher. Use when you need facts you don't have.",
        "input_schema": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
    {
        "name": "handoff_to_writer",
        "description": "Hand off a writing task to the writer. Use when you have all the facts and need prose.",
        "input_schema": {
            "type": "object",
            "properties": {"topic": {"type": "string"}, "facts": {"type": "string"}},
            "required": ["topic", "facts"],
        },
    },
    # ... handoff_to_reviewer ...
    {
        "name": "finalize",
        "description": "Return the final answer to the user. Call when work is complete.",
        "input_schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
]

def handoff_to_researcher(question: str) -> str:
    return researcher.run(f"Research: {question}")

def handoff_to_writer(topic: str, facts: str) -> str:
    return writer.run(f"Topic: {topic}\n\nFacts:\n{facts}\n\nWrite.")

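Now the supervisor is itself an Agent with these handoff tools. To register them with the Agent class from Step 1, add each function under its schema's "fn" key (the registry expects one per tool), and treat a finalize call as the signal to stop looping and return the answer. Its system prompt explains the team and the workflow patterns; the supervisor model decides which specialist to call when. This is the topology Anthropic Agent SDK and OpenAI Agents SDK adopt natively, and it generalizes from "research-write-review" to "any team of specialists."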
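
The listing stops short of the loop that executes those handoffs. A hedged sketch of that loop follows, with the model replaced by a scripted `decide` callable so it runs offline. `run_supervisor`, `decide`, and the handler lambdas are hypothetical; in a real build `decide` would be a Messages API call carrying the handoff tool schemas, and each handler would call the corresponding specialist's `run()`.

```python
def run_supervisor(decide, handlers, task, max_handoffs=6):
    """Loop: ask the model for the next tool call, run it, feed the result back.

    `decide(transcript)` returns {"name": tool_name, "input": {...}}.
    The special tool "finalize" ends the loop and returns its answer.
    """
    transcript = [("user", task)]
    for _ in range(max_handoffs):
        call = decide(transcript)
        if call["name"] == "finalize":
            return call["input"]["answer"]
        handler = handlers.get(call["name"])
        if handler is None:
            result = f"unknown tool: {call['name']}"
        else:
            result = handler(**call["input"])
        transcript.append((call["name"], result))
    return "(handoff cap reached)"


# Scripted stand-in for the supervisor model: research, then write, then finalize.
def scripted_decide(transcript):
    last_tool, last_result = transcript[-1]
    if last_tool == "user":
        return {"name": "handoff_to_researcher", "input": {"question": last_result}}
    if last_tool == "handoff_to_researcher":
        return {"name": "handoff_to_writer", "input": {"topic": "MCP", "facts": last_result}}
    return {"name": "finalize", "input": {"answer": last_result}}

handlers = {
    "handoff_to_researcher": lambda question: f"- fact about {question} [source]",
    "handoff_to_writer": lambda topic, facts: f"Brief on {topic}: {facts}",
}
answer = run_supervisor(scripted_decide, handlers, "history of MCP")
print(answer)  # Brief on MCP: - fact about history of MCP [source]
```

The `max_handoffs` cap is the supervisor-level guard against the delegation loops discussed in Step 6.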

Step 6: break it (the three classic multi-agent failure modes)

6.1 Infinite delegation

The reviewer says REVISE every single time. The writer keeps revising. The supervisor keeps looping. Without a cap, you've spent $50 on one brief. Always cap revisions. The example above hardcodes range(2). Production code uses a Budget (see Workshop 4) shared across all specialists.
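
Workshop 4's `Budget` is not reproduced here, so the following is a hypothetical stand-in that shows the idea: one object, shared by every specialist, that each call must charge before it runs. The class name, its fields, and the `BudgetExceeded` exception are assumptions for illustration. The lock matters once Step 4's threads share the same budget.

```python
import threading

class BudgetExceeded(RuntimeError):
    pass

class Budget:
    """Shared cap on LLM calls across the supervisor and all specialists."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0
        self._lock = threading.Lock()

    def charge(self, who: str) -> None:
        """Record one call, or raise if the cap is already spent."""
        with self._lock:
            if self.calls >= self.max_calls:
                raise BudgetExceeded(f"{who}: budget of {self.max_calls} calls spent")
            self.calls += 1


def revise_until_pass(review, revise, draft, budget):
    """Review/revise loop bounded by the budget instead of a fixed range()."""
    try:
        while True:
            budget.charge("reviewer")
            if review(draft) == "PASS":
                return draft, "passed"
            budget.charge("writer")
            draft = revise(draft)
    except BudgetExceeded:
        return draft, "budget exhausted"


# A reviewer that never passes: the loop still terminates.
budget = Budget(max_calls=5)
final, status = revise_until_pass(lambda d: "REVISE", lambda d: d + "+", "draft", budget)
print(final, status, budget.calls)  # draft++ budget exhausted 5
```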

6.2 The lost handoff

The researcher returned findings, the supervisor passed them to the writer, but the writer's prompt mentioned "Topic: " without the facts. The writer hallucinates facts because the structured handoff was malformed. Inspect handoff payloads in logs. Production supervisors log every handoff with the full payload; without this, "the writer keeps hallucinating" is undebuggable.
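
One way to catch a malformed handoff before the writer sees it is to build every payload through a function that validates and logs it. A sketch, assuming the standard `logging` module and a hypothetical `build_writer_handoff` helper; the prompt layout mirrors the one used in Step 3.

```python
import json
import logging

logger = logging.getLogger("handoffs")

def build_writer_handoff(topic: str, facts: str) -> str:
    """Validate, log, and render the supervisor -> writer handoff."""
    payload = {"to": "writer", "topic": topic.strip(), "facts": facts.strip()}
    missing = [k for k in ("topic", "facts") if not payload[k]]
    if missing:
        # Fail loudly here rather than let the writer invent facts.
        raise ValueError(f"handoff to writer is missing: {', '.join(missing)}")
    logger.info("handoff %s", json.dumps(payload))  # full payload, for debugging
    return (
        f"Topic: {payload['topic']}\n\n"
        f"Facts to draw from:\n{payload['facts']}\n\nWrite the brief."
    )


prompt = build_writer_handoff("example topic", "- fact A [source 1]")
print(prompt)

try:
    build_writer_handoff("example topic", "   ")
except ValueError as err:
    print(err)  # handoff to writer is missing: facts
```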

6.3 The runaway specialist

The researcher decides it needs to search 30 times to be thorough, burning the budget. A writer that has been given tools decides to make 20 tool calls "for clarity." Each specialist needs its own max_turns budget enforced in the kernel (see Workshop 4). The supervisor enforces a global budget across all specialists; specialists enforce local budgets within themselves. Both layers are necessary.
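
`max_turns` bounds model turns, but a single turn can contain several tool calls. If you also want a hard cap on tool invocations per specialist, one option is to wrap the tool function. `capped` below is a hypothetical helper; returning a refusal string instead of raising is a design assumption, chosen so the model sees the limit and can wrap up with what it has.

```python
import functools

def capped(fn, limit: int):
    """Wrap a tool so that calls beyond `limit` return a refusal string."""
    count = 0

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        nonlocal count
        if count >= limit:
            return f"tool call limit ({limit}) reached; answer with what you have"
        count += 1
        return fn(*args, **kwargs)

    return wrapper


def fake_search(query: str) -> str:
    """Stand-in for a real search tool."""
    return f"results for {query}"

search = capped(fake_search, limit=2)
outputs = [search(query=f"q{i}") for i in range(4)]
print(outputs)
```

The counter lives as long as the wrapper does, so create a fresh wrapper per run rather than sharing one across tasks.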

Step 7: when this pattern earns its complexity

Multi-agent is not always the right answer. Honest criteria for picking it:

Use multi-agent when:

  • The sub-tasks have meaningfully different optimal prompts (research vs. write vs. review is a clean split; "answer this question" vs. "answer this question politely" is not).
  • The work benefits from context isolation - the writer doesn't need to see the search noise; the reviewer doesn't need to see the writer's thought process.
  • Parallel execution is on the table (multiple independent research questions, multiple files to review). Multi-agent is the unlock.
  • The system needs swappable specialists - different research backends, different writing styles - and you want each as a separate concern.

Stick to single-agent when:

  • The sub-tasks share most of their context anyway.
  • The "specialists" are just different prompts with no meaningful capability difference.
  • The coordination cost (extra LLM calls, extra latency, extra failure modes) exceeds the prompt-isolation benefit. For a 2-step task this is often the case.
  • You can't yet articulate what each specialist does differently in one sentence. If you can't, the split is premature.

The Anthropic and OpenAI Agent SDKs both bias toward multi-agent in their documentation because it shows off the framework. As a rough rule of thumb rather than a measured figure, expect the right mix in production to be closer to "70% single-agent with tools, 25% multi-agent with 2-3 specialists, 5% larger." Don't reach for the complex pattern because it's available; reach for it because the simple one ran out.

Now extend it

  1. Add a "memory" specialist. A separate agent whose job is to maintain a long-term notes file (markdown, on disk). Other specialists can query and update it via handoff. The supervisor uses it to remember context across sessions.
  2. Swap the supervisor to use Anthropic Agent SDK. Anthropic's SDK provides first-class supervisor + handoff abstractions, with built-in tracing and parallel handoff support. Compare LOC, performance, and debuggability against your hand-rolled version.
  3. Add quality gates. The reviewer's PASS/REVISE is one gate. Add others: a fact-checker that cross-references claims against a database, a style checker that runs a regex, a length enforcer. Each gate is another specialist that vetoes the work; the supervisor coordinates the loop.
  4. Add user-in-the-loop checkpoints. Some tasks shouldn't fully automate. Have the supervisor pause after the research stage to show the user the facts, with a "looks right? proceed" prompt before invoking the writer.
  5. Trace the whole thing. Workshop 9 wires OpenTelemetry into agent code; doing it here lets you see the supervisor's decisions, every handoff, every specialist's turn count and cost, in one trace UI. This is what production multi-agent debugging looks like.

What you might wonder

"Why not just give one Claude all three roles in one prompt?" You can; it works for short tasks. As the conversation grows past a few thousand tokens of mixed research / draft / review, the model's attention drifts and quality drops measurably. Multi-agent buys back focus by keeping each agent's context small. The break-even point varies by model strength - stronger models tolerate bigger mixed contexts, so the multi-agent benefit shrinks. Sonnet-class models still benefit at the kind of tasks this workshop covers.

"How do specialists share state if they can't talk to each other?" Through the supervisor. Every piece of state is in the supervisor's hands; specialists receive structured inputs and produce structured outputs, but they don't carry persistent state themselves. For state that outlives a single supervisor invocation (across sessions), use a database or vector store the supervisor reads/writes.

"What about agent frameworks - CrewAI, AutoGen, LangGraph?" They wrap the same pattern with different opinions. CrewAI emphasizes role-playing personas. AutoGen emphasizes multi-agent conversation as a metaphor. LangGraph emphasizes the workflow as a state machine. None of them add capability you can't build yourself in ~100 lines; they save you the wiring. Pick the one whose abstractions match your mental model, or pick the Anthropic/OpenAI Agent SDK if you want the minimum framework over the minimum protocol.

"How do I evaluate a multi-agent system?" The same way you evaluate any system: a golden set of (task, expected output) pairs and a metric. The complication is that "expected output" for a long brief is fuzzy; use LLM-as-judge on a rubric ("does the brief cover the key facts? is it under 500 words? does it cite every claim?") rather than exact match. Cost and latency are separate metrics that multi-agent often regresses - track them too.

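Some items in the evaluation rubric do not need a judge model at all. The sketch below covers the two mechanical ones (length and citation coverage) and leaves "covers the key facts" to an LLM judge. `check_brief` is a hypothetical helper, and the rule that every sentence must contain a `[...]` marker is a simplifying assumption about what "cites every claim" means.

```python
import re

def check_brief(brief: str, max_words: int = 500) -> dict:
    """Mechanical rubric checks: word count and per-sentence citation markers."""
    words = len(brief.split())
    # Naive sentence split: break after ., !, or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", brief.strip()) if s]
    uncited = [s for s in sentences if not re.search(r"\[[^\]]+\]", s)]
    return {
        "words": words,
        "under_limit": words <= max_words,
        "uncited_sentences": uncited,
        "all_cited": not uncited,
    }


sample = "Fact one holds [source 1]. Fact two has no marker."
brief_report = check_brief(sample)
print(brief_report)
```

Run over a golden set, the same function gives a cheap regression signal before any judge model is called.
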
"When does the supervisor become a bottleneck?" When every handoff routes through it, you're serialized on the supervisor's LLM call. Mitigations: parallel fan-out (step 4), letting specialists do small tool calls without consulting the supervisor, or a tree-of-supervisors topology for very large teams (~10+ specialists). For 3-5 specialists, the simple hub-and-spoke is fine.

What this gave you

  • You built a 3-specialist + 1-supervisor multi-agent system (~150 lines on top of the Workshop 4 kernel).
  • You watched the graph fire on a real task with stage-by-stage logging.
  • You parallelized independent specialist calls and saw the wall-clock win (roughly 3x with three sub-topics, more as the fan-out widens).
  • You saw both deterministic (fixed pipeline) and LLM-driven (handoff tools) supervisor styles.
  • You know the three classic multi-agent failure modes (infinite delegation, lost handoff, runaway specialist) and how to defend against each.
  • You have honest criteria for when to pick this pattern vs. a single-agent design.

The bigger transfer: most "multi-agent" systems in production are 2-3 agents with clearly different jobs, coordinated by a thin supervisor. The headline architectures with dozens of agents in elaborate topologies are mostly research papers; the workhorses are small teams. Build the small team well first.

Next: Workshop 8 - Streaming agent with mid-stream tool use, where tokens arrive as the model thinks and tool calls fire while text is still streaming.

Submit your build

When you finish this workshop, share what you built so others can see and learn from your work. Include:

  • Public repo with your supervisor + 3 specialists code
  • Log showing the orchestration graph firing on a real task (stage transitions, parallel fan-out, revision loops)
  • Cost + latency comparison vs. a single-agent baseline on the same task
  • Short note (3 to 5 sentences) on whether your task actually justified multi-agent or if a single agent would have done as well
