Month 4-Week 2: Tool use, streaming, prompt caching¶
Week summary¶
- Goal: Add tool use (3+ tools), response streaming, and prompt caching to your project. Cut cost by >50% on common patterns. Begin cost accounting per interaction.
- Time: ~9 h over 3 sessions.
- Output: Project with multi-step tool use, streaming UX, caching, cost dashboards.
- Sequences relied on: 09-llm-application-engineering rungs 04, 05, 06; 04-python-for-ml rungs 08, 09.
Why this week matters¶
These three primitives separate "demo apps" from "production-ready" apps:
- Tool use unlocks agents and grounding: the LLM can fetch data instead of hallucinating it.
- Streaming unlocks UX: users see output start in <1 s instead of waiting for the full response.
- Prompt caching unlocks unit economics: production-grade systems cut cost by 60–90% with caching.
Skipping any of these limits what your Q2 anchor can become. Mastering them now compounds for the rest of Q2.
Prerequisites¶
- M04-W01 complete (schemas, retries, async).
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): tool use deep dive
- Session B-Sat morning (~3.5 h): streaming + caching
- Session C-Sun afternoon (~2.5 h): cost accounting + ship
Session A-Tool use, multi-step¶
Goal: Implement a multi-step tool-use loop with at least 3 tools. Watch the LLM chain calls.
Part 1-Tool use mental model (45 min)¶
A tool-use turn:
1. You send a request with tools=[...].
2. The model responds with either a final answer, or a tool_use block requesting a tool call.
3. You execute the tool, send back a tool_result.
4. The model sees the result and either calls another tool or produces the final answer.
5. Loop until the model produces a final answer or the max-step cap is exceeded.
Read Anthropic's tool use guide: docs.anthropic.com/claude/docs/tool-use.
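To make steps 2–3 concrete, here is roughly what the message shapes look like (the toolu_... id and the tool arguments are illustrative placeholders; the API returns its own values):

# Step 2: the model's response content includes a tool_use block, e.g.:
# {"type": "tool_use", "id": "toolu_abc123", "name": "query_metrics",
#  "input": {"service": "checkout", "metric": "error_rate", "time_range_minutes": 60}}

# Step 3: you send back a user message carrying a matching tool_result block:
tool_result_message = {
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": "toolu_abc123",  # must match the tool_use block's id
            "content": "error_rate=4.2% (baseline 0.3%)",
        }
    ],
}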
Part 2-Define 3 tools (60 min)¶
For incident-triage:
# src/triage/tools.py
TOOLS = [
    {
        "name": "query_metrics",
        "description": "Query time-series metrics for a service over a time range.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "metric": {"type": "string", "enum": ["latency_p95", "error_rate", "throughput", "cpu_pct"]},
                "time_range_minutes": {"type": "integer", "minimum": 1, "maximum": 1440},
            },
            "required": ["service", "metric", "time_range_minutes"],
        },
    },
    {
        "name": "get_recent_deploys",
        "description": "Get recent deployments for a service.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "since_minutes": {"type": "integer", "minimum": 1, "maximum": 1440},
            },
            "required": ["service", "since_minutes"],
        },
    },
    {
        "name": "query_logs",
        "description": "Query log lines for a service matching a substring.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "query": {"type": "string"},
                "limit": {"type": "integer", "default": 50},
            },
            "required": ["service", "query"],
        },
    },
]
Implement mock data sources for each tool. (Real integrations would replace these later.)
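A minimal sketch of those mocks plus the TOOL_REGISTRY dispatch table that the loop in Part 3 relies on (the return payloads are invented placeholders):

# src/triage/tools.py (continued)
def query_metrics(service: str, metric: str, time_range_minutes: int):
    # Mock: canned series. A real integration would query your metrics store.
    return {"service": service, "metric": metric, "points": [0.3, 0.4, 4.2]}

def get_recent_deploys(service: str, since_minutes: int):
    return [{"service": service, "sha": "abc123", "minutes_ago": 12}]

def query_logs(service: str, query: str, limit: int = 50):
    return [f"{service}: ERROR timeout connecting to db"][:limit]

# Maps tool names (as declared in TOOLS) to their implementations.
TOOL_REGISTRY = {
    "query_metrics": query_metrics,
    "get_recent_deploys": get_recent_deploys,
    "query_logs": query_logs,
}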
Part 3-The loop (75 min)¶
# src/triage/loop.py
import anthropic

from .tools import TOOLS, TOOL_REGISTRY

client = anthropic.Anthropic()
SYSTEM_PROMPT = "..."  # your triage system prompt

def tool_use_loop(initial_message: str, max_steps: int = 8):
    messages = [{"role": "user", "content": initial_message}]
    for step in range(max_steps):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=TOOLS,
            system=SYSTEM_PROMPT,
            messages=messages,
        )
        # Append the assistant's full response (may include text + tool_use blocks)
        messages.append({"role": "assistant", "content": resp.content})
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return resp  # done: final answer
        # Execute each tool, append results
        tool_results = []
        for tu in tool_uses:
            try:
                result = TOOL_REGISTRY[tu.name](**tu.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tu.id,
                    "content": str(result),
                })
            except Exception as e:
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tu.id,
                    "content": f"ERROR: {e}",
                    "is_error": True,
                })
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError(f"max_steps={max_steps} exceeded")
Test on a multi-step incident. Watch the LLM call get_recent_deploys, then query_metrics, then produce a final answer. Print the trace.
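One hypothetical way to print that trace, walking the accumulated messages list (assistant blocks are SDK objects while our tool_result entries are plain dicts, hence the two access styles):

def print_trace(messages):
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):
            print(f"[{msg['role']}] {content[:120]}")
            continue
        for block in content:
            btype = getattr(block, "type", None) or block["type"]
            if btype == "text":
                print(f"[assistant] {block.text[:120]}")
            elif btype == "tool_use":
                print(f"[assistant] tool_use: {block.name}({block.input})")
            elif btype == "tool_result":
                print(f"[user] tool_result for {block['tool_use_id']}")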
Output of Session A¶
- 3 tools defined and implemented (with mock data).
- Tool-use loop with max-step cap.
- Test case showing multi-step reasoning.
Session B-Streaming + prompt caching¶
Goal: Stream responses to the user. Add prompt caching to cut cost.
Part 1-Streaming (75 min)¶
Streaming uses Server-Sent Events (SSE). Anthropic's SDK exposes it as an async iterator:
import anthropic

# Streaming here uses the async client.
client = anthropic.AsyncAnthropic()

async def triage_stream(incident_description: str):
    async with client.messages.stream(
        model="claude-opus-4-7",
        max_tokens=4096,
        tools=TOOLS,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": incident_description}],
    ) as stream:
        async for event in stream:
            if event.type == "content_block_delta" and event.delta.type == "text_delta":
                yield event.delta.text  # token chunks
            elif event.type == "content_block_start" and event.content_block.type == "tool_use":
                yield f"\n[calling tool: {event.content_block.name}]\n"
Test in a CLI:
import asyncio

async def main():
    async for chunk in triage_stream("..."):
        print(chunk, end="", flush=True)

asyncio.run(main())
You should see text appearing token-by-token.
Part 2-Prompt caching (90 min)¶
Anthropic prompt caching: mark large stable prefixes (system prompt, examples, long context) for caching. After the first call, subsequent calls reuse the cached prefix: cache reads are billed at ~10% of the base input-token price, and latency drops by roughly 30–80% on cache hits.
SYSTEM_PROMPT_BLOCKS = [
    {
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,  # detailed instructions, taxonomy, etc.
        "cache_control": {"type": "ephemeral"},  # mark for caching
    },
    {
        "type": "text",
        "text": FEW_SHOT_EXAMPLES,  # 5-10 worked examples
        "cache_control": {"type": "ephemeral"},
    },
]

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=SYSTEM_PROMPT_BLOCKS,
    messages=[...],
)
print(resp.usage)
# First call: cache_creation_input_tokens > 0
# Subsequent calls: cache_read_input_tokens > 0, input_tokens decreases
Cache hit rate test: call the same prompt 10 times. Verify hits via usage.cache_read_input_tokens. Expected: first call creates cache; next 9 hit.
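A sketch of that test, assuming a synchronous client and the SYSTEM_PROMPT_BLOCKS defined above (all calls must land within the cache TTL, see Part 3):

def cache_hit_test(n: int = 10):
    hits = 0
    for i in range(n):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=64,
            system=SYSTEM_PROMPT_BLOCKS,
            messages=[{"role": "user", "content": "Summarize the triage taxonomy."}],
        )
        u = resp.usage
        if (getattr(u, "cache_read_input_tokens", 0) or 0) > 0:
            hits += 1
        print(f"call {i}: created={u.cache_creation_input_tokens} read={u.cache_read_input_tokens}")
    print(f"cache hit rate: {hits}/{n}")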
Part 3-Read the docs deeper (15 min)¶
Read Anthropic prompt caching docs in full: docs.anthropic.com/claude/docs/prompt-caching.
Note constraints:
- Cache TTL: 5 minutes (or 1 hour for some plans).
- Min cacheable size: ~1024 tokens (varies by model).
- Order matters: cached blocks must come first.
Output of Session B¶
- Streaming working in CLI.
- Prompt caching with verified hit rate.
Session C-Cost accounting and ship¶
Goal: Per-interaction cost accounting. Updated README with performance section.
Part 1-Cost logger (60 min)¶
# src/triage/cost.py
PRICING = {
    "claude-opus-4-7": {
        "input": 15.0,  # USD per 1M tokens
        "cache_write": 18.75,
        "cache_read": 1.5,
        "output": 75.0,
    },
}

def compute_cost(usage, model: str) -> float:
    p = PRICING[model]
    return (
        usage.input_tokens * p["input"] / 1e6
        + usage.cache_creation_input_tokens * p["cache_write"] / 1e6
        + usage.cache_read_input_tokens * p["cache_read"] / 1e6
        + usage.output_tokens * p["output"] / 1e6
    )

# Aggregate over a session
class CostTracker:
    def __init__(self):
        self.total = 0.0
        self.calls = []

    def record(self, usage, model: str):
        c = compute_cost(usage, model)
        self.calls.append({"model": model, "cost": c, "usage": usage})
        self.total += c

    def report(self):
        return {"total_usd": self.total, "n_calls": len(self.calls)}  # ... extend with per-model breakdowns
Plumb the tracker through every LLM call; one lightweight way is sketched below.
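A hypothetical thin wrapper that records cost as a side effect of every call:

tracker = CostTracker()

def tracked_create(**kwargs):
    # Same arguments as client.messages.create; records usage before returning.
    resp = client.messages.create(**kwargs)
    tracker.record(resp.usage, kwargs["model"])
    return resp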
Part 2-Latency measurement (45 min)¶
For each call, capture:
- time_to_first_token (TTFT)-useful for streaming UX.
- total_latency_ms.
- tokens_per_second.
import time

start = time.time()
first_token_at = None
async for chunk in stream:
    if first_token_at is None:
        first_token_at = time.time()
    ...
end = time.time()

ttft_ms = (first_token_at - start) * 1000
total_ms = (end - start) * 1000
Aggregate p50, p95 over a batch of calls.
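A sketch using only the standard library (assumes ttft_samples_ms is the list of TTFT measurements collected over the batch):

import statistics

def p50_p95(samples_ms: list[float]) -> tuple[float, float]:
    # quantiles(n=100) returns the 99 percentile cut points; index 49 is p50, 94 is p95.
    q = statistics.quantiles(samples_ms, n=100)
    return q[49], q[94]

p50, p95 = p50_p95(ttft_samples_ms)
print(f"p50 TTFT: {p50:.0f} ms | p95 TTFT: {p95:.0f} ms")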
Part 3-README performance section (45 min)¶
Update README with a "Cost & Performance" section:
## Cost & Performance (10-incident batch, claude-opus-4-7, with caching)
| Metric | Value |
|---|---|
| Avg cost per incident | $0.018 |
| p50 TTFT | 870 ms |
| p95 TTFT | 1240 ms |
| Cache hit rate | 92% (warmed) |
| Avg input tokens | 4200 |
| Avg output tokens | 380 |
Numbers will change next week (when you add evals and refine prompts), but the discipline of publishing real numbers starts now.
Push to v0.2.0.
Output of Session C¶
- Cost tracker integrated.
- Latency metrics captured.
- README with Performance section.
End-of-week artifact¶
- 3-tool tool use loop with multi-step reasoning observed
- Streaming working end-to-end
- Prompt cache hit rate >70% on warmed calls
- Cost dashboard with per-call breakdown
- v0.2.0 tagged
End-of-week self-assessment¶
- I can write a tool-use loop from a blank file.
- I can explain prompt caching mechanics.
- I can quote my project's $/interaction without checking notes.
Common failure modes for this week¶
- Tools too vague. "search" instead of "search_logs(service, query, time_range)". Specific tools win.
- Caching skipped because "premature optimization." It isn't. It's table-stakes for production cost.
- Cost numbers left as approximations. Track real usage; real numbers, published consistently, compound into user trust.
What's next (preview of M04-W03)¶
Eval foundations. Build a 50-example golden set. Set up heuristic checks + LLM-as-judge. Validate your judge against human labels. This is the discipline that defines your Q3 specialty.