Month 4-Week 2: Tool use, streaming, prompt caching

Week summary

  • Goal: Add tool use (3+ tools), response streaming, and prompt caching to your project. Cut cost by >50% on common patterns. Begin cost accounting per interaction.
  • Time: ~9 h over 3 sessions.
  • Output: Project with multi-step tool use, streaming UX, caching, cost dashboards.
  • Sequences relied on: 09-llm-application-engineering rungs 04, 05, 06; 04-python-for-ml rungs 08, 09.

Why this week matters

These three primitives separate "demo apps" from "production-ready" apps:

  • Tool use unlocks agents and grounding: the LLM can fetch data instead of hallucinating it.
  • Streaming unlocks UX: users see output start in <1 s instead of waiting for the full response.
  • Prompt caching unlocks unit economics: production-grade systems cut cost by 60–90% with caching.

Skipping any of these limits what your Q2 anchor can become. Mastering them now compounds for the rest of Q2.

Prerequisites

  • M04-W01 complete (schemas, retries, async).

Schedule

  • Session A-Tue/Wed evening (~3 h): tool use deep dive
  • Session B-Sat morning (~3.5 h): streaming + caching
  • Session C-Sun afternoon (~2.5 h): cost accounting + ship

Session A-Tool use, multi-step

Goal: Implement a multi-step tool-use loop with at least 3 tools. Watch the LLM chain calls.

Part 1-Tool use mental model (45 min)

A tool-use turn:

  1. You send a request with tools=[...].
  2. The model responds with either a final answer or a tool_use block requesting a tool call.
  3. You execute the tool and send back a tool_result.
  4. The model sees the result and either calls another tool or produces the final answer.
  5. Loop until done or the max-step cap is exceeded.

Read Anthropic's tool use guide: docs.anthropic.com/claude/docs/tool-use.

Part 2-Define 3 tools (60 min)

For incident-triage:

# src/triage/tools.py
TOOLS = [
    {
        "name": "query_metrics",
        "description": "Query time-series metrics for a service over a time range.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "metric": {"type": "string", "enum": ["latency_p95", "error_rate", "throughput", "cpu_pct"]},
                "time_range_minutes": {"type": "integer", "minimum": 1, "maximum": 1440},
            },
            "required": ["service", "metric", "time_range_minutes"],
        },
    },
    {
        "name": "get_recent_deploys",
        "description": "Get recent deployments for a service.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "since_minutes": {"type": "integer", "minimum": 1, "maximum": 1440},
            },
            "required": ["service", "since_minutes"],
        },
    },
    {
        "name": "query_logs",
        "description": "Query log lines for a service matching a substring.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "query": {"type": "string"},
                "limit": {"type": "integer", "default": 50},
            },
            "required": ["service", "query"],
        },
    },
]

Implement mock data sources for each tool, plus a registry mapping tool name to callable so the loop can dispatch calls. (Real integrations would replace the mocks later.)
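
A minimal sketch of what those mocks and the registry might look like (the mock values and the injected anomaly are placeholders, not part of the project):

# src/triage/tools.py (continued) -- illustrative mocks only; real integrations replace them later
import random


def query_metrics(service: str, metric: str, time_range_minutes: int) -> dict:
    # Mock: flat series with one synthetic spike near the end so the model has something to find.
    points = [round(random.uniform(50, 80), 1) for _ in range(10)]
    points[-2] = 450.0
    return {"service": service, "metric": metric, "minutes": time_range_minutes, "points": points}


def get_recent_deploys(service: str, since_minutes: int) -> list[dict]:
    # Mock: one deploy that landed shortly before the incident window.
    return [{"service": service, "sha": "abc123", "minutes_ago": min(since_minutes, 22)}]


def query_logs(service: str, query: str, limit: int = 50) -> list[str]:
    # Mock: a handful of matching error lines.
    return [f"{service}: ERROR upstream timeout ({query})"] * min(limit, 3)


# The loop in Part 3 dispatches through this name -> callable map.
TOOL_REGISTRY = {
    "query_metrics": query_metrics,
    "get_recent_deploys": get_recent_deploys,
    "query_logs": query_logs,
}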

Part 3-The loop (75 min)

# src/triage/loop.py
import anthropic

from .tools import TOOLS, TOOL_REGISTRY  # TOOL_REGISTRY maps tool name -> callable (see Part 2)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
SYSTEM_PROMPT = "..."  # your triage system prompt


def tool_use_loop(initial_message: str, max_steps: int = 8):
    messages = [{"role": "user", "content": initial_message}]
    for step in range(max_steps):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=TOOLS,
            system=SYSTEM_PROMPT,
            messages=messages,
        )
        # Append the assistant's full response (may include text + tool_use blocks)
        messages.append({"role": "assistant", "content": resp.content})

        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return resp  # done-final answer

        # Execute each tool, append results
        tool_results = []
        for tu in tool_uses:
            try:
                result = TOOL_REGISTRY[tu.name](**tu.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tu.id,
                    "content": str(result),
                })
            except Exception as e:
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tu.id,
                    "content": f"ERROR: {e}",
                    "is_error": True,
                })
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError(f"max_steps={max_steps} exceeded")

Test on a multi-step incident. Watch the LLM call get_recent_deploys, then query_metrics, then produce a final answer. Print the trace.
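
A minimal way to watch the chain (the incident text is made up, and the inline print is a hypothetical addition to the loop above):

# Hypothetical instrumentation: inside the `for tu in tool_uses:` loop, before dispatching, add
#     print(f"[step {step}] {tu.name}({tu.input})")
resp = tool_use_loop(
    "checkout-api p95 latency tripled at 14:02 UTC and error rate is climbing. What changed?"
)
print("".join(b.text for b in resp.content if b.type == "text"))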

Output of Session A

  • 3 tools defined and implemented (with mock data).
  • Tool-use loop with max-step cap.
  • Test case showing multi-step reasoning.

Session B-Streaming + prompt caching

Goal: Stream responses to the user. Add prompt caching to cut cost.

Part 1-Streaming (75 min)

Streaming uses Server-Sent Events (SSE). Anthropic's SDK exposes it as an async iterator:

# Assumes an async client: client = anthropic.AsyncAnthropic()
async def triage_stream(incident_description: str):
    async with client.messages.stream(
        model="claude-opus-4-7",
        max_tokens=4096,
        tools=TOOLS,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": incident_description}],
    ) as stream:
        async for event in stream:
            if event.type == "content_block_delta" and event.delta.type == "text_delta":
                yield event.delta.text  # token chunks
            elif event.type == "content_block_start" and event.content_block.type == "tool_use":
                yield f"\n[calling tool: {event.content_block.name}]\n"

Test in a CLI:

import asyncio

async def main():
    async for chunk in triage_stream("..."):
        print(chunk, end="", flush=True)

asyncio.run(main())

You should see text appearing token-by-token.
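
One detail worth noting for Session C: a streamed call never hands you resp.usage directly, but the SDK can assemble the final message once the stream ends. A sketch, assuming the same client and prompts as above (triage_once is a hypothetical helper name):

async def triage_once(incident: str) -> None:
    async with client.messages.stream(
        model="claude-opus-4-7",
        max_tokens=4096,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": incident}],
    ) as stream:
        async for text in stream.text_stream:       # convenience iterator over text deltas only
            print(text, end="", flush=True)
        final = await stream.get_final_message()    # full Message, including .usage for cost tracking
    print(final.usage)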

Part 2-Prompt caching (90 min)

Anthropic prompt caching: mark large, stable prefixes (system prompt, examples, long context) for caching. After the first call, subsequent calls reuse the cached prefix at roughly 10% of the normal input-token price, and latency drops to roughly 30–80% of the uncached call.

SYSTEM_PROMPT_BLOCKS = [
    {
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,  # detailed instructions, taxonomy, etc.
        "cache_control": {"type": "ephemeral"},  # mark for caching
    },
    {
        "type": "text",
        "text": FEW_SHOT_EXAMPLES,  # 5-10 worked examples
        "cache_control": {"type": "ephemeral"},
    },
]

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=SYSTEM_PROMPT_BLOCKS,
    messages=[...],
)
print(resp.usage)
# First call: cache_creation_input_tokens > 0
# Subsequent calls: cache_read_input_tokens > 0, input_tokens decreases

Cache hit rate test: call the same prompt 10 times. Verify hits via usage.cache_read_input_tokens. Expected: first call creates cache; next 9 hit.
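
A rough sketch of that check, reusing the client and SYSTEM_PROMPT_BLOCKS from above (the ping message is arbitrary):

creation_total, read_total = 0, 0
for i in range(10):
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=256,
        system=SYSTEM_PROMPT_BLOCKS,
        messages=[{"role": "user", "content": "Summarize the triage taxonomy in one line."}],
    )
    created = resp.usage.cache_creation_input_tokens or 0   # cache fields can be None
    read = resp.usage.cache_read_input_tokens or 0
    creation_total += created
    read_total += read
    print(f"call {i}: cache_created={created} cache_read={read}")

print(f"cache writes: {creation_total} tokens, cache reads: {read_total} tokens")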

Part 3-Read the docs deeper (15 min)

Read Anthropic prompt caching docs in full: docs.anthropic.com/claude/docs/prompt-caching.

Note the constraints:

  • Cache TTL: 5 minutes (or 1 hour on some plans).
  • Minimum cacheable size: ~1024 tokens (varies by model).
  • Order matters: cached blocks must come first in the prompt.

Output of Session B

  • Streaming working in CLI.
  • Prompt caching with verified hit rate.

Session C-Cost accounting and ship

Goal: Per-interaction cost accounting. Updated README with performance section.

Part 1-Cost logger (60 min)

# src/triage/cost.py
PRICING = {
    "claude-opus-4-7": {
        "input": 15.0,    # per 1M tokens
        "cache_write": 18.75,
        "cache_read": 1.5,
        "output": 75.0,
    },
}

def compute_cost(usage, model: str) -> float:
    p = PRICING[model]
    # Cache token fields can be None when caching is not used on a call.
    return (
        usage.input_tokens                        * p["input"]       / 1e6
      + (usage.cache_creation_input_tokens or 0)  * p["cache_write"] / 1e6
      + (usage.cache_read_input_tokens or 0)      * p["cache_read"]  / 1e6
      + usage.output_tokens                       * p["output"]      / 1e6
    )

# Aggregate over a session
class CostTracker:
    def __init__(self):
        self.total = 0.0
        self.calls = []
    def record(self, usage, model: str):
        c = compute_cost(usage, model)
        self.calls.append({"model": model, "cost": c, "usage": usage})
        self.total += c
    def report(self):
        # Extend with whatever breakdown you need (per-model, cache savings, ...).
        return {"total_usd": self.total, "n_calls": len(self.calls)}

Plumb through every LLM call.
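
For example, wrapping a single call might look like this (the incident text is a placeholder; client and SYSTEM_PROMPT_BLOCKS come from Session B):

tracker = CostTracker()

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=SYSTEM_PROMPT_BLOCKS,
    messages=[{"role": "user", "content": "checkout-api error rate spiked at 14:02 UTC"}],
)
tracker.record(resp.usage, "claude-opus-4-7")

print(tracker.report())   # e.g. {"total_usd": 0.0182, "n_calls": 1}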

Part 2-Latency measurement (45 min)

For each call, capture:

  • time_to_first_token (TTFT): useful for streaming UX.
  • total_latency_ms.
  • tokens_per_second.

import time

start = time.time()
first_token_at = None
async for chunk in stream:
    if first_token_at is None:
        first_token_at = time.time()
    ...
end = time.time()
ttft_ms = (first_token_at - start) * 1000
total_ms = (end - start) * 1000

Aggregate p50, p95 over a batch of calls.
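
One way to do the aggregation with only the standard library (the sample latencies below are stand-ins for your measured batch):

import statistics

def latency_summary(samples_ms: list[float]) -> dict:
    # quantiles(n=20) returns 19 cut points: index 9 is ~p50, index 18 is ~p95
    qs = statistics.quantiles(samples_ms, n=20)
    return {"p50_ms": round(qs[9], 1), "p95_ms": round(qs[18], 1), "n": len(samples_ms)}

print(latency_summary([870, 910, 650, 1240, 980, 720, 1100, 830, 905, 770]))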

Part 3-README performance section (45 min)

Update README with a "Cost & Performance" section:

## Cost & Performance (10-incident batch, claude-opus-4-7, with caching)

| Metric | Value |
|---|---|
| Avg cost per incident | $0.018 |
| p50 TTFT | 870 ms |
| p95 TTFT | 1240 ms |
| Cache hit rate | 92% (warmed) |
| Avg input tokens | 4200 |
| Avg output tokens | 380 |

Numbers will change next week (when you add evals and refine prompts), but the discipline of publishing real numbers starts now.

Push to v0.2.0.

Output of Session C

  • Cost tracker integrated.
  • Latency metrics captured.
  • README with Performance section.

End-of-week artifact

  • 3-tool tool use loop with multi-step reasoning observed
  • Streaming working end-to-end
  • Prompt cache hit rate >70% on warmed calls
  • Cost dashboard with per-call breakdown
  • v0.2.0 tagged

End-of-week self-assessment

  • I can write a tool-use loop from a blank file.
  • I can explain prompt caching mechanics.
  • I can quote my project's $/interaction without checking notes.

Common failure modes for this week

  • Tools too vague. "search" instead of "search_logs(service, query, time_range)". Specific tools win.
  • Caching skipped because "premature optimization." It isn't. It's table-stakes for production cost.
  • Cost numbers treated as approximations. Track real usage; accurate numbers compound into user trust.

What's next (preview of M04-W03)

Eval foundations. Build a 50-example golden set. Set up heuristic checks + LLM-as-judge. Validate your judge against human labels. This is the discipline that defines your Q3 specialty.
