Month 4-Week 2: Tool use, streaming, prompt caching¶
Week summary¶
- Goal: Add tool use (3+ tools), response streaming, and prompt caching to your project. Cut cost by >50% on common patterns. Begin cost accounting per interaction.
- Time: ~9 h over 3 sessions.
- Output: Project with multi-step tool use, streaming UX, caching, cost dashboards.
- Sequences relied on: 09-llm-application-engineering rungs 04, 05, 06; 04-python-for-ml rungs 08, 09.
Why this week matters¶
These three primitives separate "demo apps" from "production-ready" apps:
- Tool use unlocks agents and grounding: the LLM can fetch data instead of hallucinating it.
- Streaming unlocks UX: users see output start in <1 s instead of waiting for the full response.
- Prompt caching unlocks unit economics: production-grade systems cut cost by 60–90% with caching.
Skipping any of these limits what your Q2 anchor can become. Mastering them now compounds for the rest of Q2.
Prerequisites¶
- M04-W01 complete (schemas, retries, async).
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): tool use deep dive
- Session B-Sat morning (~3.5 h): streaming + caching
- Session C-Sun afternoon (~2.5 h): cost accounting + ship
Session A-Tool use, multi-step¶
Goal: Implement a multi-step tool-use loop with at least 3 tools. Watch the LLM chain calls.
Part 1-Tool use mental model (45 min)¶
A tool-use turn:
1. You send a request with tools=[...].
2. The model responds with either a final answer, or a tool_use block requesting a tool call.
3. You execute the tool, send back a tool_result.
4. The model sees the result and either calls another tool or produces the final answer.
5. Loop until the model produces a final answer or the max-step cap is exceeded.
Read Anthropic's tool use guide: docs.anthropic.com/claude/docs/tool-use.
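To make steps 2–3 concrete, here is roughly what the message shapes look like (the toolu_... id and the tool arguments are illustrative placeholders; the API returns its own values):

# Step 2: the model's response content includes a tool_use block, e.g.:
# {"type": "tool_use", "id": "toolu_abc123", "name": "query_metrics",
#  "input": {"service": "checkout", "metric": "error_rate", "time_range_minutes": 60}}

# Step 3: you send back a user message carrying a matching tool_result block:
tool_result_message = {
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": "toolu_abc123",  # must match the tool_use block's id
            "content": "error_rate=4.2% (baseline 0.3%)",
        }
    ],
}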
Part 2-Define 3 tools (60 min)¶
For incident-triage:
# src/triage/tools.py
TOOLS = [
    {
        "name": "query_metrics",
        "description": "Query time-series metrics for a service over a time range.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "metric": {"type": "string", "enum": ["latency_p95", "error_rate", "throughput", "cpu_pct"]},
                "time_range_minutes": {"type": "integer", "minimum": 1, "maximum": 1440},
            },
            "required": ["service", "metric", "time_range_minutes"],
        },
    },
    {
        "name": "get_recent_deploys",
        "description": "Get recent deployments for a service.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "since_minutes": {"type": "integer", "minimum": 1, "maximum": 1440},
            },
            "required": ["service", "since_minutes"],
        },
    },
    {
        "name": "query_logs",
        "description": "Query log lines for a service matching a substring.",
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "query": {"type": "string"},
                "limit": {"type": "integer", "default": 50},
            },
            "required": ["service", "query"],
        },
    },
]
Implement mock data sources for each tool. (Real integrations would replace these later.)
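A minimal sketch of those mocks plus the TOOL_REGISTRY dispatch table that the loop in Part 3 relies on (the return payloads are invented placeholders):

# src/triage/tools.py (continued)
def query_metrics(service: str, metric: str, time_range_minutes: int):
    # Mock: canned series. A real integration would query your metrics store.
    return {"service": service, "metric": metric, "points": [0.3, 0.4, 4.2]}

def get_recent_deploys(service: str, since_minutes: int):
    return [{"service": service, "sha": "abc123", "minutes_ago": 12}]

def query_logs(service: str, query: str, limit: int = 50):
    return [f"{service}: ERROR timeout connecting to db"][:limit]

# Maps tool names (as declared in TOOLS) to their implementations.
TOOL_REGISTRY = {
    "query_metrics": query_metrics,
    "get_recent_deploys": get_recent_deploys,
    "query_logs": query_logs,
}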
Part 3-The loop (75 min)¶
# src/triage/loop.py
import anthropic

from .tools import TOOLS, TOOL_REGISTRY

client = anthropic.Anthropic()
SYSTEM_PROMPT = "..."  # your triage system prompt

def tool_use_loop(initial_message: str, max_steps: int = 8):
    messages = [{"role": "user", "content": initial_message}]
    for step in range(max_steps):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=TOOLS,
            system=SYSTEM_PROMPT,
            messages=messages,
        )
        # Append the assistant's full response (may include text + tool_use blocks)
        messages.append({"role": "assistant", "content": resp.content})
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return resp  # done: final answer
        # Execute each tool, append results
        tool_results = []
        for tu in tool_uses:
            try:
                result = TOOL_REGISTRY[tu.name](**tu.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tu.id,
                    "content": str(result),
                })
            except Exception as e:
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tu.id,
                    "content": f"ERROR: {e}",
                    "is_error": True,
                })
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError(f"max_steps={max_steps} exceeded")
Test on a multi-step incident. Watch the LLM call get_recent_deploys, then query_metrics, then produce a final answer. Print the trace.
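One hypothetical way to print that trace, walking the accumulated messages list (assistant blocks are SDK objects while our tool_result entries are plain dicts, hence the two access styles):

def print_trace(messages):
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):
            print(f"[{msg['role']}] {content[:120]}")
            continue
        for block in content:
            btype = getattr(block, "type", None) or block["type"]
            if btype == "text":
                print(f"[assistant] {block.text[:120]}")
            elif btype == "tool_use":
                print(f"[assistant] tool_use: {block.name}({block.input})")
            elif btype == "tool_result":
                print(f"[user] tool_result for {block['tool_use_id']}")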
Output of Session A¶
- 3 tools defined and implemented (with mock data).
- Tool-use loop with max-step cap.
- Test case showing multi-step reasoning.
Session B-Streaming + prompt caching¶
Goal: Stream responses to the user. Add prompt caching to cut cost.
Part 1-Streaming (75 min)¶
Streaming uses Server-Sent Events (SSE). Anthropic's SDK exposes it as an async iterator:
import anthropic

# Streaming here uses the async client.
client = anthropic.AsyncAnthropic()

async def triage_stream(incident_description: str):
    async with client.messages.stream(
        model="claude-opus-4-7",
        max_tokens=4096,
        tools=TOOLS,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": incident_description}],
    ) as stream:
        async for event in stream:
            if event.type == "content_block_delta" and event.delta.type == "text_delta":
                yield event.delta.text  # token chunks
            elif event.type == "content_block_start" and event.content_block.type == "tool_use":
                yield f"\n[calling tool: {event.content_block.name}]\n"
Test in a CLI:
import asyncio

async def main():
    async for chunk in triage_stream("..."):
        print(chunk, end="", flush=True)

asyncio.run(main())
You should see text appearing token-by-token.
Part 2-Prompt caching (90 min)¶
Anthropic prompt caching: mark large stable prefixes (system prompt, examples, long context) for caching. After the first call, subsequent calls reuse the cached prefix: cache reads are billed at ~10% of the base input-token price, and latency drops by roughly 30–80% on cache hits.
SYSTEM_PROMPT_BLOCKS = [
    {
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,  # detailed instructions, taxonomy, etc.
        "cache_control": {"type": "ephemeral"},  # mark for caching
    },
    {
        "type": "text",
        "text": FEW_SHOT_EXAMPLES,  # 5-10 worked examples
        "cache_control": {"type": "ephemeral"},
    },
]

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=SYSTEM_PROMPT_BLOCKS,
    messages=[...],
)
print(resp.usage)
# First call: cache_creation_input_tokens > 0
# Subsequent calls: cache_read_input_tokens > 0, input_tokens decreases
Cache hit rate test: call the same prompt 10 times. Verify hits via usage.cache_read_input_tokens. Expected: first call creates cache; next 9 hit.
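A sketch of that test, assuming a synchronous client and the SYSTEM_PROMPT_BLOCKS defined above (all calls must land within the cache TTL, see Part 3):

def cache_hit_test(n: int = 10):
    hits = 0
    for i in range(n):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=64,
            system=SYSTEM_PROMPT_BLOCKS,
            messages=[{"role": "user", "content": "Summarize the triage taxonomy."}],
        )
        u = resp.usage
        if (getattr(u, "cache_read_input_tokens", 0) or 0) > 0:
            hits += 1
        print(f"call {i}: created={u.cache_creation_input_tokens} read={u.cache_read_input_tokens}")
    print(f"cache hit rate: {hits}/{n}")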
Part 3-Read the docs deeper (15 min)¶
Read Anthropic prompt caching docs in full: docs.anthropic.com/claude/docs/prompt-caching.
Note constraints:
- Cache TTL: 5 minutes (or 1 hour for some plans).
- Min cacheable size: ~1024 tokens (varies by model).
- Order matters: cached blocks must come first.
Output of Session B¶
- Streaming working in CLI.
- Prompt caching with verified hit rate.
Session C-Cost accounting and ship¶
Goal: Per-interaction cost accounting. Updated README with performance section.
Part 1-Cost logger (60 min)¶
# src/triage/cost.py
PRICING = {
    "claude-opus-4-7": {
        "input": 15.0,  # USD per 1M tokens
        "cache_write": 18.75,
        "cache_read": 1.5,
        "output": 75.0,
    },
}

def compute_cost(usage, model: str) -> float:
    p = PRICING[model]
    return (
        usage.input_tokens * p["input"] / 1e6
        + usage.cache_creation_input_tokens * p["cache_write"] / 1e6
        + usage.cache_read_input_tokens * p["cache_read"] / 1e6
        + usage.output_tokens * p["output"] / 1e6
    )

# Aggregate over a session
class CostTracker:
    def __init__(self):
        self.total = 0.0
        self.calls = []

    def record(self, usage, model: str):
        c = compute_cost(usage, model)
        self.calls.append({"model": model, "cost": c, "usage": usage})
        self.total += c

    def report(self):
        return {"total_usd": self.total, "n_calls": len(self.calls)}  # ... extend with per-model breakdowns
Plumb the tracker through every LLM call; one lightweight way is sketched below.
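A hypothetical thin wrapper that records cost as a side effect of every call:

tracker = CostTracker()

def tracked_create(**kwargs):
    # Same arguments as client.messages.create; records usage before returning.
    resp = client.messages.create(**kwargs)
    tracker.record(resp.usage, kwargs["model"])
    return resp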
Part 2-Latency measurement (45 min)¶
For each call, capture:
- time_to_first_token (TTFT)-useful for streaming UX.
- total_latency_ms.
- tokens_per_second.
import time

start = time.time()
first_token_at = None
async for chunk in stream:
    if first_token_at is None:
        first_token_at = time.time()
    ...
end = time.time()

ttft_ms = (first_token_at - start) * 1000
total_ms = (end - start) * 1000
Aggregate p50, p95 over a batch of calls.
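A sketch using only the standard library (assumes ttft_samples_ms is the list of TTFT measurements collected over the batch):

import statistics

def p50_p95(samples_ms: list[float]) -> tuple[float, float]:
    # quantiles(n=100) returns the 99 percentile cut points; index 49 is p50, 94 is p95.
    q = statistics.quantiles(samples_ms, n=100)
    return q[49], q[94]

p50, p95 = p50_p95(ttft_samples_ms)
print(f"p50 TTFT: {p50:.0f} ms | p95 TTFT: {p95:.0f} ms")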
Part 3-README performance section (45 min)¶
Update README with a "Cost & Performance" section:
## Cost & Performance (10-incident batch, claude-opus-4-7, with caching)
| Metric | Value |
|---|---|
| Avg cost per incident | $0.018 |
| p50 TTFT | 870 ms |
| p95 TTFT | 1240 ms |
| Cache hit rate | 92% (warmed) |
| Avg input tokens | 4200 |
| Avg output tokens | 380 |
Numbers will change next week (when you add evals and refine prompts), but the discipline of publishing real numbers starts now.
Push to v0.2.0.
Output of Session C¶
- Cost tracker integrated.
- Latency metrics captured.
- README with Performance section.
End-of-week artifact¶
- 3-tool tool use loop with multi-step reasoning observed
- Streaming working end-to-end
- Prompt cache hit rate >70% on warmed calls
- Cost dashboard with per-call breakdown
- v0.2.0 tagged
End-of-week self-assessment¶
- I can write a tool-use loop from a blank file.
- I can explain prompt caching mechanics.
- I can quote my project's $/interaction without checking notes.
Common failure modes for this week¶
- Tools too vague. "search" instead of "search_logs(service, query, time_range)". Specific tools win.
- Caching skipped because "premature optimization." It isn't. It's table-stakes for production cost.
- Cost numbers left as approximations. Track real usage; real numbers, published consistently, compound into user trust.
What's next (preview of M04-W03)¶
Eval foundations. Build a 50-example golden set. Set up heuristic checks + LLM-as-judge. Validate your judge against human labels. This is the discipline that defines your Q3 specialty.