Deep Dive 09-LLM Observability

The chapter where your SRE instincts become an AI-engineering superpower.


0. Orientation: why this chapter is the moat

Most engineers entering the AI space come from one of two directions. The data-scientist path arrives fluent in models and statistics but vague about p99s, dashboards, alert routing, and the unglamorous mechanics of keeping a service alive at 3 a.m. The web-dev path arrives fluent in shipping features but unfamiliar with the special pathologies of non-deterministic systems where "correct" is fuzzy and the cost per request swings by 50x.

You are coming from a third direction-backend / SRE with a Bamboo + Datadog plugin background-and that direction is, right now, the rarest and most leveraged. The teams shipping LLM features in 2026 are flooded with prototype code and starved for people who understand SLIs, error budgets, trace propagation, cardinality discipline, and what a healthy on-call rotation looks like. The leap from "I instrument services" to "I instrument LLM-powered services" is smaller than the leap most candidates have to make. This chapter exists to convert that latent advantage into something explicit, demonstrable, and portfolio-ready.

The thesis of this document: LLM observability is observability with five new failure modes layered on top. If you internalize the new failure modes and translate your existing patterns to them, you can walk into any AI-product team and immediately be the most valuable person in the room on questions of reliability, cost, and debuggability.

We will build up the picture from first principles, derive everything (no hand-waving), and end with concrete exercises that produce artifacts you can put on a resume.


1. Why LLM observability is different from traditional observability

Traditional service observability rests on a set of unstated assumptions that mostly hold for CRUD systems and fail in characteristic ways for LLM systems. Naming the assumptions explicitly makes the differences crisp.

1.1 Determinism is gone

In a traditional service, identical inputs produce identical outputs (modulo clock and randomness you control). When a request fails, you can usually replay it and reproduce the failure. The replay is a foundational debugging primitive.

LLM calls are non-deterministic by default. Even at temperature=0, providers do not guarantee bit-identical outputs across calls-backend routing changes, model versions are revised silently, batching dynamics shift token sampling. A bug report that says "the model said something dumb at 14:32" cannot be reproduced by re-running the same prompt; the model may now produce a perfectly fine response. This breaks the replay primitive and forces a different debugging stance: you must capture the exact input and output at the moment of the failure, because you cannot get them back.

Implication: storage and trace fidelity matter more than they did before. A trace that says "LLM call took 4.2s and returned an error" without the actual prompt and response is nearly useless.

1.2 Quality is graded, not boolean

A traditional 200 response is "correct"-the service did the thing it said it would do. A 500 is wrong. There is no middle.

LLM responses occupy a continuous quality spectrum. A response can be technically successful (HTTP 200, finish_reason=stop), syntactically valid (parseable JSON), and still be wrong (hallucinated a field, picked the wrong customer, leaked a PII fragment). This means the HTTP layer's status code is not the truth. You must add a separate quality signal-typically derived from evals on a sampled subset of production traffic, or from downstream user signals (thumbs-up, retry rate, conversion).

Implication: error rate alone is a misleading SLI. You will need both api_success_rate and output_quality_score (the latter sampled), and they will sometimes diverge dramatically.

1.3 Cost varies dramatically per request

Traditional services have nearly-flat per-request cost (CPU-bound, predictable). LLM requests have wildly variable cost driven by token counts. A single user can issue a 200-token request that costs $0.001 and, ten seconds later, paste a 50,000-token document that costs $0.50. The 500x spread is not an outlier-it's the median day.

Implication: cost becomes a first-class signal alongside latency. You need cost-per-request, cost-per-feature, cost-per-tenant, and you need them visible in the same dashboards as latency and errors. Many production incidents in 2025–2026 are not "the service is down" but "the bill tripled overnight"-and the on-call engineer who can isolate the offending feature in five minutes is the on-call engineer who gets promoted.

1.4 The prompt is part of the service

In a traditional service, the deployable artifact is the binary or container. You version it with git, you deploy it through CI, you can roll back.

In an LLM service, the prompt template is just as much part of the runtime behavior as the code, but it is often stored separately (in a YAML file, a database, a feature flag service) and edited by people who are not engineers. A 50-character change to a system prompt can change quality, cost, and refusal rate by 30%. This means prompt versioning is a deployment event and observability must treat it that way: every span should record which prompt-template version produced it, and your dashboards must let you slice by that version.

Implication: a prompt.template.id and prompt.template.sha tag on every span is non-negotiable. Without it, you cannot answer "did latency change because of code or because of the prompt?".

1.5 Fan-out: one user request becomes a tree

A traditional request is mostly linear: ingress → service → DB → response. The trace is a near-line.

A modern agentic LLM request fans out. One user message triggers an LLM call, which requests three tool calls (one of which is itself an LLM call to summarize a document), each of which may retrieve from a vector store, each of which is an embedding call, then a planner LLM call decides whether to loop. A single user request can produce 5–50 spans across multiple LLM providers and tool services.

Implication: trace structure matters more than ever. You cannot understand performance or correctness from individual spans; you must see the tree. Your tooling must support span trees with depth ≥ 4 by default, and your engineers must read them fluently.

1.6 Summary of differences

Property            Traditional     LLM
Determinism         Mostly          No
Correctness         Boolean         Graded
Per-request cost    ~Flat           50–500x spread
Service artifact    Code            Code + prompt
Trace shape         Line            Tree, depth 4+

Every section that follows is an answer to one of these differences. Keep this table near you.


2. The four golden signals, LLM edition

The Google SRE book canonized latency, traffic, errors, saturation as the four golden signals-the minimum set of indicators that, if monitored, will catch most user-visible failures. The framing is durable because it is grounded in user experience: each signal corresponds to a way the user notices the service is unhealthy.

For LLM services, each signal needs to be re-derived. The names stay the same; the metrics inside them change.

2.1 Latency, three numbers not one

For a synchronous HTTP service, "latency" is one number: time from request received to response sent. For a streaming LLM call, that single number conceals the experience. Users care about three different things:

  • TTFT-time to first token. From request issued to the first token arriving in the client. This is what dictates whether the chat UI feels responsive. Below ~500ms feels instant; above ~2s feels broken.
  • TPOT-time per output token. Steady-state token-generation rate after the first token. This dictates whether long answers feel painful. ~30 tokens/sec (33ms/token) is a comfortable read; below 10 tokens/sec is grating.
  • Total response latency. TTFT + (output_tokens × TPOT). The bottom line for non-streaming use cases.

You will measure all three. Most providers report TTFT either explicitly or implicitly (you can compute it from your client). TPOT requires capturing chunk arrival timestamps. Total latency is end-to-end and easy.

# Sketch-capture the three latency signals around a streaming call.
import time

def call_with_latency_capture(client, **kwargs):
    t_start = time.perf_counter()
    t_first = None
    chunk_times = []
    output_text_parts = []

    with client.messages.stream(**kwargs) as stream:
        for event in stream:
            now = time.perf_counter()
            if event.type == "content_block_delta":
                if t_first is None:
                    t_first = now
                chunk_times.append(now)
                output_text_parts.append(event.delta.text)
        final = stream.get_final_message()

    t_end = time.perf_counter()
    output_tokens = final.usage.output_tokens
    ttft = (t_first - t_start) if t_first else None
    total = t_end - t_start
    tpot = ((t_end - t_first) / max(output_tokens - 1, 1)) if t_first and output_tokens > 1 else None
    return final, {"ttft_s": ttft, "tpot_s": tpot, "total_s": total, "output_tokens": output_tokens}

The SRE bridge: in your prior work you tracked request_duration_seconds as a histogram. Do the same here, but emit three histograms with appropriate names: llm_ttft_seconds, llm_tpot_seconds, llm_total_latency_seconds. SLOs (section 12) target the first two for streaming UIs and the third for batch/non-streaming work.

2.2 Traffic

Traffic is twofold for LLM services. Requests-per-second is the familiar half. Tokens-per-second is the new half-it is the actual capacity-relevant unit because providers throttle on tokens, not requests, and your bill is denominated in tokens.

Track:

  • llm_requests_total (counter, by provider/model/feature)
  • llm_input_tokens_total (counter, by provider/model/feature)
  • llm_output_tokens_total (counter, by provider/model/feature)
  • llm_cache_read_tokens_total (counter, where applicable-Anthropic, OpenAI cached)

A traffic dashboard that shows requests-per-second only, with no token-per-second panel, is hiding half the picture. You cannot diagnose a provider rate-limit incident with requests-per-second alone if the cause is one feature suddenly sending 10x larger inputs.
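
A minimal sketch of the counter set in OTel-Python (metric names follow this chapter's conventions; provider, model, feature, and resp are assumed to be in scope as in the earlier snippets):

from opentelemetry import metrics

meter = metrics.get_meter("myapp.llm")
requests_counter = meter.create_counter("llm_requests_total")
input_tokens_counter = meter.create_counter("llm_input_tokens_total")
output_tokens_counter = meter.create_counter("llm_output_tokens_total")
cache_read_tokens_counter = meter.create_counter("llm_cache_read_tokens_total")

# After each call completes:
attrs = {"provider": provider, "model": model, "feature": feature}
requests_counter.add(1, attrs)
input_tokens_counter.add(resp.usage.input_tokens, attrs)
output_tokens_counter.add(resp.usage.output_tokens, attrs)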

2.3 Errors

The error category fractures into three sub-buckets, and the buckets behave differently. Treating them as one signal will cause you to miss incidents.

  • API errors. HTTP 4xx (bad request, auth, content policy), 5xx (provider down), 429 (rate limit). These are the familiar shape.
  • Validation errors. The call returned 200 but the output failed your structural validation-failed JSON parse, missing required field, wrong enum value. This is uniquely an LLM problem; in a traditional service the schema is enforced server-side.
  • Guardrail rejections. Either provider-side (finish_reason=content_filter) or your-side (a downstream policy classifier flagged the response). Your error budget probably tolerates a small constant rate of these; a sudden spike means an upstream input distribution shift or a prompt regression.

Emit them as separate counters, not as labels on a single counter. You want different alert thresholds, different runbooks, and often different on-call rotations (provider outages page the on-call engineer; validation-error spikes page the prompt owner).
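
A sketch of the three separate counters (names are this chapter's conventions, not a spec; the example attribute values are illustrative):

from opentelemetry import metrics

meter = metrics.get_meter("myapp.llm")
api_error_counter = meter.create_counter("llm_api_errors_total")
validation_error_counter = meter.create_counter("llm_validation_errors_total")
guardrail_rejection_counter = meter.create_counter("llm_guardrail_rejections_total")

# Three counters, three alert policies, three runbooks:
api_error_counter.add(1, {"provider": provider, "status_code": 429})
validation_error_counter.add(1, {"feature": feature, "reason": "json_parse"})
guardrail_rejection_counter.add(1, {"feature": feature, "source": "provider"})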

2.4 Saturation

For traditional services, saturation is CPU, memory, disk, file descriptors. For LLM services, the resources you can saturate are different:

  • Provider quota-requests-per-minute and tokens-per-minute, set per API key or per organization. The rate-limit headers expose your remaining budget; surface them as gauges (see the sketch below).
  • Concurrency limit-many providers cap concurrent in-flight requests per key. When you hit the cap, latency spikes (queueing) before any errors appear.
  • Queue depth-if you put a queue between your service and the provider (often wise), its depth is a saturation signal.
  • GPU memory-only for self-hosted models. If you're not self-hosting, ignore.

The SRE-bridge here is direct: your existing instinct to alert on "we are at 80% of capacity" applies verbatim. The capacity dimension is just different.
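
A sketch of surfacing rate-limit headroom as observable gauges. It assumes OpenAI-style x-ratelimit-* response headers; Anthropic exposes the same information under anthropic-ratelimit-*-remaining. Verify header names against your provider's docs:

from opentelemetry import metrics

meter = metrics.get_meter("myapp.llm")
_headroom = {"requests": 0, "tokens": 0}

def _observe_requests(options):
    yield metrics.Observation(_headroom["requests"], {"provider": "openai"})

def _observe_tokens(options):
    yield metrics.Observation(_headroom["tokens"], {"provider": "openai"})

meter.create_observable_gauge("llm_ratelimit_requests_remaining", callbacks=[_observe_requests])
meter.create_observable_gauge("llm_ratelimit_tokens_remaining", callbacks=[_observe_tokens])

def record_rate_limit_headers(headers):
    # Update the last-seen headroom; the gauges report it on each collection cycle.
    if "x-ratelimit-remaining-requests" in headers:
        _headroom["requests"] = int(headers["x-ratelimit-remaining-requests"])
    if "x-ratelimit-remaining-tokens" in headers:
        _headroom["tokens"] = int(headers["x-ratelimit-remaining-tokens"])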

2.5 Why the framing matters

The deeper point is not the specific metrics. It is that insisting on the four-golden-signals framing forces parity between LLM and non-LLM observability in your org. AI teams left to their own devices will invent ad-hoc dashboards with model-specific jargon ("perplexity over time", "logprob distributions") that no on-call engineer can read at 3 a.m. The four-golden-signals framing keeps the dashboards legible to everyone who already does on-call. This legibility is the bridge.


3. OpenTelemetry GenAI semantic conventions

Standards exist because the alternative-every team inventing its own attribute names-produces tooling that cannot be shared, dashboards that cannot be ported between providers, and engineers who have to relearn the conventions every time they change jobs. OpenTelemetry's GenAI semantic conventions (the gen_ai.* attribute namespace) are the emerging standard. The spec is evolving (status: experimental as of 2025), but the skeleton is durable.

Adopt the conventions. The cost is small (it's just attribute naming); the benefit is that everything downstream-Tempo, Datadog LLM Observability, Langfuse, Phoenix, OpenLLMetry-already knows how to render gen_ai.* spans without custom dashboards.

3.1 The span attributes

For a single LLM call span, these are the attributes that should be set. The grouping below is functional, not part of the spec.

Identity of the call

  • gen_ai.system - provider identifier. Examples: anthropic, openai, vertex_ai, bedrock, azure, cohere.
  • gen_ai.request.model - the model the caller asked for, e.g. claude-3-5-sonnet-20241022.
  • gen_ai.response.model - the model that actually served the call. May differ when providers route across versions or alias names.
  • gen_ai.response.id - provider's request id. Critical for support tickets.

Request parameters

  • gen_ai.request.max_tokens
  • gen_ai.request.temperature
  • gen_ai.request.top_p
  • gen_ai.request.top_k
  • gen_ai.request.frequency_penalty, gen_ai.request.presence_penalty (where applicable)
  • gen_ai.request.stop_sequences (array)

Response shape

  • gen_ai.response.finish_reasons - array, typically one element. Values: stop, length, tool_calls, content_filter, error.

Token usage

  • gen_ai.usage.input_tokens
  • gen_ai.usage.output_tokens
  • gen_ai.usage.cache_read_input_tokens - cached prompt tokens (Anthropic prompt caching, OpenAI cached input).
  • gen_ai.usage.cache_creation_input_tokens - tokens written into cache on this call.

Operation type

  • gen_ai.operation.name - typically chat, completion, embedding, tool_call.

The span name should be <operation> <model>, e.g. chat claude-3-5-sonnet-20241022. This lets the trace UI show the operation at a glance.

3.2 Span events for prompts and completions

The conventions lean toward putting prompt and completion content into events rather than attributes, for two reasons. First, attribute values have size limits in many backends (Tempo's default is 32KB). Second, content is the most privacy-sensitive payload; making it an event means it can be filtered out at the collector for some pipelines and kept for others.

Event names:

  • gen_ai.system.message - the system prompt.
  • gen_ai.user.message - a user turn.
  • gen_ai.assistant.message - an assistant turn (input history).
  • gen_ai.choice - a generated choice (in streaming, one event per chunk is excessive; prefer one event per completed choice with the full text, plus metrics for streaming dynamics).
  • gen_ai.tool.message - tool result fed back to the model.

Each event carries the content as an attribute (content) plus role-specific fields (e.g. tool_call_id).

3.3 Tool calls

Tool calls produce their own spans, child of the LLM call that requested them. Attributes:

  • gen_ai.tool.name
  • gen_ai.tool.call.id - id assigned by the model so results can be matched.
  • Tool arguments as a span event with the JSON payload.
  • Tool results as a span event with the JSON payload (redacted as needed).

3.4 What the conventions don't cover (yet)

The spec has gaps you will need to fill with custom attributes. Use a stable namespace like app.llm.* for these so you don't collide with future additions.

  • app.llm.feature - your application's notion of which user-facing feature issued this call. (The conventions assume a single application; in practice you have many features.)
  • app.llm.prompt.template.id and app.llm.prompt.template.sha - version identity for the prompt template.
  • app.llm.experiment.variant - A/B variant if you have one.
  • app.llm.tenant.id - organization or account, for multi-tenant SaaS. Not user id; that's high-cardinality (see section 5).

3.5 Code skeleton: a span produced correctly

from opentelemetry import trace

tracer = trace.get_tracer("myapp.llm")

def call_anthropic(client, model, system, messages, feature, prompt_id, prompt_sha, tenant):
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", 1024)
        span.set_attribute("gen_ai.request.temperature", 0.0)
        span.set_attribute("app.llm.feature", feature)
        span.set_attribute("app.llm.prompt.template.id", prompt_id)
        span.set_attribute("app.llm.prompt.template.sha", prompt_sha)
        span.set_attribute("app.llm.tenant.id", tenant)

        # Optionally record prompt content as events (subject to redaction policy)
        span.add_event("gen_ai.system.message", {"content": system})
        for m in messages:
            span.add_event(f"gen_ai.{m['role']}.message", {"content": m["content"]})

        try:
            resp = client.messages.create(
                model=model, system=system, messages=messages, max_tokens=1024, temperature=0.0,
            )
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            raise

        span.set_attribute("gen_ai.response.model", resp.model)
        span.set_attribute("gen_ai.response.id", resp.id)
        span.set_attribute("gen_ai.response.finish_reasons", [resp.stop_reason or "stop"])
        span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)
        if hasattr(resp.usage, "cache_read_input_tokens"):
            span.set_attribute("gen_ai.usage.cache_read_input_tokens",
                               getattr(resp.usage, "cache_read_input_tokens", 0))
        span.add_event("gen_ai.choice", {
            "content": "".join(b.text for b in resp.content if b.type == "text"),
            "index": 0,
        })
        return resp

This function, replicated as a decorator (Exercise 1), is the single most leveraged code you will write in your first month on an AI team.


4. Span design for LLM applications

Span design is the part most teams get wrong. The two failure modes: (a) one giant span per user request, with all the LLM details mashed into attributes-useless for analyzing fan-out; (b) a span per token chunk in streaming-overwhelms the backend, blows up cost, makes the trace tree unreadable. Get this right and your traces will be self-explanatory; get it wrong and you'll spend years rebuilding instrumentation.

4.1 The shape

The principles, derived from the trace-as-tree property in section 1.5:

  1. One span per LLM call. Parent context = the request handler (or higher-level workflow); span = the single call to a single model.
  2. One span per tool call. Child of the LLM call that requested it. The model decides; the tool span records the actual call execution.
  3. One span per retrieval step. RAG: embedding span (child of LLM call or workflow), vector search span (sibling), context-assembly span (sibling). These compose into a recognizable RAG sub-tree.
  4. Streaming: one span for the whole call. Use events for chunk-level granularity. Use metrics (histograms) for chunk gap distributions, not spans.
  5. Multi-call orchestration:
       • Sequential - spans appear in order, each a child of the workflow span.
       • Parallel - sibling spans with overlapping time ranges.
       • Map-reduce - N sibling "map" spans, then a "reduce" span that depends on them. Most tracing UIs render this naturally as long as the parent context is propagated to each parallel call.
       • Loops - every iteration is its own span. Don't reuse a span across iterations; you lose per-iteration timing and the trace becomes unreadable.

4.2 An agentic example

A user types "summarize the last 10 emails about X and draft a reply." The trace tree:

workflow: handle_user_request                   (span A, 4.2s)
  retrieve_emails                                (span B, 0.3s)
  llm_planner: chat claude-3-5-sonnet            (span C, 1.1s)   -> decides to call summarize_email tool 10x
    tool: summarize_email                        (span D1, 0.6s)  -> child llm call
      llm_summary: chat claude-3-5-haiku         (span D1.1, 0.5s)
    tool: summarize_email                        (span D2, 0.7s)
      llm_summary: chat claude-3-5-haiku         (span D2.1, 0.6s)
    ... (D3..D10 in parallel)
  llm_drafter: chat claude-3-5-sonnet            (span E, 2.0s)   -> writes the reply

Span A is the workflow. B is a deterministic tool call (DB query). C is the planner LLM. D1..D10 are tool calls dispatched in parallel; each contains a child LLM call (the summarizer). E is the final composer LLM call. Total wall-clock ≈ 4.2s (0.3 + 1.1 + 0.7 + 2.0, plus orchestration overhead); the parallel summarizer block contributes only ~0.7s (the slowest of the 10).

This shape is legible: a new engineer reading it understands the architecture in 30 seconds. The dollar cost can be computed by summing token-usage attributes across the tree. The latency bottleneck is obvious (span E, the composer). The shape is achievable in OTel-Python with one decorator and consistent context propagation; nothing exotic.

4.3 The streaming rule, derived

It's tempting to emit a span per chunk in streaming, because each chunk is a network event. Don't. Reasoning:

  • Backends charge per span (storage cost). 100 chunks × N requests/sec quickly exceeds budget.
  • Trace UIs render badly with thousands of micro-spans.
  • The questions you actually want to answer (TTFT, TPOT, chunk-gap variance) are histogram questions, not span questions.

Solution: one span for the call, with gen_ai.choice events at the start and end, and metrics for chunk dynamics:

from opentelemetry import metrics

meter = metrics.get_meter("myapp.llm")
chunk_gap_histogram = meter.create_histogram(
    "llm.streaming.chunk_gap_seconds",
    description="Time between consecutive output chunks during streaming.",
)

# Inside the streaming loop, on each chunk:
chunk_gap_histogram.record(now - last_chunk_time, attributes={"model": model, "feature": feature})

Histograms scale; spans don't.

4.4 SRE bridge

In your past work, request-level spans had child spans for each downstream call (DB, cache, external API). Same pattern here-the "downstream calls" of an LLM service include LLM calls and tool calls. The discipline of "every external call gets a span" carries over verbatim.

4.5 Mini-exercise

Take an existing agentic flow in your codebase (or write a small one: a planner LLM that calls 3 search tools and a summarizer LLM). Add OTel spans following the rules above. Open the trace in Tempo or Jaeger. If the architecture isn't obvious from the tree, the spans are wrong; iterate.


5. Cost attribution

Cost is the new latency. In 2024 most LLM-product post-mortems were about latency or quality; in 2025 the plurality were about cost surprises. The teams that handle this well treat cost as a first-class signal with its own dashboards, alerts, and SLOs.

5.1 The base computation

Per-span cost is mechanical:

def cost_usd(provider, model, input_tokens, output_tokens, cache_read_tokens=0, cache_write_tokens=0):
    # Keys match the PRICES JSON below: "provider|model" -> per-1M-token rates.
    p = PRICES[f"{provider}|{model}"]
    # Assumes input_tokens includes cached tokens; some providers (e.g. Anthropic)
    # report cache reads/writes in separate usage fields that are NOT part of
    # input_tokens. Adjust the subtraction to your provider's accounting.
    return (
        (input_tokens - cache_read_tokens - cache_write_tokens) * p["input"] / 1_000_000
        + cache_read_tokens * p["cache_read"] / 1_000_000
        + cache_write_tokens * p["cache_write"] / 1_000_000
        + output_tokens * p["output"] / 1_000_000
    )

Maintain PRICES as a JSON file in the repo. Update it monthly. An illustrative shape (as of ~2025; verify with provider docs before relying on these for billing):

{
  "anthropic|claude-3-5-sonnet-20241022": {
    "input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75
  },
  "anthropic|claude-3-5-haiku-20241022": {
    "input": 0.80, "output": 4.00, "cache_read": 0.08, "cache_write": 1.00
  },
  "openai|gpt-4o-2024-08-06": {
    "input": 2.50, "output": 10.00, "cache_read": 1.25, "cache_write": 0.00
  }
}

Numbers are illustrative; pricing changes. The structure does not.

Emit cost as both a metric and a span attribute:

span.set_attribute("app.llm.cost_usd", cost)
cost_counter.add(cost, attributes={
    "provider": provider, "model": model, "feature": feature, "tenant": tenant,
})

5.2 The aggregation dimensions

You will be asked, in roughly this order:

  1. What did we spend yesterday?-sum(increase(llm_cost_usd_total[24h])). (rate() would give dollars per second; increase() gives the total over the window.)
  2. Per feature?-sum by (feature) (increase(llm_cost_usd_total[24h])). The chart that triggers most cost discussions.
  3. Per tenant?-sum by (tenant) (increase(llm_cost_usd_total[24h])). Critical for multi-tenant SaaS pricing decisions.
  4. Per prompt-template version?-sum by (prompt_sha) (increase(llm_cost_usd_total[24h])). The chart that catches "we shipped a longer system prompt and didn't notice it doubled cost."
  5. Per model?-sum by (model) (increase(llm_cost_usd_total[24h])). Shows whether your routing is sending too much to the expensive model.

Each of these requires an attribute on the cost metric. Keep the attribute set tight (see 5.4 on cardinality).

5.3 The prompt-version regression pattern

This is the highest-leverage pattern in the chapter. Prompts evolve continuously; engineers tend to add ("oh let's also tell it to ..."), rarely subtract. After a year, the system prompt is 3000 tokens longer than it was, and nobody knows which addition was worth it.

The detection mechanism:

# Pseudocode for a daily job
yesterday = query_metric("llm_cost_usd_total", group_by=["feature", "prompt_sha"], window="24h")
calls = query_metric("llm_requests_total", group_by=["feature", "prompt_sha"], window="24h")

for (feature, sha), cost_24h in yesterday.items():
    baseline_per_call = trailing_baseline_per_call(feature, exclude_sha=sha)  # e.g. prior 7 days
    current_per_call = cost_24h / calls[(feature, sha)]
    if current_per_call > 1.5 * baseline_per_call:
        alert(f"Prompt regression: {feature}/{sha[:8]} costs {current_per_call:.4f}/call, "
              f"baseline {baseline_per_call:.4f}/call.")

Fifty lines of code, one Slack alert, saves five-figure monthly bills. Exercise 4 has you build it.

5.4 The cardinality trap

Tagging cost metrics with user_id looks tempting-"which user is the most expensive?" is a reasonable question. Don't do it. A user_id label produces a unique time series per user; with 100K users you have 100K time series, and Prometheus (and most metric backends) fall over. The time-series cost dwarfs the LLM bill.

Three safer patterns:

  1. Tenant, not user. tenant_id typically has hundreds or low thousands of distinct values; that's manageable cardinality.
  2. Top-N tracking. A daily job computes the top 100 users by cost from raw logs/traces, writes them to a low-cardinality "top users" table that the dashboard queries.
  3. Sampled per-user metrics. Hash user_id and only emit a metric for 1% of users, with the user_id as a label; multiply by 100 for population estimates. Bounded cardinality, statistically representative.
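
A sketch of pattern 3, deterministic hash sampling (the helper name is illustrative):

import hashlib

def in_user_metric_sample(user_id: str, rate: float = 0.01) -> bool:
    # Deterministic: the same user is always in or out of the sample,
    # so per-user series are stable over time.
    digest = int(hashlib.sha256(user_id.encode()).hexdigest()[:8], 16)
    return (digest % 10_000) < rate * 10_000

# At metric-emission time:
# if in_user_metric_sample(user_id):
#     cost_counter.add(cost, {"sampled_user_id": user_id})  # scale estimates by 1/rate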

The general rule, recycled from your prior life: labels are dimensions of slicing, not identifiers of individuals. Anything that looks like an opaque ID needs to go in logs/traces, not in metric labels. This is one of the fastest things you'll teach AI-only engineers.

5.5 SRE bridge

Cost SLOs are budget SLOs. The arithmetic is identical to availability SLOs (section 12). "Per-feature daily cost ≤ $X" is a hard SLO; the error budget is the daily slack, the burn rate is consumption velocity. Page when burn rate would exhaust the monthly budget before month-end.
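
A sketch of the burn-rate arithmetic against a daily cost budget (the budget figure is hypothetical):

def cost_burn_rate(spent_usd: float, daily_budget_usd: float, elapsed_hours: float) -> float:
    # 1.0 = on track to land exactly on budget; 2.0 = exhaust it by midday.
    expected_by_now = daily_budget_usd * (elapsed_hours / 24)
    return spent_usd / max(expected_by_now, 1e-9)

# Example: $140 spent by 08:00 against a $200/day budget
# -> burn rate 2.1, page the on-call engineer.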


6. Latency breakdown

Total latency is a sum of four contributions; understanding each lets you point at the right cause when a number drifts.

6.1 The components

For a non-streaming call:

total_latency = network_rtt + provider_queue + prefill + decode

  • Network RTT. Your client to provider edge. ~10–80ms typically. Stable per region; watch for sudden jumps.
  • Provider queue. Time spent waiting for a slot before the model starts processing. Highly variable under load; this is what spikes during provider incidents.
  • Prefill. Roughly proportional to input tokens. The model "reads" the prompt. Per-token prefill cost varies by model size; for big models it can be 1–5ms per input token.
  • Decode. Output token generation. Also roughly per-token, dominated by the autoregressive sampling loop. ~20–50ms per token for large models.

For a streaming call, TTFT corresponds to network_rtt + provider_queue + prefill. TPOT corresponds to decode per token. The decomposition is cleaner under streaming, which is a small reason to prefer streaming for instrumentation purposes.

6.2 What you can measure vs. what you can't

You can measure:

  • total_latency - wall clock across the call.
  • ttft - first chunk arrival.
  • tpot - chunk-arrival cadence.
  • network_rtt - separately, by pinging the provider's API endpoint.

You cannot directly measure:

  • provider_queue - opaque to you.
  • The prefill vs. decode split - providers don't expose it; you can estimate it using the model's known per-token rates.

Most providers don't separate queue and prefill. What you can do is track TTFT and decompose statistically:

TTFT ≈ network_rtt + provider_queue + (input_tokens × per_token_prefill_estimate)

If TTFT spikes while input_tokens stays stable and network_rtt stays stable, the spike is provider-side queue or backend issues. That's actionable-you page the provider's status page or fail over to a secondary.
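
One way to fit the per_token_prefill_estimate from your own healthy-period traffic is a simple linear regression of TTFT against input tokens (a sketch; assumes numpy and collected (input_tokens, ttft) pairs):

import numpy as np

def estimate_prefill_rate(input_tokens: list[int], ttft_s: list[float]) -> dict:
    # Slope ~ per-token prefill cost; intercept ~ network RTT + typical queue time.
    slope, intercept = np.polyfit(input_tokens, ttft_s, 1)
    return {"per_token_prefill_s": slope, "rtt_plus_queue_s": intercept}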

6.3 The "tokens per second decoded" metric

A single derived metric that is more useful than raw latency:

tokens_per_second_decoded = output_tokens / (total_latency - ttft)

This is TPOT inverted, normalized across request sizes. It's a clean, model-comparable number: GPT-4o at full health decodes ~30–80 tok/s, Claude 3.5 Sonnet ~40–80 tok/s, Haiku/4o-mini 100+ tok/s. When this number drops noticeably, the provider is unhealthy.

hist_decode_rate = meter.create_histogram("llm.decode.tokens_per_second")
# At end of streaming call:
decode_rate = output_tokens / max(total - ttft, 1e-3)
hist_decode_rate.record(decode_rate, attributes={"provider": provider, "model": model})

6.4 SRE bridge

The decomposition replaces the traditional db_time + app_time + network_time breakdown. Same instinct, different layers. When latency spikes, your first move (in both worlds) is to identify which layer moved. The mental discipline is the same.


7. Token usage tracking

Tokens are the unit of measurement that drives both cost and latency. Track them precisely.

7.1 The categories

  • Input tokens-system prompt + tool definitions + conversation history + current user message. All of it. The provider tokenizes it all.
  • Output tokens-the model's response.
  • Cache-read tokens-input tokens served from prompt cache (Anthropic prompt caching, OpenAI cached input). Priced at 10–50% of the normal input rate, depending on provider (10% for Anthropic cache reads, 50% for OpenAI cached input). Worth tracking separately.
  • Cache-write tokens-input tokens written into cache on this call. Priced at ~125% of normal input rate (Anthropic) or free (OpenAI). Track to ensure you're amortizing the write across enough reads.
  • Reasoning tokens-for reasoning models (o1, o3, Claude with extended thinking), the hidden chain-of-thought tokens. Charged as output but not visible in the response. Track separately if you use reasoning models.

7.2 The cache-hit-rate signal

Prompt caching is the single biggest cost lever for chat applications with long system prompts. The signal to monitor is cache hit rate:

cache_hit_rate = cache_read_input_tokens / input_tokens

Per feature, per model. A healthy chat app with a stable system prompt should see cache hit rates of 70–95% on conversations after the first turn. If your cache hit rate drops, something invalidated the cache-typically a system prompt change, a date stamp leaking into the prompt, or a tool definition that varies per request.

hist_cache_hit = meter.create_histogram("llm.cache_hit_ratio")
hist_cache_hit.record(
    cache_read / max(input_tokens, 1),
    attributes={"feature": feature, "model": model},
)

Alert when the rate drops by >20% week-over-week. Saves more money than almost anything else you can instrument.

7.3 Per-conversation token tracking

In chat applications, per-call tokens hide the real story. The user starts a conversation, exchanges 30 turns over an hour, and now each call is 50K input tokens because history accumulates. The 30th call costs 30x what the 1st cost-and the user has no idea.

Track cumulative tokens per conversation:

# At the end of each call
running = redis.incrby(f"conv_tokens:{conversation_id}", input_tokens + output_tokens)
span.set_attribute("app.llm.conversation.cumulative_tokens", running)
if running > THRESHOLD_WARN:
    span.set_attribute("app.llm.conversation.expensive", True)

This lets you build an "expensive conversations" dashboard and decide whether to summarize/compact history at certain thresholds.

7.4 The token-budget pattern

For each feature, define a token budget-the maximum input + output you expect a single call to consume in normal operation. Set it based on observed p99 plus 50% headroom. Alert when calls exceed the budget; they typically indicate a runaway loop, a giant pasted document, or a prompt regression.

TOKEN_BUDGETS = {
    "summarize_email": 8_000,
    "draft_reply": 12_000,
    "agent_planner": 30_000,
}
if input_tokens + output_tokens > TOKEN_BUDGETS.get(feature, float("inf")):
    span.set_attribute("app.llm.budget_exceeded", True)

A simple counter on this attribute, alerted, catches more bugs than any other single signal.


8. Sampling, not logging everything

The naive approach: log every prompt and every completion. The naive approach is wrong, for three reasons.

  1. Volume. A modest LLM service doing 1M calls/day with 5K tokens each is logging roughly 20GB/day of raw text (5B tokens at ~4 bytes each). At cloud-storage rates (~$0.02/GB-month) that's a few hundred dollars a year, but the indexing and search costs in tools like Datadog or Splunk are an order of magnitude more.
  2. Privacy. Every prompt potentially contains PII. Storing it all multiplies your compliance surface area (GDPR, HIPAA, SOC 2). If you don't need it, don't store it.
  3. Debuggability. 20GB/day is hard to search. Engineers will give up before they find the relevant entry. Less, more relevant data is more debuggable than more, less relevant data.

The right approach is multi-tier sampling: keep the trace skeleton always, sample content selectively.

8.1 Tail-based sampling

The classic pattern, ported from APM:

  • 100% of errors. Any span with status=error or any of the error classes from section 2.3 (validation, guardrail) is kept with full content.
  • 100% of slow requests. Anything above your p95 latency. These are the ones you'll be asked about.
  • 100% of high-cost requests. Anything above your token budget. These are the ones that drive bills.
  • 1% of normal traffic. Statistically representative for routine analysis.

OpenTelemetry Collector has a tail-sampling processor that implements this directly. The decision is made after the trace is complete, so it can use the full trace's properties (latency, status). Configuration sketch:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 5000 } }
      # numeric_attribute compares int64 values; if cost is stored as a float
      # attribute, emit an integer mirror (e.g. cost in micro-USD) for this policy.
      - { name: expensive, type: numeric_attribute,
          numeric_attribute: { key: app.llm.cost_micro_usd, min_value: 100000 } }
      - { name: baseline, type: probabilistic, probabilistic: { sampling_percentage: 1 } }

8.2 Skeleton vs. content tiers

A finer separation: always record the skeleton (span name, attributes, timing, token counts, cost) for all traffic; record full prompt and completion content only for the sampled subset.

This gives you:

  • 100% accurate metrics (counts, costs, latency); the skeleton is everything you need.
  • 100% accurate per-feature aggregations.
  • Sampled but representative content for debugging.

In OTel terms: attributes on every span, events (which carry the heavy content) only added when a "log_content" flag is on. Decide the flag at request entry based on tail-sampling rules where possible (some are knowable upfront, like "this user is in the debug cohort"), and at span end for the rest.
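
A sketch of the gate, assuming the redact_regex helper from section 9 (the function name is illustrative):

def maybe_add_content_events(span, system, messages, log_content: bool):
    # Skeleton attributes (tokens, cost, timing) are set unconditionally elsewhere;
    # only the heavy, sensitive content events are gated.
    if not log_content:
        return
    span.add_event("gen_ai.system.message", {"content": redact_regex(system)})
    for m in messages:
        span.add_event(f"gen_ai.{m['role']}.message", {"content": redact_regex(m["content"])})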

8.3 Privacy-redacted logging

For the sampled content you do keep, redact aggressively. Section 9 goes deeper; the sampling-side principle is: redaction is the default, opt-out is rare and audited.

8.4 Exercise hook

Exercise 5 has you write the tail-sampling rule explicitly. Worth doing because the configuration is finicky and the trade-offs are illuminating.


9. Privacy and PII

Prompts contain PII because users put PII in prompts. They paste emails (containing addresses and names), they dictate medical histories, they share customer records. If you log prompts indiscriminately, your observability system becomes a regulated data store overnight.

9.1 The redaction layers

Combine multiple redaction techniques; each catches different things:

  • Regex-high-precision detection of structured PII: emails, phone numbers, credit cards, SSNs, IP addresses, common ID formats. Cheap, fast, deterministic.
  • NER (spaCy or similar)-entity recognition for names, organizations, locations. Lower precision but covers what regex misses. ~10ms per prompt at small batch sizes.
  • LLM-based redactor-a small/cheap model (Haiku, 4o-mini) prompted to identify and redact PII. Catches contextual PII (medical conditions, financial states) that regex and NER miss. Slow and ironic ("we use an LLM to make our LLM logs safe"), but for sensitive domains it's necessary.

Layer them: regex first (cheap and high precision), NER second (covers the regex gaps), LLM last (only for high-sensitivity surfaces or unsampled content).

import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\b(?:\+?\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_regex(text):
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    text = SSN.sub("[SSN]", text)
    return text
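
The NER layer, sketched with spaCy's small English model (install and model-download steps omitted; which entity labels to redact is a judgment call):

import spacy

nlp = spacy.load("en_core_web_sm")

def redact_ner(text: str) -> str:
    doc = nlp(text)
    # Replace right-to-left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "ORG", "GPE", "LOC"}:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text

def redact(text: str) -> str:
    # Layered: regex first (cheap, high precision), NER second.
    return redact_ner(redact_regex(text))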

9.2 Storage and retention

  • Encryption at rest. Trace and log stores must use encryption (most managed backends do this by default; verify).
  • Per-tenant isolation. For multi-tenant SaaS, ensure logs are partitioned per tenant in queryable backends. A tenant-A engineer must not be able to query tenant-B traces.
  • Retention policy. 90 days is typical for production traces. Shorter for sampled-content tiers; longer for aggregated metrics. Encode it in the backend, not in human discipline.
  • Right to deletion. GDPR Article 17 obliges you to delete user data on request. This means your trace store needs the ability to find-and-delete by user_id (which is why user_id should be in span attributes-high cardinality but searchable-even though it's not in metric labels).

9.3 The audit trail

Treat access to raw prompt content as a privileged operation. Log who queried which traces, with what filter. In a regulated environment this is a compliance requirement; in an unregulated one it's still good hygiene because a single curious engineer reading user conversations is a story you don't want to be in.

9.4 SRE bridge

You have done access logging and retention work in your SRE life. The same patterns apply with one new wrinkle: the data here is qualitatively more sensitive than typical service logs because users say more in chat than they do in form fields. Set the bar one notch higher than you would for a CRUD service.


10. Drift detection

Models don't change without you knowing (assuming you pin model versions; do this-never use claude-3-5-sonnet-latest in production). But the world the model operates in does. Drift detection catches changes in inputs, outputs, and quality before users complain.

10.1 Input drift

Things that should be roughly stable in healthy operation:

  • Prompt length distribution. Median, p95, p99 of input tokens. A sudden right-shift means users (or your code) are sending bigger inputs-could be a feature change, could be a runaway loop.
  • Language distribution. Detect language per request (langdetect or fasttext); track distribution. A sudden shift to a new language is a sign of either a new market or an attack.
  • Topic distribution. Embed each prompt with a small embedding model; cluster periodically; track cluster proportions. Compute KL divergence between this week's distribution and a 4-week baseline. KL > threshold → topic drift alert.

import math

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

The KL signal is the gold standard; it catches things human eyes miss. Run it daily.

10.2 Output drift

  • Response length distribution. If outputs suddenly get longer, the model is "yapping more"-usually a prompt change, sometimes a model-version flip.
  • Refusal rate. Fraction of responses that decline ("I can't help with that"). Spikes indicate a policy change on the provider side or an input distribution shift toward sensitive topics.
  • Failed-JSON rate. For structured-output features, fraction of responses that fail to parse. Spikes are often due to a model update changing formatting habits.
  • Sentiment. Optional but cheap; aggregate sentiment of responses. Sudden mood shifts are a useful soft signal.
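
A sketch of the two cheapest signals, failed-JSON rate and a crude refusal heuristic (the refusal markers are illustrative; a small classifier does this better):

import json
from opentelemetry import metrics

meter = metrics.get_meter("myapp.llm")
failed_json_counter = meter.create_counter("llm_failed_json_total")
refusal_counter = meter.create_counter("llm_refusals_total")

REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")

def record_output_signals(text: str, feature: str, structured: bool):
    if structured:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failed_json_counter.add(1, {"feature": feature})
    if text.strip().lower().startswith(REFUSAL_MARKERS):
        refusal_counter.add(1, {"feature": feature})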

10.3 Quality drift

Quality drift requires evaluation, which is the topic of chapter 08. The integration is: a continuous-eval pipeline runs evals on a sampled subset of production traffic daily and writes scores back as metrics. Your dashboards then have eval_score_p50 alongside latency_p50 and you can see them all move together.

10.4 The drift dashboard

A single drift dashboard should show:

  • Input length distribution overlaid on baseline.
  • Language distribution stacked-area.
  • KL divergence of topic distribution (single line, alert when above threshold).
  • Output length distribution overlaid on baseline.
  • Refusal rate, failed-JSON rate, eval score (three lines).

Five panels. When something drifts, you know within 15 seconds of opening the dashboard which signal moved. This is the second portfolio dashboard.


11. Debugging in production

A production bug report arrives: "the model gave me a weird answer at 14:32." What do you need to reproduce and fix it?

11.1 Trace replay

The replay primitive: given a trace ID, fetch:

  • The exact input messages.
  • The exact system prompt (by prompt.template.id + sha).
  • The exact model version (gen_ai.response.model).
  • The exact temperature and other parameters.
  • The exact tool definitions (if applicable).

Re-issue the call. The output will likely differ (non-determinism), but if the bug is reproducible at all, you'll see it in 1–10 retries. If not, the bug was a one-off and you have the original trace's input/output captured for analysis.

def replay(trace_id):
    # fetch_trace / fetch_prompt_by_sha are your app's helpers over the trace
    # store and the prompt registry; build them alongside this function.
    trace = fetch_trace(trace_id)
    llm_span = trace.find_span(operation="chat")
    prompt = fetch_prompt_by_sha(llm_span.attributes["app.llm.prompt.template.sha"])
    return client.messages.create(
        model=llm_span.attributes["gen_ai.response.model"],
        system=prompt.system,
        messages=prompt.assemble(llm_span.events),
        max_tokens=llm_span.attributes["gen_ai.request.max_tokens"],
        temperature=llm_span.attributes["gen_ai.request.temperature"],
    )

A replay(trace_id) function in your repo is a 50-line investment that pays back the first time you debug a production issue. Add it on day one.

11.2 Prompt diffs

When quality drops on a feature, the question "did the prompt change?" must be answerable in 30 seconds. Mechanism:

  • Prompts versioned in git (or a dedicated prompt store with versioning).
  • Every span tagged with prompt.template.sha.
  • A dashboard that, given a feature, shows quality and cost metrics overlaid with vertical lines marking each prompt-version change. Eyeballing the chart against the version markers is usually enough.
  • A git diff <old_sha> <new_sha> button in the dashboard, or at minimum a documented procedure.

11.3 A/B traceability

If you run prompt or model A/B tests (you should), every span needs an experiment.variant attribute. Then the same dashboards filter by variant, and you can compare control vs. treatment without separate dashboards.

span.set_attribute("app.llm.experiment.id", "system_prompt_shortening_v3")
span.set_attribute("app.llm.experiment.variant", "control" or "treatment")

11.4 The "five questions" runbook

When you investigate an LLM incident, the five questions, in order:

  1. Did anything deploy in the last 24h? Code, prompt, model version, infra. (95% of incidents are caused by a recent change.)
  2. Did the input distribution change? Drift signals from section 10.
  3. Is the provider healthy? Status page; decode-tokens/sec; queue saturation.
  4. Is it a specific feature/tenant/segment? Slice your dashboard.
  5. Reproduce with replay(). If reproducible, fix; if not, it's an outlier and you collect more cases.

This runbook structure is identical to the one you used for traditional services, with question 2 (input distribution) added. The transferable instinct is huge.


12. The bridge from Datadog/SRE

This section is the explicit translation table from your existing skill set. It is where the chapter pays off.

12.1 SLIs

A Service Level Indicator is a quantitative measure of one aspect of service health, expressed as a ratio. For LLM services:

  • API success rate = (non-error_calls) / (total_calls) over a window.
  • Output validity rate = (calls_with_valid_structured_output) / (total_calls). Only meaningful for structured-output features.
  • TTFT-good rate = (calls_with_ttft < 1s) / (total_calls).
  • Total-latency-good rate = (calls_with_total < 5s) / (total_calls).
  • Cost-per-call SLI = (calls_with_cost < $0.10) / (total_calls). Yes, cost can be an SLI.
  • Quality SLI = (sampled_calls_with_eval_score >= 0.8) / (sampled_calls).

Each is a number between 0 and 1.

12.2 SLOs

A Service Level Objective is a target for an SLI over a window. Example SLO set for an LLM-powered chatbot:

service: chatbot-v2
slos:
  - name: api_success
    sli: (non-error_calls) / (total_calls)
    objective: 0.995
    window: 30d
  - name: ttft_under_1s
    sli: (calls_with_ttft < 1s) / (total_calls)
    objective: 0.95
    window: 30d
  - name: total_under_5s
    sli: (calls_with_total < 5s) / (total_calls)
    objective: 0.99
    window: 30d
  - name: cost_under_threshold
    sli: (calls_with_cost < 0.10) / (total_calls)
    objective: 0.99
    window: 30d
  - name: quality_above_threshold
    sli: (sampled_calls_with_eval_score >= 0.8) / sampled_calls
    objective: 0.95
    window: 7d

Notice nothing about this is LLM-specific in shape. You wrote SLOs of this form in your last job. The bridge is exactly what it looks like.

12.3 Error budgets

An error budget is (1 - SLO) worth of bad events. For a 99.5% availability SLO over 30 days:

budget = 0.005 × 30 × 24 × 60 = 216 minutes/month

For the LLM-cost SLO above (99% of calls < $0.10), the budget is "1% of calls may exceed $0.10"-measured in calls, not minutes. Same arithmetic, different unit.

12.4 Burn rates

Burn rate = how fast you are consuming the budget. A 1x burn rate means you'll exhaust the budget exactly at the end of the window. A 2x burn rate means you'll exhaust it at the halfway point.

Multi-window, multi-burn-rate alerting is the standard:

  • Fast burn (page now): 14.4x burn over 1h. Catches "2% of monthly budget consumed in the last hour."
  • Slow burn (ticket): 1x burn over 6h. Catches "we're trending toward exhaustion."
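
In PromQL, a fast-burn condition for the api_success SLO above might look like this (metric names follow this chapter's conventions):

# error budget = 1 - 0.995 = 0.005; page when the 1h error ratio
# exceeds 14.4x the budget.
(
  sum(rate(llm_api_errors_total[1h]))
  /
  sum(rate(llm_requests_total[1h]))
) > 14.4 * 0.005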

The numbers come from Google's SRE workbook; they apply unchanged to LLM SLOs. This is portable knowledge-your strongest single asset.

12.5 The on-call experience

Your on-call rotation for an LLM service looks similar to a traditional one:

  • Pager fires on fast-burn SLO alerts.
  • Runbook (section 11.4) tells you what to do.
  • Common actions: roll back the prompt, fail over to a secondary provider, throttle a runaway feature.

Some incidents are LLM-specific (prompt regression, model deprecation), but the operational frame-alert → runbook → mitigate → post-mortem-is yours already.


13. Tool landscape

The market in 2025–2026 is fragmented and moving. Pick based on architecture, not brand. Below: descriptive notes, not endorsements.

13.1 Langfuse (open source, self-hostable)

Tracing-first; eval and prompt management as adjacent surfaces. Self-hosting is a first-class option (from a simple Docker Compose up to a full production deployment). Strong fit for teams who want to keep prompts and traces inside their own infrastructure for compliance reasons. Has its own SDK; OTel support is via a bridge.

When to pick: you need self-hosting; you want a UI specifically designed for LLM tracing rather than a general-purpose APM repurposed.

13.2 LangSmith (closed, by LangChain)

Same shape as Langfuse, managed-only. Tightest integration with LangChain code paths. Pricing per trace.

When to pick: you're already on LangChain heavily; you want managed and don't have compliance reasons to self-host.

13.3 Arize / Phoenix

Arize is the commercial side; Phoenix is the open-source counterpart. Strong on drift detection and eval pipelines in addition to tracing. Roots in classical ML platform tooling, so the lineage features feel mature.

When to pick: drift and eval are first-order concerns for you; you want one tool covering both pre-production and production observability.

13.4 Helicone (open source proxy)

Different architecture: it's an HTTP proxy. You point your client at Helicone instead of the provider; Helicone forwards and records. Zero code change for instrumentation, at the cost of an extra hop and a single point of failure.

When to pick: you want to add observability with minimal code changes; you're comfortable with the proxy architecture.

13.5 OpenLLMetry (open source library)

OTel-native instrumentation library. Auto-instruments common LLM SDKs (Anthropic, OpenAI, etc.) so spans are emitted with gen_ai.* attributes without manual code. Send to any OTel-compatible backend (Tempo, Jaeger, Datadog, Honeycomb).

When to pick: you want to use existing OTel infra and add LLM coverage to it. This is often the right answer for teams already invested in OTel.

13.6 Datadog LLM Observability

Native LLM module inside Datadog. If you're already paying for Datadog, this is the cheapest start: enable, install, see traces in the existing dashboards.

When to pick: you're a Datadog shop; a unified dashboard with non-LLM services is more important than picking the best-of-breed LLM tool.

13.7 The decision shape

Three axes:

  • Self-hosted vs. managed. Compliance and cost.
  • OTel-native vs. proprietary SDK. Portability and lock-in.
  • LLM-specialized vs. general-APM-with-LLM-bolt-on. Depth of LLM features vs. unified ops.

Your background suggests Datadog LLM Observability or OpenLLMetry-into-existing-stack as the natural starting points. Build a small POC with both; pick on observed ergonomics.


14. Production runbook patterns

Five recurring incidents, with the diagnostic moves for each.

14.1 Latency spike

Diagnosis:

  1. Open the latency dashboard. Which signal moved: TTFT, TPOT, or total?
  2. Slice by gen_ai.response.model. Did one model degrade? (Provider issue.)
  3. Slice by app.llm.feature. Did one feature degrade? (Prompt or input change.)
  4. Check the provider's status page.
  5. Check the decode_tokens_per_second distribution. If it dropped, the provider is slow; not your fault.

Mitigation: if provider-side, fail over (section 14.5). If feature-side, roll back the recent change.

14.2 Cost spike

Diagnosis:

  1. Top features by cost over the last 24h vs. baseline. Which feature spiked?
  2. For that feature, top prompt.template.sha by cost. New version?
  3. Top calls by cost_usd for the feature. Are they all big inputs (one user pasted a massive document?) or all big outputs (model is rambling)?
  4. Check for stuck loops: the same conversation_id with cumulative_tokens climbing absurdly.

Mitigation: revert the prompt, add input-size caps, add loop guards. Section 7.4 token budget catches this earliest.

14.3 Quality drop

Diagnosis:

  1. Eval score dashboard. Which feature dropped?
  2. Per prompt.template.sha, what's the eval score? Is the new version worse?
  3. Per gen_ai.response.model, what's the eval score? Did the provider silently change the model?
  4. Per input segment (language, topic cluster), where is the drop concentrated?

Mitigation: roll back prompt; pin model version; if provider changed, escalate.

14.4 Refusal spike

Diagnosis:

  1. Refusal rate by feature. Spread across all features (provider policy change) or one feature (input distribution change)?
  2. Per language. Did refusals concentrate in a new language?
  3. Sample 10 refusals. Read them. Are they reasonable?

Mitigation: if provider-policy: contact provider, prepare for a model swap. If input-distribution: check whether new traffic is legitimate; if so, adjust prompt to handle.

14.5 Provider outage

Diagnosis:

  1. Is the 5xx rate spiking? Are the errors clustered in one provider?
  2. Is the status page red?

Mitigation:

  1. Circuit-break: stop sending to the failing provider after N consecutive 5xx.
  2. Fail over to the secondary provider. Have the routing in place before the incident; you can't write it during.
  3. Communicate: status page, in-app banner.
  4. Post-incident: review SLO impact, replenish budget if applicable.

The fail-over capability is non-trivial because different providers have different APIs, different prompt-caching behaviors, and different output styles. Maintaining a "secondary that actually works" is engineering work, not a config switch. Teams that take this seriously have a quarterly fail-over drill.
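
A minimal circuit-breaker sketch for mitigation step 1 (thresholds are illustrative; production versions add per-model granularity and jittered probes):

import time

class ProviderCircuitBreaker:
    def __init__(self, threshold: int = 5, cooldown_s: float = 60.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.consecutive_failures = 0
        self.opened_at = None

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.opened_at = time.monotonic()

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a probe once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_s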


15. Custom dashboards: the SRE-AI engineer's first artifact

The dashboard you build in your first two weeks on an AI team is the artifact that demonstrates the moat. Build it well and it follows you to interviews.

15.1 Layout

Top of fold (executive summary):

  • SLO compliance traffic-light: green/yellow/red per SLO.
  • Cost-per-day, last 7 days, with trendline.
  • Error budget burn rate (number + arrow).

Middle (operator view):

  • Latency: TTFT and total, p50/p95/p99, by feature.
  • Cost: per-feature, per-day stacked area.
  • Errors: API errors, validation errors, refusals; three separate lines.
  • Tokens: input, output, cache-read, cache-write; stacked area.
  • Cache hit rate: per feature.
  • Decode rate (tokens/sec): per provider/model.

Bottom (debug surface):

  • Trace explorer filtered to the last 1h, sorted by latency descending.
  • Recent failures: trace_id, feature, error_class, model.
  • Recent prompt-version changes (annotation strip across all charts).

15.2 Why this layout

Top of fold answers the executive question: are we okay? Middle answers the operator question: what's moving? Bottom is the debug surface for incidents. Every dashboard you build for an LLM service should follow this three-tier layout; it scales from one feature to fifty.

15.3 The portfolio shape

For your portfolio, build this against a real LLM service (your own side project counts), screenshot it, and write a one-page README explaining each panel and why it exists. That artifact, attached to a GitHub repo with the corresponding instrumentation code, is more compelling than any certificate.


16. Building from scratch (no SaaS)

You don't have to start with a vendor. The minimal viable stack:

  • OTel SDK (Python) for instrumentation.
  • OTel Collector as the routing layer.
  • Tempo (Grafana) for trace storage.
  • Prometheus for metrics.
  • Grafana for dashboards.
  • A small Python decorator that auto-emits gen_ai.* spans.

Total: ~200 LOC of glue code; everything else is config. Worth doing once, even if you adopt a SaaS later, because:

  1. You understand exactly what's instrumented and why.
  2. You can debug instrumentation issues in any vendor by knowing what good output looks like.
  3. You build a transferable skill that's not vendor-locked.

16.1 The decorator

import time
import functools
from opentelemetry import trace

tracer = trace.get_tracer("myapp.llm")

def trace_llm_call(provider, feature):
    """
    Decorator that wraps an LLM client call and emits a gen_ai.* span.
    Expects the wrapped function to return an object with .model, .id, .stop_reason,
    .usage.input_tokens, .usage.output_tokens, .usage.cache_read_input_tokens (optional),
    .content (list of blocks with .type and .text).
    """
    def deco(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            model = kwargs.get("model", "unknown")
            with tracer.start_as_current_span(f"chat {model}") as span:
                span.set_attribute("gen_ai.system", provider)
                span.set_attribute("gen_ai.operation.name", "chat")
                span.set_attribute("gen_ai.request.model", model)
                if "max_tokens" in kwargs:
                    span.set_attribute("gen_ai.request.max_tokens", kwargs["max_tokens"])
                if "temperature" in kwargs:
                    span.set_attribute("gen_ai.request.temperature", kwargs["temperature"])
                span.set_attribute("app.llm.feature", feature)

                t0 = time.perf_counter()
                try:
                    resp = fn(*args, **kwargs)
                except Exception as e:
                    span.record_exception(e)
                    span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                    raise
                dt = time.perf_counter() - t0

                span.set_attribute("gen_ai.response.model", getattr(resp, "model", model))
                span.set_attribute("gen_ai.response.id", getattr(resp, "id", ""))
                span.set_attribute("gen_ai.response.finish_reasons",
                                   [getattr(resp, "stop_reason", "stop") or "stop"])
                span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
                span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)
                cache_read = getattr(resp.usage, "cache_read_input_tokens", 0) or 0
                if cache_read:
                    span.set_attribute("gen_ai.usage.cache_read_input_tokens", cache_read)
                # Cost: cost_usd() is the pricing helper (PRICES table) from
                # earlier in the chapter: per-token rates keyed by provider/model.
                cost = cost_usd(provider, model, resp.usage.input_tokens,
                                resp.usage.output_tokens, cache_read)
                span.set_attribute("app.llm.cost_usd", cost)
                span.set_attribute("app.llm.total_latency_s", dt)
                return resp
        return inner
    return deco

This is exercise 1 in production form. ~50 lines. The rest is configuration.
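Applying it to the real Anthropic SDK looks like this; the model id and feature name are illustrative:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

@trace_llm_call(provider="anthropic", feature="support_chat")
def ask(**kwargs):
    return client.messages.create(**kwargs)

resp = ask(model="claude-sonnet-4-20250514", max_tokens=512,
           messages=[{"role": "user", "content": "Hello"}])

The Messages API response exposes .model, .id, .stop_reason, and .usage.input_tokens / .usage.output_tokens, which is exactly the shape the decorator's docstring expects.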

16.2 The metric layer

Metrics can live in a small adjacent module that observes spans, but since the decorator already has everything in scope, with OTel you can emit them directly from inside it:

from opentelemetry import metrics

meter = metrics.get_meter("myapp.llm")
cost_counter = meter.create_counter("llm_cost_usd_total")
input_tokens_counter = meter.create_counter("llm_input_tokens_total")
output_tokens_counter = meter.create_counter("llm_output_tokens_total")
total_latency_hist = meter.create_histogram("llm_total_latency_seconds")

Then inside the decorator, after the call:

# Bounded-cardinality labels only: provider, model, feature. Never user_id.
attrs = {"provider": provider, "model": model, "feature": feature}
cost_counter.add(cost, attrs)
input_tokens_counter.add(resp.usage.input_tokens, attrs)
output_tokens_counter.add(resp.usage.output_tokens, attrs)
total_latency_hist.record(dt, attrs)

Metrics flow through the OTel Collector to Prometheus; spans flow through the same Collector to Tempo. Grafana queries both. End-to-end, you have a working observability system in a long afternoon.
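The Collector side of that sentence is a short config. This is a sketch assuming a local docker-compose-style setup; the hostnames and ports are assumptions, and the prometheus exporter ships in the Collector contrib distribution:

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/tempo:
    endpoint: tempo:4317     # Tempo's OTLP gRPC ingest
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889   # Prometheus scrapes the Collector here

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]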


17. Practical exercises

These are the artifacts. Doing them is the chapter; reading them isn't.

Exercise 1-@trace_llm_call decorator

Implement the decorator from section 16.1 against the real Anthropic Python SDK. Verify with a few calls that:

  • The span name is chat <model>.
  • gen_ai.system, gen_ai.request.model, gen_ai.response.id, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons are all set.
  • Errors are recorded with record_exception and the span status is ERROR.
  • Cost is computed correctly against your PRICES table.

Acceptance: trace JSON dumped to a local file matches a hand-written reference.

Exercise 2-Cost-per-feature from a 7-day trace dump

Given a JSON file with 100K spans (one per LLM call, with the attributes from section 3), produce:

  • Total cost.
  • Cost per feature, sorted descending.
  • Cost per (feature, prompt_sha).
  • Cost per tenant.
  • Cache hit rate per feature.

Acceptance: output is a Markdown report with five tables. Code in <100 lines.
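If you want a starting point rather than a blank file: the whole exercise is variations on one groupby. This sketch assumes one JSON span per line with its attributes under an "attributes" key (the file layout and name are assumptions):

import json
from collections import defaultdict

cost_by_feature = defaultdict(float)
with open("spans.jsonl") as f:  # filename is an assumption
    for line in f:
        attrs = json.loads(line)["attributes"]
        cost_by_feature[attrs["app.llm.feature"]] += attrs["app.llm.cost_usd"]

print("| feature | cost_usd |")
print("|---|---|")
for feature, cost in sorted(cost_by_feature.items(), key=lambda kv: -kv[1]):
    print(f"| {feature} | {cost:,.2f} |")

The other four tables are the same loop with different grouping keys.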

Exercise 3-SLO.yaml for a chatbot

Author an SLO.yaml for an LLM-powered support chatbot covering:

  • API success rate (objective and window).
  • TTFT (objective and window).
  • Total latency (objective and window).
  • Cost per call (objective).
  • Quality (eval score from sampled traffic).

Include error-budget arithmetic for each SLO and the fast-burn / slow-burn alert thresholds.

Acceptance: file is committable to a real service's repo and an on-call engineer could implement the alerts from it without further information.
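One plausible shape for the file, with the first SLO filled in. Every number here is a placeholder to replace with your own targets, not a recommendation:

slos:
  - name: api_success_rate
    sli: successful_llm_calls / total_llm_calls   # guardrail refusals counted separately
    objective: 99.5          # percent; placeholder
    window: 30d
    error_budget: "0.5% of calls (at 1M calls/month: 5,000 failed calls)"
    alerts:
      fast_burn: { burn_rate: 14.4, window: 1h }   # classic multiwindow values
      slow_burn: { burn_rate: 3,    window: 6h }
  # - name: ttft_p95 ... (same fields: objective, window, budget, alerts)
  # - name: total_latency_p95 ...
  # - name: cost_per_call ...
  # - name: eval_quality ...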

Exercise 4-Cost regression detector (<50 lines)

Given the trace dump from exercise 2 (or a streaming feed), detect when a (feature, prompt_sha) combination's cost-per-call exceeds 1.5x the same feature's baseline cost-per-call from the prior week. Emit a structured alert.

Acceptance: planted regression in the test data is detected; no false positives on a clean week. Code <50 lines.

Exercise 5-Tail-sampling rule

Write the OTel Collector tail-sampling configuration for:

  • 100% of errors.
  • 100% of spans where app.llm.total_latency_s > 5.
  • 100% of spans where app.llm.cost_usd > 0.10.
  • 1% baseline.

Test with a synthetic span stream; verify that the kept set has the right composition.

Acceptance: configuration file plus a 20-line test script that asserts sampling rates within tolerance.
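A scaffold for the processor block, assuming the contrib Collector's tail_sampling processor (policies OR together by default). One caveat to work around: numeric_attribute compares integer values, so a float attribute like app.llm.cost_usd needs an integer twin; the app.llm.cost_microusd attribute below is an assumption, not something emitted earlier in the chapter:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 5000}   # span duration stands in for app.llm.total_latency_s
      - name: keep-expensive
        type: numeric_attribute
        numeric_attribute: {key: app.llm.cost_microusd, min_value: 100000}  # $0.10
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 1}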

Exercise 6-Datadog → OTel migration plan

Imagine a fictional service with 20 LLM call sites currently instrumented with Datadog tracing. Write a migration plan to OTel that includes:

  • Inventory: catalog the 20 call sites and their current instrumentation depth.
  • Bridging: how to send OTel data to Datadog during the migration so dashboards keep working.
  • Cutover sequence: which call sites move first (low-risk), which last (high-traffic).
  • Validation: what metrics to compare before/after to confirm parity.
  • Rollback plan: how to revert if a regression appears.

Acceptance: a 2–3 page plan that an engineering manager could approve.


18. Closing-your unique advantage, made explicit

Most candidates entering AI engineering in 2026 will tell interviewers about their RAG project, their fine-tuning experiment, their agentic prototype. Almost none will be able to talk fluently about p99 TTFT, cardinality discipline in cost metrics, tail-sampling configurations, error-budget burn rates for cost SLOs, or the failure modes of trace replay under non-determinism.

That gap is your chapter. The skills above are not advanced AI knowledge; they are advanced operational knowledge applied to an AI substrate. You already have the operational knowledge. The translation work-what TTFT means, why cost is an SLI, why prompts get a SHA-is the bridge that this chapter built.

Two suggestions for converting reading into leverage:

  1. Ship the artifacts. The dashboard from section 15 and the decorator from section 16, applied to a real LLM-powered side project, with screenshots and a README. Linkable in an application; defensible in an interview.

  2. Develop the runbook reflex. When you read about an LLM incident in a postmortem (Anthropic, OpenAI, third-party), trace through the section 14 runbooks and ask: which signal would have caught it earliest? This builds the diagnostic intuition that's hard to fake.

Every team hiring for AI-engineering roles needs exactly one of you. Walk in able to talk about everything in this chapter and they will recognize the shape of person they've been looking for.


Appendix A-Quick reference: the gen_ai.* attributes

| Attribute | Type | Notes |
|---|---|---|
| gen_ai.system | string | Provider name |
| gen_ai.operation.name | string | chat, completion, embedding, tool_call |
| gen_ai.request.model | string | Requested model id |
| gen_ai.request.max_tokens | int | |
| gen_ai.request.temperature | double | |
| gen_ai.request.top_p | double | |
| gen_ai.request.top_k | int | |
| gen_ai.request.stop_sequences | string[] | |
| gen_ai.response.model | string | Actual serving model |
| gen_ai.response.id | string | Provider request id |
| gen_ai.response.finish_reasons | string[] | |
| gen_ai.usage.input_tokens | int | |
| gen_ai.usage.output_tokens | int | |
| gen_ai.usage.cache_read_input_tokens | int | Where supported |
| gen_ai.usage.cache_creation_input_tokens | int | Where supported |
| gen_ai.tool.name | string | On tool spans |
| gen_ai.tool.call.id | string | On tool spans |

Custom additions (suggested namespace app.llm.*):

| Attribute | Type | Notes |
|---|---|---|
| app.llm.feature | string | Application's feature name |
| app.llm.prompt.template.id | string | Prompt template identifier |
| app.llm.prompt.template.sha | string | Content hash of the prompt |
| app.llm.experiment.id | string | A/B test identifier |
| app.llm.experiment.variant | string | control / treatment |
| app.llm.tenant.id | string | Multi-tenant tenant id |
| app.llm.cost_usd | double | Computed cost |
| app.llm.conversation.id | string | Chat conversation id |
| app.llm.conversation.cumulative_tokens | int | Running token total |
| app.llm.budget_exceeded | bool | Token budget breached |

The conventions are evolving. Treat this table as a 2025-era snapshot; check the OpenTelemetry semantic conventions site before relying on it for anything load-bearing.

Appendix B-Metric catalog

| Metric | Type | Labels | Use |
|---|---|---|---|
| llm_requests_total | counter | provider, model, feature, status | Traffic + error rate |
| llm_input_tokens_total | counter | provider, model, feature | Cost driver |
| llm_output_tokens_total | counter | provider, model, feature | Cost driver |
| llm_cache_read_tokens_total | counter | provider, model, feature | Cache effectiveness |
| llm_cost_usd_total | counter | provider, model, feature, tenant | Cost SLI |
| llm_ttft_seconds | histogram | provider, model, feature | Latency SLI |
| llm_tpot_seconds | histogram | provider, model | Latency SLI |
| llm_total_latency_seconds | histogram | provider, model, feature | Latency SLI |
| llm_decode_tokens_per_second | histogram | provider, model | Provider health |
| llm_streaming_chunk_gap_seconds | histogram | provider, model | Provider health |
| llm_cache_hit_ratio | histogram | model, feature | Cost optimization |
| llm_validation_errors_total | counter | feature, error_class | Quality SLI |
| llm_guardrail_rejections_total | counter | feature, source | Quality SLI |
Every label set above is bounded-cardinality. None of them include user_id. That's the discipline.

Appendix C-The SRE-to-AI translation card

Print this. Stick it on your monitor. Refer back as you build.

| SRE concept | LLM equivalent |
|---|---|
| Request latency | TTFT, TPOT, total-three numbers |
| RPS | RPS + tokens-per-second |
| Error rate | API errors + validation errors + guardrail rejections |
| Saturation (CPU, mem) | Provider quota, concurrency, queue depth |
| Service binary version | Code version + prompt-template SHA |
| Replay from logs | Replay from trace + prompt store + pinned model |
| SLO availability | SLO success rate + cost SLO + quality SLO |
| Error budget (minutes) | Error budget (calls, dollars, quality) |
| Burn rate alerting | Same arithmetic, applied to cost & quality too |
| Trace = line | Trace = tree (depth 4+) |
| Cardinality discipline | Same-keep user_id out of metric labels |
| Post-mortem | Post-mortem + prompt diff + model version pin |

This card is the entire chapter, compressed. If on a given day you remember nothing else, remember the card.
