Workshop - Agent observability with OpenTelemetry

Difficulty: Deep | Time: 75 min

Needs: Python 3.11+, Docker (for Jaeger + Langfuse), opentelemetry-* packages, an Anthropic API key

Before you start:

Built a tool-use loop without a framework
Comfortable with Python decorators and context managers
Familiar with the idea of distributed tracing (span / parent / child)

Launch in Killercoda: free browser-based environment, no install required to follow along.

Companion to AI Systems -> Appendix A: Hardening and Observability, and the ninth in the AI implementations workshop series. The first eight workshops shipped working AI features. This workshop is about being able to see what they're doing in production - every LLM call, every tool execution, every agent turn, with their token counts, latency, errors, and cost - in a queryable trace UI. By the end you'll have instrumented the Workshop 4 agent with OpenTelemetry, viewed the trace in Jaeger and Langfuse, and you'll be able to diagnose a slow or expensive agent run in minutes instead of "hours of grepping logs."

~75 minutes. Needs: Python 3.11+, Docker (for Jaeger or Langfuse local), the opentelemetry-* packages, an Anthropic API key. Optional: Langfuse cloud account (free tier) for hosted traces. No GPU.

What you'll build, and the idea it makes concrete

You'll instrument the Workshop 4 agent end-to-end with OpenTelemetry using the GenAI semantic conventions (the OTel schema for LLM telemetry, published in 2024 and still marked as in development at the time of writing, so attribute names can change between releases). Every LLM call becomes a span with standard attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens). Every tool call becomes a child span with tool.name, tool.arguments, and an error status; its duration comes from the span's own timestamps. You'll send the traces to a local Jaeger (open source) and a Langfuse (LLM-specific) instance running in Docker, browse the resulting trace UI, and add custom attributes for the things the standard doesn't cover (cost in USD, agent turn number).

The idea this makes concrete:

The right unit of observability for an AI system is the agent turn, not the HTTP request. A user-facing AI feature handles one request that might span a dozen LLM calls, two dozen tool invocations, three model providers, and 30 seconds of wall-clock time. Logging "POST /chat - 200" is useless. What you need is a tree of spans - the root span is the user request; every LLM call and tool call is a child or grandchild; attributes carry the model, tokens, cost, errors. With that tree in a trace UI, "why was this request slow?" or "why did this cost $2?" becomes a 30-second click-through. Without it, the same question is "open the logs and grep until your eyes bleed."

A second idea, equally important:

Pick the LLM-specific layer on top of the generic backbone, not instead of it. Generic OpenTelemetry gives you spans, attributes, and the trace data model that integrates with everything else your stack already has (your service-level traces, your database query traces, your message-queue traces). LLM-specific tools (Langfuse, Phoenix, Helicone, Weights & Biases Traces) add LLM-aware UI on top: side-by-side prompt vs. completion view, token-cost breakdowns, prompt-version diffs, eval result overlays. Use both. OTel is the wire format; the LLM-specific tool is the UI for the LLM-shaped questions.

Step 0: the architecture you're about to assemble

+----------------------+
|   YOUR AGENT CODE    |
|                      |
|  tracer.start_span ("agent.turn")
|    |- gen_ai.system="anthropic"
|    |- gen_ai.request.model="claude-sonnet-4-6"
|    |
|    +-- start_span ("gen_ai.completion")
|    |     attrs: input_tokens, output_tokens, stop_reason
|    |     events: prompt content, completion content
|    |
|    +-- start_span ("tool.execution")
|    |     attrs: tool.name="query_db", tool.duration_ms=83
|    |
|    +-- ...
|                      |
+----------+-----------+
           |
           | OTLP (OpenTelemetry Protocol) gRPC or HTTP
           v
+---------------------------+      +---------------------------+
|     OTEL COLLECTOR        |      |   LANGFUSE / PHOENIX      |
|  (optional but recommended)|      |   (LLM-specific UI)      |
|  - batches, retries,       |      |  - prompt vs completion  |
|    samples, splits        |      |  - cost breakdowns       |
+----+----------------------+      |  - eval overlays         |
     |                              +---------------------------+
     +--> +---------------------------+
          |          JAEGER           |
          |  (generic trace UI)       |
          |  - timeline view          |
          |  - service map            |
          |  - search by attribute    |
          +---------------------------+

Two things to notice:

The collector is optional but pays for itself. Direct-to-backend is fine for development; in production the collector buffers spans during backend outages, samples to reduce volume, and lets you route to multiple backends without touching application code.
Generic and LLM-specific backends coexist. Jaeger shows you "this agent turn was slow because the third tool call took 8 seconds." Langfuse shows you "the model used the wrong tool for that prompt - here are the two paths side by side." Different questions, different UIs.

Step 1: install and instrument with the GenAI semantic conventions

Install the dependencies:

$ pip install opentelemetry-api opentelemetry-sdk \
              opentelemetry-exporter-otlp \
              opentelemetry-instrumentation-httpx \
              httpx

The OpenTelemetry GenAI semantic conventions (gen_ai.* attribute namespace) were published in 2024 and are supported by most LLM tracing tools, although the specification still marks them as in development, so check which version your backend expects. They define standard names for the things every LLM trace needs: gen_ai.system (the provider), gen_ai.request.model (the model id), gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons. Use these names exactly so any downstream tool understands your traces without custom configuration.

Set up the tracer:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "my-ai-app",
    "service.version": "1.0.0",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

That's the foundation: a tracer that exports to a local OTLP endpoint on port 4317 (the standard OTLP gRPC port). In Step 5 that endpoint is Jaeger; in production it would typically be an OTel collector.

Step 2: wrap the LLM call in a span

Take the agent kernel from Workshop 4. Wrap the call_anthropic function:

import httpx
from opentelemetry.trace import SpanKind, Status, StatusCode

# MODEL, API_URL, HEADERS, and TOOLS come from the Workshop 4 agent kernel.

def call_anthropic(messages, tools=None):
    """Call the Messages API inside a CLIENT span.

    `tools` is a registry dict shaped like TOOLS (name -> {"schema", "fn"}).
    """
    with tracer.start_as_current_span(
        "gen_ai.completion",
        kind=SpanKind.CLIENT,
        attributes={
            "gen_ai.system": "anthropic",
            "gen_ai.request.model": MODEL,
            "gen_ai.request.max_tokens": 2048,
            # Custom attribute, not part of the GenAI conventions.
            "gen_ai.request.messages.count": len(messages),
        },
    ) as span:
        try:
            payload = {
                "model": MODEL,
                "max_tokens": 2048,
                "messages": messages,
            }
            # Only include "tools" when there are tools, rather than sending null.
            if tools:
                payload["tools"] = [t["schema"] for t in tools.values()]

            http_resp = httpx.post(API_URL, headers=HEADERS, timeout=60, json=payload)
            # Raise on 4xx/5xx so rate limits and auth failures mark the span as failed.
            http_resp.raise_for_status()
            resp = http_resp.json()

            # Standard GenAI attributes for the response side.
            usage = resp.get("usage", {})
            span.set_attributes({
                "gen_ai.response.id": resp.get("id", ""),
                "gen_ai.response.model": resp.get("model", ""),
                "gen_ai.response.finish_reasons": [resp.get("stop_reason") or ""],
                "gen_ai.usage.input_tokens": usage.get("input_tokens", 0),
                "gen_ai.usage.output_tokens": usage.get("output_tokens", 0),
            })
            # A custom attribute for cost - the standard doesn't cover this.
            # The rates are hardcoded for illustration; section 6.4 explains why that ages badly.
            cost = (usage.get("input_tokens", 0) * 3e-6
                  + usage.get("output_tokens", 0) * 15e-6)
            span.set_attribute("llm.cost.usd", round(cost, 6))
            return resp
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise

Three things this gets right:

Span kind is CLIENT - this span represents calling an external service. Trace backends use the span kind to tell outbound calls from inbound ones, which is what service maps and dependency graphs are built from.
Attributes use the standard names. Any LLM-tracing tool can plug into this without custom mapping.
Errors are recorded. A failed API call sets the span status to ERROR and attaches the exception. The trace UI shows the failed span in red.

Step 3: wrap tool calls in their own spans

Tools are the second half of agent observability. Wrap your tool dispatcher:

import json

from opentelemetry.trace import Status, StatusCode

def run_tool(name: str, args: dict) -> str:
    with tracer.start_as_current_span(
        f"tool.execute.{name}",
        attributes={
            "tool.name": name,
            "tool.arguments": json.dumps(args)[:1000],  # truncate big args
        },
    ) as span:
        try:
            # Look the tool up inside the try so an unknown tool name is reported
            # back to the model instead of crashing the agent loop.
            fn = TOOLS[name]["fn"]
            result = fn(**args)
            output = json.dumps(result) if not isinstance(result, str) else result
            span.set_attribute("tool.result_length", len(output))
            return output
        except Exception as e:
            # Mark the span as failed, but return the error text so the model can react.
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            return f"Tool {name} raised {type(e).__name__}: {e}"

Now every tool execution shows up as a span with the tool's name, the arguments it was called with, and the duration measured automatically by OTel. The arguments are truncated to 1000 chars - logging the full thing for huge inputs creates trace bloat; you can lift the cap or sample for richer debugging.
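A bare slice hides the fact that anything was cut. Here is a minimal sketch of a truncation helper that records how much was dropped. The name truncate_attr and the marker format are assumptions for illustration, not part of OpenTelemetry.

```python
import json

def truncate_attr(value: str, limit: int = 1000) -> str:
    """Cap a string attribute and note how many characters were cut."""
    if len(value) <= limit:
        return value
    dropped = len(value) - limit
    return f"{value[:limit]}... [truncated {dropped} chars]"

# Small payloads pass through untouched.
args = {"query": "SELECT * FROM orders WHERE id = 42"}
small = truncate_attr(json.dumps(args))

# Large payloads keep the first `limit` characters plus a marker.
big = truncate_attr("x" * 1500, limit=1000)
```

In run_tool, this would replace the bare slice, for example "tool.arguments": truncate_attr(json.dumps(args)).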

Step 4: wrap the whole agent turn as the root span

The outer agent loop becomes the root span - all the LLM and tool spans are children:

def agent(user_message: str, max_turns: int = 10) -> str:
    with tracer.start_as_current_span(
        "agent.run",
        attributes={
            "agent.user_message": user_message[:500],
            "agent.max_turns": max_turns,
        },
    ) as root_span:
        messages = [{"role": "user", "content": user_message}]
        total_cost = 0.0
        for turn in range(max_turns):
            with tracer.start_as_current_span(
                "agent.turn",
                attributes={"agent.turn_number": turn},
            ):
                # Pass the tool registry so the model can actually request tools.
                resp = call_anthropic(messages, tools=TOOLS)
                messages.append({"role": "assistant", "content": resp["content"]})

                # Roll the per-call cost up to the root span (custom attribute,
                # same illustrative rates as call_anthropic).
                usage = resp.get("usage", {})
                total_cost += (usage.get("input_tokens", 0) * 3e-6
                             + usage.get("output_tokens", 0) * 15e-6)
                root_span.set_attribute("agent.total_cost_usd", round(total_cost, 6))

                if resp["stop_reason"] != "tool_use":
                    answer = "".join(b["text"] for b in resp["content"]
                                     if b["type"] == "text")
                    root_span.set_attribute("agent.final_answer_length", len(answer))
                    root_span.set_attribute("agent.turns_used", turn + 1)
                    return answer

                results = []
                for b in resp["content"]:
                    if b["type"] == "tool_use":
                        output = run_tool(b["name"], b["input"])
                        results.append({
                            "type": "tool_result",
                            "tool_use_id": b["id"],
                            "content": output,
                        })
                messages.append({"role": "user", "content": results})

        root_span.set_attribute("agent.hit_max_turns", True)
        return "(agent hit max turns)"

The trace tree now looks like:

agent.run                                            [duration: 12s, cost: $0.015]
 |-- agent.turn (turn=0)                             [duration: 2.3s]
 |    |-- gen_ai.completion                          [duration: 1.8s, tokens: 250/130]
 |    +-- tool.execute.query_database                [duration: 0.5s]
 |-- agent.turn (turn=1)                             [duration: 3.1s]
 |    |-- gen_ai.completion                          [duration: 2.6s, tokens: 480/180]
 |    +-- tool.execute.search_logs                   [duration: 0.5s]
 +-- agent.turn (turn=2)                             [duration: 6.6s]
      +-- gen_ai.completion                          [duration: 6.6s, tokens: 750/420]

Three turns, two tool calls, one final response. Anyone reading the trace can see the cost (about $0.015 at the rates hardcoded in Step 2), the bottleneck (the final turn's 6.6-second generation), and the work decomposition without reading code.
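The arithmetic behind a run total can be checked by hand. The sketch below sums the per-span token counts listed in the tree and prices them at the rates hardcoded in Step 2. The tuple layout and the rollup function are illustrative assumptions, not part of any OTel API.

```python
# (span label, duration in seconds, input tokens, output tokens), copied from the tree.
LLM_SPANS = [
    ("turn 0 gen_ai.completion", 1.8, 250, 130),
    ("turn 1 gen_ai.completion", 2.6, 480, 180),
    ("turn 2 gen_ai.completion", 6.6, 750, 420),
]

INPUT_RATE = 3e-6    # USD per input token, as hardcoded in Step 2
OUTPUT_RATE = 15e-6  # USD per output token, as hardcoded in Step 2

def rollup(spans):
    """Return total tokens, total cost, and the label of the slowest LLM span."""
    total_in = sum(s[2] for s in spans)
    total_out = sum(s[3] for s in spans)
    cost = total_in * INPUT_RATE + total_out * OUTPUT_RATE
    slowest = max(spans, key=lambda s: s[1])
    return total_in, total_out, round(cost, 6), slowest[0]

total_in, total_out, cost_usd, slowest = rollup(LLM_SPANS)
print(f"{total_in} in / {total_out} out, ${cost_usd:.4f}, slowest: {slowest}")
```

That is 1480 input tokens and 730 output tokens, or roughly a cent and a half for the run. A trace backend answers the same two questions (what did it cost, where did the time go) by aggregating span attributes in much the same way.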

Step 5: send traces to Jaeger and Langfuse

Run Jaeger locally with Docker:

$ docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest

Jaeger now accepts OTLP traces on port 4317 (which is where the exporter from step 1 sends them) and serves the UI on port 16686. Open http://localhost:16686, select your service from the dropdown, click "Find Traces" - your agent runs show up as span timelines you can drill into.

For LLM-specific UX, run Langfuse:

$ docker run -d --name langfuse \
  -e DATABASE_URL=postgresql://... \
  -e NEXTAUTH_SECRET=...  \
  -p 3000:3000 \
  langfuse/langfuse:latest

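(In practice you'd use the docker-compose file Langfuse provides; the above is the kernel.) Langfuse's OpenTelemetry support ingests the same OTLP traces, and its UI is built for LLM data - prompt and completion side-by-side, token-cost breakdowns, the ability to mark spans as "good" or "bad" examples for eval datasets. One caveat: at the time of writing Langfuse accepts OTLP over HTTP rather than gRPC and authenticates with its API keys, so the gRPC exporter from Step 1 cannot point at it directly; use an OTLP/HTTP exporter or route through a collector.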

You now have two trace UIs looking at the same data, optimized for two different questions: "what's slow / failing?" (Jaeger) and "what did the model say / spend?" (Langfuse).

Step 6: break it (the things that go wrong in production observability)

6.1 The spans that span requests

If a request triggers a background job that emits spans 5 minutes later, those spans don't naturally connect to the original request's trace. OTel solves this with trace context propagation - the trace ID and span ID travel from one service to another via HTTP headers (traceparent) or message-queue attributes. For an agent that spawns async work, you have to capture the current span context when scheduling the work and restore it when the work runs:

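from opentelemetry.propagate import inject, extract

# `queue`, `message`, and `do_work` are placeholders for your own job system.

# At scheduling time (inside the request's active span):
carrier = {}
inject(carrier)  # writes the W3C traceparent header into the dict
queue.publish({"task": ..., "trace_carrier": carrier})

# At execution time (in the worker, possibly minutes later):
ctx = extract(message["trace_carrier"])
# The new span becomes a child of the span that scheduled the work.
with tracer.start_as_current_span("background.task", context=ctx):
    do_work()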

Production multi-service AI systems live or die by trace propagation. Get it right early.
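To make the carrier less of a black box, here is a standard-library sketch of the W3C traceparent header that inject writes: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and a flags byte. The helper names and the "reindex" task are hypothetical. In real code, keep using inject and extract rather than parsing the header yourself.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent header value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting malformed input."""
    match = _TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    _version, trace_id, parent_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }

# Scheduling side: the IDs of the span that enqueues the work (example values).
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
span_id = "00f067aa0ba902b7"
message = {
    "task": "reindex",  # hypothetical task name
    "trace_carrier": {"traceparent": make_traceparent(trace_id, span_id)},
}

# Worker side: recover the parent so the background span joins the same trace.
parent = parse_traceparent(message["trace_carrier"]["traceparent"])
```

The point of the exercise: the carrier is a small dictionary of strings, so it survives any queue that can carry JSON.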

6.2 The trace bloat problem

Suppose you log the full prompt and completion on every span, about 5KB per request in total. At 100k requests/day, that's 500MB/day of trace storage, mostly redundant. Mitigations: (1) truncate large attributes at the source (the workshop code uses 500-char and 1000-char caps); (2) sample - keep 100% of errors and slow traces, 10% of normal ones (the OTel Collector's tail_sampling processor does this); (3) move the full prompt/completion to a separate object store keyed by trace ID, and only put the link in the span.
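The sampling rule in (2) is easy to state as code. This is a sketch of the decision logic only, assuming a 5-second "slow" threshold and a 10% baseline; in a real deployment the collector makes this call, not your application. Hashing the trace ID keeps the verdict identical for every span in a trace.

```python
import hashlib

def keep_trace(trace_id: str, had_error: bool, duration_s: float,
               slow_threshold_s: float = 5.0, baseline_rate: float = 0.10) -> bool:
    """Keep every failed or slow trace, and a fixed fraction of the rest."""
    if had_error or duration_s >= slow_threshold_s:
        return True
    # Map the trace ID to a stable number in [0, 1) so the decision is repeatable.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate

# Errors and slow runs always survive; ordinary runs are thinned to about 10%.
kept = sum(keep_trace(f"trace-{i}", had_error=False, duration_s=1.0)
           for i in range(10_000))
print(f"kept {kept} of 10000 ordinary traces")
```

The decision needs the whole trace (its final status and duration), which is why it happens after the fact in the collector rather than at span creation time.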

6.3 The PII problem

Your traces contain user prompts. User prompts contain PII (names, emails, addresses, sometimes worse). Your trace storage is now a PII repository whose compliance posture you have to manage. Defenses: redact at the application layer before tracing (regex for emails / phone numbers / SSNs); strip user-content attributes from spans before they are exported; route PII-containing traces to a stricter storage tier.
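A minimal sketch of application-layer redaction, assuming US-style phone numbers and SSNs. The patterns are deliberately simple and will miss many real-world formats, so treat them as a starting point rather than a compliance control.

```python
import re

# (placeholder, pattern) pairs, applied in order.
_PATTERNS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
]

def redact(text: str) -> str:
    """Replace obvious emails, SSNs, and phone numbers with placeholders."""
    for placeholder, pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

prompt = "Email jane.doe@example.com or call 555-867-5309. SSN 123-45-6789."
safe_prompt = redact(prompt)
print(safe_prompt)
```

Call redact on anything user-supplied before it becomes a span attribute, for example "agent.user_message": redact(user_message)[:500] in the Step 4 root span.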

6.4 The cost-attribution mismatch

Anthropic and OpenAI bill on tokens. Your traces show tokens. But the price per token changes per model and over time. Hardcoding 3e-6 and 15e-6 in your code (as the workshop does) ages badly. Production fix: maintain a pricing table outside the code, updated when providers change prices; compute cost at query time in your observability backend rather than baking it into the span at write time.
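One way to sketch the pricing-table idea is a list of dated rates looked up by model and by when the call happened. Everything in the table below is hypothetical: the model name, the dates, and the second row's price change are made up to show the lookup, not real provider prices.

```python
from datetime import date

# Hypothetical rows: (model, effective from, USD per input token, USD per output token).
PRICING = [
    ("example-model", date(2024, 1, 1), 3e-6, 15e-6),
    ("example-model", date(2025, 1, 1), 2e-6, 10e-6),  # made-up price cut
]

def price_for(model: str, on: date) -> tuple[float, float]:
    """Return the rates in force for `model` on the given day."""
    rows = [r for r in PRICING if r[0] == model and r[1] <= on]
    if not rows:
        raise LookupError(f"no price for {model} on {on}")
    _, _, in_rate, out_rate = max(rows, key=lambda r: r[1])
    return in_rate, out_rate

def cost_usd(model: str, on: date, input_tokens: int, output_tokens: int) -> float:
    """Price a call from its token counts, using the rates valid on that day."""
    in_rate, out_rate = price_for(model, on)
    return round(input_tokens * in_rate + output_tokens * out_rate, 6)

# The same token counts cost different amounts before and after the change.
old = cost_usd("example-model", date(2024, 6, 1), 1000, 500)
new = cost_usd("example-model", date(2025, 6, 1), 1000, 500)
```

Because the span stores only the model and token counts, you can correct a wrong price later by editing the table, with no need to rewrite historical traces.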

Step 7: production patterns

The things real production AI observability does that the workshop code doesn't yet:

Eval result overlays. When you score an agent run with whatifd or any other eval system, attach the eval result as a span attribute (eval.score.faithfulness: 0.87) or as a span event. Trace UIs that understand the convention can color failed-eval traces differently.
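Prompt versioning. Every LLM span carries a prompt identifier and version, for example gen_ai.request.prompt_id and prompt_version (custom attribute names, not ones the GenAI conventions define). When you change a prompt and the eval scores drop, you can find every trace that used the new prompt and inspect what went wrong.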
A/B routing. Production agents often route requests to different models or different prompts based on heuristics. Tag spans with experiment.id and experiment.variant so you can slice metrics by experiment arm without re-running anything.
Span links across calls. When the agent re-runs because the user clicked "regenerate," link the new trace to the old one with a span link (trace.Link in the Python API). The UI can now show "this was the third attempt at the same question."
OpenTelemetry GenAI events (not just attributes). The standard defines gen_ai.user.message and gen_ai.assistant.message as span events (timestamped records inside a span), separate from attributes. Modern tooling expects these for richer prompt-completion display.
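Eval scores, prompt versions, and experiment tags usually arrive as nested metadata, while span attributes are flat key-value pairs whose values must be strings, booleans, numbers, or sequences of those. Here is a small sketch of a flattening helper. The function name and the sample metadata are assumptions for illustration.

```python
def to_span_attributes(data: dict, prefix: str = "") -> dict:
    """Flatten nested metadata into dotted keys with attribute-safe values."""
    flat = {}
    for key, value in data.items():
        name = f"{prefix}.{key}" if prefix else key
        if value is None:
            continue  # skip missing values; None is not a valid attribute value
        if isinstance(value, dict):
            flat.update(to_span_attributes(value, name))
        elif isinstance(value, (str, bool, int, float)):
            flat[name] = value
        else:
            flat[name] = str(value)  # fall back to a string for anything else
    return flat

run_metadata = {
    "eval": {"score": {"faithfulness": 0.87}},
    "experiment": {"id": "exp-42", "variant": "B"},  # hypothetical experiment
    "prompt": {"version": None},                      # unknown, so it is dropped
}
attrs = to_span_attributes(run_metadata)
```

The resulting dictionary can be passed straight to span.set_attributes(attrs), giving keys such as eval.score.faithfulness and experiment.variant that you can filter on in the trace UI.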

Step 8: hosted vs self-hosted

Three viable production stacks:

Self-hosted OTel + Langfuse: maximum control, you own the data, costs you operational complexity. Right for regulated workloads and large teams.
Hosted Langfuse / Phoenix / Helicone: trade per-trace cost (or seat cost) for zero ops. Right for most teams under 1M traces/month.
Datadog / New Relic / Honeycomb with GenAI plugins: if you already run one of these for your service infrastructure, their LLM features are good enough and the unified pane saves switching costs. This is often the pragmatic choice in shops that have already standardized on one of them.

There is no objectively right answer. Pick by the question "what's the largest source of friction not having this fixes?" - usually that points to one of the three immediately.

Now extend it

Add metrics, not just traces. Use the OpenTelemetry metrics API (opentelemetry.metrics, shipped in the same opentelemetry-api and opentelemetry-sdk packages) to emit counters (llm.requests.count, llm.errors.count), histograms (llm.latency.ms, llm.tokens.input), and gauges (agent.active_turns). Metrics aggregate better than traces for dashboard panels.
Auto-instrument with the Anthropic SDK plugin. The community-maintained opentelemetry-instrumentation-anthropic package wraps every call made through the Anthropic Python SDK automatically (the workshop code calls the API with raw httpx, so you would switch to the SDK first). Compare its spans to your hand-rolled ones; usually the auto-instrumentation is good enough.
Wire whatifd's eval results into traces. When the regression check fails, surface the failing span IDs in the CI output. Now a regressed eval points you directly at a trace you can debug.
Add a "trace summary" tool the agent can call. When the agent finishes a long run, have it call a summarize-trace tool that pulls the current trace from the local trace store and writes a human-readable summary. Self-explaining agent runs.
Stream traces. For very long agent runs, OTel can stream spans as they close rather than batching. Production debugging of stuck agents is dramatically better when you can see the live span tree update.

What you might wonder

"Why OpenTelemetry instead of writing my own JSON log lines?" Three reasons. (1) Standard attribute names (gen_ai.*) mean every tool understands your data without custom mapping. (2) Sampling, batching, retries, and propagation are solved problems in the OTel collector that you don't want to rebuild. (3) Your AI traces integrate with the rest of your stack's traces - service-level, database-query, message-queue - in one unified view. JSON logs do none of those well.

"How much overhead does OTel add?" Trace export is async and batched; typical overhead is <1% of latency and ~5MB of memory per process. The visible "slow" cases are usually misuse - e.g., synchronous export on every span, or attributes that are megabyte-sized JSON. Use the batch processor (the workshop code does), cap attribute sizes, and you won't notice it.

"What's the relationship between OTel and Langfuse / LangSmith / Phoenix?" Langfuse and Phoenix natively ingest OTLP - you point your exporter at them and they understand the GenAI semantic conventions. LangSmith uses its own SDK historically but is converging on OTel. Helicone is a proxy-based model with a different mechanism (it observes API calls in transit) and complements OTel rather than replacing it. The OTel ecosystem is where the industry is going; adopt accordingly.

"How do I add traces to existing code without refactoring it?" Two paths. (1) Auto-instrumentation: install opentelemetry-instrumentation-* packages for your HTTP client, DB driver, and LLM SDK; they wrap function calls without code changes. (2) Decorator-based: @tracer.start_as_current_span("name") as a decorator wraps any function. For LLM SDKs, the community-maintained instrumentation packages are usually drop-in.

"What's the right span granularity?" One root span per user-facing request. One child span per external service call (LLM, DB, tool, HTTP). One grandchild span per discrete operation within those (a single retry attempt, a parse step). Don't span inside tight loops - 10,000 spans for 10,000 vector comparisons is the bad path. The general test: if the span isn't useful to look at in the UI, don't create it.

What this gave you

You instrumented an agent with OpenTelemetry using the GenAI semantic conventions (the industry-standard attribute names).
You wrapped LLM calls, tool calls, and agent turns each as spans, producing a tree that mirrors the work.
You ran Jaeger and Langfuse locally and viewed the same traces in two complementary UIs.
You know the four production-breaking gotchas (cross-request propagation, trace bloat, PII, cost-attribution drift) and how to defend against each.
You can articulate the right granularity and the production patterns (eval overlays, prompt versioning, A/B routing) that follow.

The bigger transfer: an AI feature without observability is a black box; the same feature with one trace tree per request is a queryable system. The difference is the gap between "we think it's working" and "we know it's working." Get there early.

Next: Workshop 10 - Prompt-injection defenses, where you stress-test agents against the OWASP #1 LLM risk and layer in the defenses that actually work.

Submit your build

When you finish this workshop, share what you built so others can see and learn from your work. Include:

Public repo with the OTel-instrumented agent
Screenshot of a real agent run in Jaeger with all spans (root + LLM + tools)
Screenshot of the same trace in Langfuse showing prompt-completion view
Short note (3 to 5 sentences) on which production-breaking gotcha you found most relevant to your work
