13-LLM Observability¶
Why this matters in the journey¶
This is your bridge sequence. Everything you know about distributed-systems observability (traces, metrics, logs, SLOs, alerting) extends here-but with a twist: the "spans" are LLM calls, the "errors" are hallucinations, the "latency budget" includes token economics, and the "drift" is model behavior change. Few engineers come from observability into AI; you'll be unusually credible here.
The rungs¶
Rung 01-Why LLM observability is different¶
- What: LLM systems are nondeterministic, expensive per call, latency-sensitive, and quality-sensitive. Traditional APM doesn't capture quality.
- Why it earns its place: Frame the gap before reaching for tools.
- Resource: Langfuse / LangSmith blogs on "what LLM observability is." Plus your own SLI/SLO instincts as a reference frame.
- Done when: You can list 3 things APM tools miss for LLM systems.
Rung 02-Tracing LLM calls¶
- What: Each LLM call is a span: model, prompt, response, tokens, latency. Multi-step pipelines (RAG, agents) are nested traces. (Sketch below.)
- Why it earns its place: A trace is the atomic unit of LLM debugging. Without it, you're flying blind.
- Resource: Langfuse docs (langfuse.com). LangSmith docs (smith.langchain.com).
- Done when: Your project emits traces with prompt, response, latency, and token counts visible per span.
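As a rough, vendor-neutral sketch of what a per-call span can capture, in plain Python: `call_llm` and the shape of its return value are hypothetical stand-ins for whatever client you actually use.

```python
import json
import time
import uuid

def traced_llm_call(call_llm, model: str, prompt: str, trace_id: str | None = None) -> dict:
    """Wrap one LLM call and record a span: model, prompt, response, tokens, latency."""
    span = {
        "trace_id": trace_id or str(uuid.uuid4()),  # share one trace_id across a multi-step pipeline
        "span_id": str(uuid.uuid4()),
        "model": model,
        "prompt": prompt,
    }
    start = time.perf_counter()
    response = call_llm(model=model, prompt=prompt)  # hypothetical client; adapt to yours
    span["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
    span["response"] = response["text"]
    span["input_tokens"] = response["usage"]["input_tokens"]
    span["output_tokens"] = response["usage"]["output_tokens"]
    print(json.dumps(span))  # stand-in for shipping the span to Langfuse / LangSmith / OTel
    return span
```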
Rung 03-Cost and token observability¶
- What: Per-call, per-feature, per-user, per-tenant token consumption. Cache hit rate. Cost projections. (Sketch below.)
- Why it earns its place: AI products live or die on unit economics. Token observability is the SLI of cost.
- Resource: LiteLLM's spend tracking. Plus Datadog / Grafana dashboards for tokens (you can build these directly).
- Done when: You have a dashboard showing tokens/day, cost/day, cache hit rate, and per-feature breakdown.
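A minimal sketch of per-feature token and cost accounting, assuming token counts come off each span; the price table is a placeholder to replace with your provider's current rates.

```python
from collections import defaultdict

# Placeholder prices in USD per 1M tokens; substitute your provider's real rates.
PRICE_PER_1M = {"gpt-4o": {"input": 2.50, "output": 10.00}}

# feature -> running totals (feed a dashboard from these)
totals = defaultdict(lambda: {"input": 0, "output": 0, "calls": 0, "cache_hits": 0, "cost_usd": 0.0})

def record(feature: str, model: str, input_tokens: int, output_tokens: int, cache_hit: bool) -> None:
    t = totals[feature]
    t["calls"] += 1
    if cache_hit:
        t["cache_hits"] += 1  # count the hit, skip the spend
        return
    t["input"] += input_tokens
    t["output"] += output_tokens
    p = PRICE_PER_1M[model]
    t["cost_usd"] += input_tokens * p["input"] / 1e6 + output_tokens * p["output"] / 1e6

record("summarize", "gpt-4o", 1200, 300, cache_hit=False)
record("summarize", "gpt-4o", 1200, 300, cache_hit=True)
for feature, t in totals.items():
    print(feature, f"cost=${t['cost_usd']:.4f}", f"cache_hit_rate={t['cache_hits'] / t['calls']:.0%}")
```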
Rung 04-Quality observability (online evals)¶
- What: Score a sampled fraction of production calls automatically (deterministic checks + judge). Track score over time. (Sketch below.)
- Why it earns its place: Latency and cost are easy. Quality is hard. A quality SLI is a Staff-level move.
- Resource: Langfuse evaluations docs. Plus Hamel's posts on online eval sampling.
- Done when: You have a quality SLI computed on production traffic with alerting on drops.
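A sketch of the sampling-plus-deterministic-checks idea, assuming a trace id already exists per call; the specific checks and the 5% sample rate are illustrative, and an LLM judge can slot in alongside `deterministic_score`.

```python
import random

SAMPLE_RATE = 0.05  # score roughly 5% of production traffic

def deterministic_score(prompt: str, response: str) -> float:
    """Cheap, deterministic checks; each is a 0/1 signal, averaged into a score."""
    checks = [
        len(response) > 0,                                    # non-empty
        len(response) < 4000,                                 # not runaway output
        "as an ai language model" not in response.lower(),    # no canned refusal boilerplate
    ]
    return sum(checks) / len(checks)

def maybe_score(trace_id: str, prompt: str, response: str) -> float | None:
    if random.random() > SAMPLE_RATE:
        return None  # not sampled
    score = deterministic_score(prompt, response)
    # In practice: attach the score to the trace and feed your quality SLI;
    # alert when the rolling average drops.
    print(f"trace={trace_id} quality_score={score:.2f}")
    return score
```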
Rung 05-OpenTelemetry GenAI semantic conventions¶
- What: OTel is standardizing semantic conventions for AI workloads-gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, etc. (Sketch below.)
- Why it earns its place: Vendor-neutral tracing for LLMs. Your observability moat shows up here. Adopt early.
- Resource: OpenTelemetry docs (opentelemetry.io)-search for "GenAI semantic conventions." Plus the active spec discussions on GitHub.
- Done when: Your project emits OTel traces using GenAI conventions.
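A minimal sketch using the OpenTelemetry Python SDK with `gen_ai.*` span attributes. The console exporter and hardcoded token counts are for illustration only, and because the GenAI conventions are still evolving, check the current spec for exact attribute names before standardizing on them.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Console exporter for illustration; swap in an OTLP exporter for a real collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-llm-service")

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... make the LLM call here ...
    span.set_attribute("gen_ai.usage.input_tokens", 512)   # values hardcoded for the sketch
    span.set_attribute("gen_ai.usage.output_tokens", 128)
```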
Rung 06-Agent observability¶
- What: Multi-step traces, tool-call success rates, trajectory analysis, replayability of failed runs. (Sketch below.)
- Why it earns its place: Agents are the hardest LLM workloads to debug; observability multiplies your debugging speed.
- Resource: LangSmith / Langfuse agent tracing tutorials.
- Done when: You can replay a failed agent run from traces alone.
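One way to make failed runs replayable, sketched in plain Python: record every tool call as a structured step in the trajectory. The `ToolCall`/`AgentRun` names are illustrative, not from any particular framework; tool-call success rate falls out of `steps` for free.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class ToolCall:
    step: int
    tool: str
    arguments: dict
    result: str | None
    success: bool
    latency_ms: float

@dataclass
class AgentRun:
    run_id: str
    goal: str
    steps: list[ToolCall] = field(default_factory=list)

    def record(self, tool: str, arguments: dict, fn) -> str | None:
        """Execute a tool and append the call to the trajectory, success or failure."""
        start = time.perf_counter()
        try:
            result, success = fn(**arguments), True
        except Exception as exc:
            result, success = f"ERROR: {exc}", False
        self.steps.append(ToolCall(len(self.steps), tool, arguments, result, success,
                                   (time.perf_counter() - start) * 1000))
        return result if success else None

    def dump(self) -> str:
        """Serialize the full trajectory: enough to replay a failed run offline."""
        return json.dumps(asdict(self), default=str)
```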
Rung 07-RAG observability¶
- What: Retrieval-specific signals: top-k results per query, retrieval scores, faithfulness, query patterns. (Sketch below.)
- Why it earns its place: RAG quality drift is invisible without retrieval observability.
- Resource: Same tools as rung 02 + RAGAS for online faithfulness.
- Done when: You can see, per query in production, what was retrieved and what was generated.
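A sketch of per-query retrieval logging, assuming a hypothetical `retriever` that returns scored chunks and a `generate` function; in practice these records attach to the same trace as the generation span rather than going to stdout.

```python
import json

def observed_rag_query(query: str, retriever, generate) -> dict:
    """Log what was retrieved alongside what was generated, per query."""
    hits = retriever(query, k=5)  # hypothetical: returns [{"doc_id": ..., "score": ..., "text": ...}]
    answer = generate(query, [h["text"] for h in hits])
    record = {
        "query": query,
        "retrieved": [{"doc_id": h["doc_id"], "score": h["score"]} for h in hits],
        "top_score": max((h["score"] for h in hits), default=None),
        "answer": answer,
        # Faithfulness (e.g. RAGAS) can be scored offline on a sample of these records.
    }
    print(json.dumps(record))  # stand-in for attaching the record to the trace
    return record
```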
Rung 08-User feedback and signal collection¶
- What: Thumbs up/down, free-text feedback, implicit signals (regenerate, copy, abandon). Pipe to traces. (Sketch below.)
- Why it earns its place: Real user signal closes the loop; without it, you optimize against your own opinion.
- Resource: Langfuse feedback docs. Plus Eugene Yan's posts on feedback systems.
- Done when: Your traces are linkable to user feedback.
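A sketch of feedback capture keyed by trace id, which is the join key that makes feedback useful downstream; the event kinds and field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    trace_id: str            # the same id the tracer emitted for this response
    kind: str                # "thumbs_up", "thumbs_down", "regenerate", "copy", "abandon"
    comment: str | None = None

feedback_log: list[dict] = []

def record_feedback(trace_id: str, kind: str, comment: str | None = None) -> None:
    event = FeedbackEvent(trace_id, kind, comment)
    feedback_log.append({**event.__dict__, "ts": datetime.now(timezone.utc).isoformat()})
    # In practice: write to your feedback store or scores API so dashboards can
    # join feedback to traces on trace_id.

record_feedback("trace-1234", "thumbs_down", "answer cited the wrong doc")
```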
Rung 09-Drift detection¶
- What: Distribution shift in inputs (new query patterns), outputs (new failure shapes), or quality scores. Alert on shifts. (Sketch below.)
- Why it earns its place: Models, prompts, and providers change underneath you. Drift detection is the safety net.
- Resource: Arize, WhyLabs, or Fiddler ML monitoring tools-pick one to study, even if you don't use it.
- Done when: You can articulate a drift detection scheme for your project's outputs.
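One concrete scheme, sketched below: compute the Population Stability Index (PSI) between a baseline window and the current window of a bounded signal such as your quality score. The binning and the thresholds in the closing comment are conventional rules of thumb, not hard standards.

```python
import math

def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of a bounded score (e.g. 0..1)."""
    lo = min(min(baseline), min(current))
    hi = max(max(baseline), max(current))
    width = (hi - lo) / bins or 1.0

    def dist(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # floor each bucket to avoid log(0) / division by zero
        return [max(c / len(xs), 1e-4) for c in counts]

    b, c = dist(baseline), dist(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 alert-worthy drift.
```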
Rung 10-Privacy, redaction, and PII handling¶
- What: Prompts and responses often contain PII. Redact before logging or use a redaction-aware tracer. (Sketch below.)
- Why it earns its place: Compliance failures kill products. This is non-optional in regulated industries.
- Resource: Langfuse's redaction features. Plus Microsoft Presidio for PII detection.
- Done when: Your traces redact PII automatically.
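A sketch of redact-before-logging with regex patterns; the patterns are deliberately simplistic and illustrative only. A detector such as Presidio, or your tracer's built-in redaction, is the production-grade option.

```python
import re

# Illustrative patterns only; real PII detection needs a proper library (e.g. Presidio).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,3}[ .-]?)?(?:\(?\d{3}\)?[ .-]?)\d{3}[ .-]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII with a typed placeholder before the text ever hits a trace."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}_REDACTED>", text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567 about the claim."))
# -> Contact <EMAIL_REDACTED> or <PHONE_REDACTED> about the claim.
```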
Rung 11-Connecting LLM observability to your existing stack¶
- What: Datadog, Grafana, Prometheus already exist at most companies. Bridging LLM signals into them is the practical move. (Sketch below.)
- Why it earns its place: Your observability skills literally transfer here. Nobody else on the team will be as fluent.
- Resource: Datadog's LLM observability product docs. Plus Grafana's LLM dashboards.
- Done when: Your LLM project has dashboards in your team's existing observability stack.
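A sketch using `prometheus_client` to expose token and latency signals as ordinary Prometheus metrics, so the Grafana dashboards and alert rules your team already has can consume them; metric and label names are illustrative.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; follow whatever naming convention your team already uses.
LLM_TOKENS = Counter(
    "llm_tokens_total", "LLM tokens consumed", ["model", "feature", "direction"]
)
LLM_LATENCY = Histogram(
    "llm_request_latency_seconds", "LLM call latency", ["model", "feature"]
)

def observe_call(model: str, feature: str, input_tokens: int, output_tokens: int, latency_s: float) -> None:
    LLM_TOKENS.labels(model=model, feature=feature, direction="input").inc(input_tokens)
    LLM_TOKENS.labels(model=model, feature=feature, direction="output").inc(output_tokens)
    LLM_LATENCY.labels(model=model, feature=feature).observe(latency_s)

if __name__ == "__main__":
    start_http_server(9200)  # scrape target for your existing Prometheus
    observe_call("gpt-4o", "summarize", 1200, 300, 1.8)
```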
Minimum required to leave this sequence¶
- Project emits traces with prompt, response, latency, and token counts.
- Cost dashboard with cache hit rate and per-feature breakdown.
- Online quality eval running on a sampled subset of traffic.
- OTel GenAI conventions adopted.
- User feedback linked to traces.
- PII redaction in place.
Going further¶
- OpenTelemetry GenAI working group discussions and PRs-contribute.
- Datadog's LLM Observability product-study the design.
- Eugene Yan's "ML monitoring" archive-the foundations are excellent.
How this sequence connects to the year¶
- Month 6: Rungs 01–04 are core to the eval and observability work that makes month 6's anchor real.
- Q3: Bridge sequence for any track. Your moat sequence.
- Q4 blog post: "LLM observability for engineers who already know observability"-most leveraged piece you'll write.