13-LLM Observability¶
Why this matters in the journey¶
This is your bridge sequence. Everything you know about distributed-systems observability (traces, metrics, logs, SLOs, alerting) extends here-but with a twist: the "spans" are LLM calls, the "errors" are hallucinations, the "latency budget" includes token economics, and the "drift" is model behavior change. Few engineers come from observability into AI; you'll be unusually credible here.
The rungs¶
Rung 01-Why LLM observability is different¶
- What: LLM systems are nondeterministic, expensive per call, latency-sensitive, and quality-sensitive. Traditional APM doesn't capture quality.
- Why it earns its place: Frame the gap before reaching for tools.
- Resource: Langfuse / LangSmith blogs on "what LLM observability is." Plus your own SLI/SLO instincts as a reference frame.
- Done when: You can list 3 things APM tools miss for LLM systems.
Rung 02-Tracing LLM calls¶
- What: Each LLM call is a span: model, prompt, response, tokens, latency. Multi-step pipelines (RAG, agents) are nested traces. (Sketch below.)
- Why it earns its place: A trace is the atomic unit of LLM debugging. Without it, you're flying blind.
- Resource: Langfuse docs (langfuse.com). LangSmith docs (smith.langchain.com).
- Done when: Your project emits traces with prompt, response, latency, and token counts visible per span.
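As a rough, vendor-neutral sketch of what a per-call span can capture, in plain Python: `call_llm` and the shape of its return value are hypothetical stand-ins for whatever client you actually use.

```python
import json
import time
import uuid

def traced_llm_call(call_llm, model: str, prompt: str, trace_id: str | None = None) -> dict:
    """Wrap one LLM call and record a span: model, prompt, response, tokens, latency."""
    span = {
        "trace_id": trace_id or str(uuid.uuid4()),  # share one trace_id across a multi-step pipeline
        "span_id": str(uuid.uuid4()),
        "model": model,
        "prompt": prompt,
    }
    start = time.perf_counter()
    response = call_llm(model=model, prompt=prompt)  # hypothetical client; adapt to yours
    span["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
    span["response"] = response["text"]
    span["input_tokens"] = response["usage"]["input_tokens"]
    span["output_tokens"] = response["usage"]["output_tokens"]
    print(json.dumps(span))  # stand-in for shipping the span to Langfuse / LangSmith / OTel
    return span
```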
Rung 03-Cost and token observability¶
- What: Per-call, per-feature, per-user, per-tenant token consumption. Cache hit rate. Cost projections. (Sketch below.)
- Why it earns its place: AI products live or die on unit economics. Token observability is the SLI of cost.
- Resource: LiteLLM's spend tracking. Plus Datadog / Grafana dashboards for tokens (you can build these directly).
- Done when: You have a dashboard showing tokens/day, cost/day, cache hit rate, and per-feature breakdown.
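A minimal sketch of per-feature token and cost accounting, assuming token counts come off each span; the price table is a placeholder to replace with your provider's current rates.

```python
from collections import defaultdict

# Placeholder prices in USD per 1M tokens; substitute your provider's real rates.
PRICE_PER_1M = {"gpt-4o": {"input": 2.50, "output": 10.00}}

# feature -> running totals (feed a dashboard from these)
totals = defaultdict(lambda: {"input": 0, "output": 0, "calls": 0, "cache_hits": 0, "cost_usd": 0.0})

def record(feature: str, model: str, input_tokens: int, output_tokens: int, cache_hit: bool) -> None:
    t = totals[feature]
    t["calls"] += 1
    if cache_hit:
        t["cache_hits"] += 1  # count the hit, skip the spend
        return
    t["input"] += input_tokens
    t["output"] += output_tokens
    p = PRICE_PER_1M[model]
    t["cost_usd"] += input_tokens * p["input"] / 1e6 + output_tokens * p["output"] / 1e6

record("summarize", "gpt-4o", 1200, 300, cache_hit=False)
record("summarize", "gpt-4o", 1200, 300, cache_hit=True)
for feature, t in totals.items():
    print(feature, f"cost=${t['cost_usd']:.4f}", f"cache_hit_rate={t['cache_hits'] / t['calls']:.0%}")
```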
Rung 04-Quality observability (online evals)¶
- What: Score a sampled fraction of production calls automatically (deterministic checks + judge). Track score over time. (Sketch below.)
- Why it earns its place: Latency and cost are easy. Quality is hard. A quality SLI is a Staff-level move.
- Resource: Langfuse evaluations docs. Plus Hamel's posts on online eval sampling.
- Done when: You have a quality SLI computed on production traffic with alerting on drops.
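A sketch of the sampling-plus-deterministic-checks idea, assuming a trace id already exists per call; the specific checks and the 5% sample rate are illustrative, and an LLM judge can slot in alongside `deterministic_score`.

```python
import random

SAMPLE_RATE = 0.05  # score roughly 5% of production traffic

def deterministic_score(prompt: str, response: str) -> float:
    """Cheap, deterministic checks; each is a 0/1 signal, averaged into a score."""
    checks = [
        len(response) > 0,                                    # non-empty
        len(response) < 4000,                                 # not runaway output
        "as an ai language model" not in response.lower(),    # no canned refusal boilerplate
    ]
    return sum(checks) / len(checks)

def maybe_score(trace_id: str, prompt: str, response: str) -> float | None:
    if random.random() > SAMPLE_RATE:
        return None  # not sampled
    score = deterministic_score(prompt, response)
    # In practice: attach the score to the trace and feed your quality SLI;
    # alert when the rolling average drops.
    print(f"trace={trace_id} quality_score={score:.2f}")
    return score
```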
Rung 05-OpenTelemetry GenAI semantic conventions¶
- What: OTel is standardizing semantic conventions for AI workloads-gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, etc. (Sketch below.)
- Why it earns its place: Vendor-neutral tracing for LLMs. Your observability moat shows up here. Adopt early.
- Resource: OpenTelemetry docs (opentelemetry.io)-search for "GenAI semantic conventions." Plus the active spec discussions on GitHub.
- Done when: Your project emits OTel traces using GenAI conventions.
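A minimal sketch using the OpenTelemetry Python SDK with `gen_ai.*` span attributes. The console exporter and hardcoded token counts are for illustration only, and because the GenAI conventions are still evolving, check the current spec for exact attribute names before standardizing on them.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Console exporter for illustration; swap in an OTLP exporter for a real collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-llm-service")

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... make the LLM call here ...
    span.set_attribute("gen_ai.usage.input_tokens", 512)   # values hardcoded for the sketch
    span.set_attribute("gen_ai.usage.output_tokens", 128)
```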
Rung 06-Agent observability¶
- What: Multi-step traces, tool-call success rates, trajectory analysis, replayability of failed runs. (Sketch below.)
- Why it earns its place: Agents are the hardest LLM workloads to debug; observability multiplies your debugging speed.
- Resource: LangSmith / Langfuse agent tracing tutorials.
- Done when: You can replay a failed agent run from traces alone.
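One way to make failed runs replayable, sketched in plain Python: record every tool call as a structured step in the trajectory. The `ToolCall`/`AgentRun` names are illustrative, not from any particular framework; tool-call success rate falls out of `steps` for free.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class ToolCall:
    step: int
    tool: str
    arguments: dict
    result: str | None
    success: bool
    latency_ms: float

@dataclass
class AgentRun:
    run_id: str
    goal: str
    steps: list[ToolCall] = field(default_factory=list)

    def record(self, tool: str, arguments: dict, fn) -> str | None:
        """Execute a tool and append the call to the trajectory, success or failure."""
        start = time.perf_counter()
        try:
            result, success = fn(**arguments), True
        except Exception as exc:
            result, success = f"ERROR: {exc}", False
        self.steps.append(ToolCall(len(self.steps), tool, arguments, result, success,
                                   (time.perf_counter() - start) * 1000))
        return result if success else None

    def dump(self) -> str:
        """Serialize the full trajectory: enough to replay a failed run offline."""
        return json.dumps(asdict(self), default=str)
```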
Rung 07-RAG observability¶
- What: Retrieval-specific signals: top-k results per query, retrieval scores, faithfulness, query patterns. (Sketch below.)
- Why it earns its place: RAG quality drift is invisible without retrieval observability.
- Resource: Same tools as rung 02 + RAGAS for online faithfulness.
- Done when: You can see, per query in production, what was retrieved and what was generated.
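A sketch of per-query retrieval logging, assuming a hypothetical `retriever` that returns scored chunks and a `generate` function; in practice these records attach to the same trace as the generation span rather than going to stdout.

```python
import json

def observed_rag_query(query: str, retriever, generate) -> dict:
    """Log what was retrieved alongside what was generated, per query."""
    hits = retriever(query, k=5)  # hypothetical: returns [{"doc_id": ..., "score": ..., "text": ...}]
    answer = generate(query, [h["text"] for h in hits])
    record = {
        "query": query,
        "retrieved": [{"doc_id": h["doc_id"], "score": h["score"]} for h in hits],
        "top_score": max((h["score"] for h in hits), default=None),
        "answer": answer,
        # Faithfulness (e.g. RAGAS) can be scored offline on a sample of these records.
    }
    print(json.dumps(record))  # stand-in for attaching the record to the trace
    return record
```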
Rung 08-User feedback and signal collection¶
- What: Thumbs up/down, free-text feedback, implicit signals (regenerate, copy, abandon). Pipe to traces. (Sketch below.)
- Why it earns its place: Real user signal closes the loop; without it, you optimize against your own opinion.
- Resource: Langfuse feedback docs. Plus Eugene Yan's posts on feedback systems.
- Done when: Your traces are linkable to user feedback.
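A sketch of feedback capture keyed by trace id, which is the join key that makes feedback useful downstream; the event kinds and field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    trace_id: str            # the same id the tracer emitted for this response
    kind: str                # "thumbs_up", "thumbs_down", "regenerate", "copy", "abandon"
    comment: str | None = None

feedback_log: list[dict] = []

def record_feedback(trace_id: str, kind: str, comment: str | None = None) -> None:
    event = FeedbackEvent(trace_id, kind, comment)
    feedback_log.append({**event.__dict__, "ts": datetime.now(timezone.utc).isoformat()})
    # In practice: write to your feedback store or scores API so dashboards can
    # join feedback to traces on trace_id.

record_feedback("trace-1234", "thumbs_down", "answer cited the wrong doc")
```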
Rung 09-Drift detection¶
- What: Distribution shift in inputs (new query patterns), outputs (new failure shapes), or quality scores. Alert on shifts. (Sketch below.)
- Why it earns its place: Models, prompts, and providers change underneath you. Drift detection is the safety net.
- Resource: Arize, WhyLabs, or Fiddler ML monitoring tools-pick one to study, even if you don't use it.
- Done when: You can articulate a drift detection scheme for your project's outputs.
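One concrete scheme, sketched below: compute the Population Stability Index (PSI) between a baseline window and the current window of a bounded signal such as your quality score. The binning and the thresholds in the closing comment are conventional rules of thumb, not hard standards.

```python
import math

def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of a bounded score (e.g. 0..1)."""
    lo = min(min(baseline), min(current))
    hi = max(max(baseline), max(current))
    width = (hi - lo) / bins or 1.0

    def dist(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # floor each bucket to avoid log(0) / division by zero
        return [max(c / len(xs), 1e-4) for c in counts]

    b, c = dist(baseline), dist(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 alert-worthy drift.
```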
Rung 10-Privacy, redaction, and PII handling¶
- What: Prompts and responses often contain PII. Redact before logging or use a redaction-aware tracer. (Sketch below.)
- Why it earns its place: Compliance failures kill products. This is non-optional in regulated industries.
- Resource: Langfuse's redaction features. Plus Microsoft Presidio for PII detection.
- Done when: Your traces redact PII automatically.
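A sketch of redact-before-logging with regex patterns; the patterns are deliberately simplistic and illustrative only. A detector such as Presidio, or your tracer's built-in redaction, is the production-grade option.

```python
import re

# Illustrative patterns only; real PII detection needs a proper library (e.g. Presidio).
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,3}[ .-]?)?(?:\(?\d{3}\)?[ .-]?)\d{3}[ .-]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII with a typed placeholder before the text ever hits a trace."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}_REDACTED>", text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567 about the claim."))
# -> Contact <EMAIL_REDACTED> or <PHONE_REDACTED> about the claim.
```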
Rung 11-Connecting LLM observability to your existing stack¶
- What: Datadog, Grafana, Prometheus already exist at most companies. Bridging LLM signals into them is the practical move. (Sketch below.)
- Why it earns its place: Your observability skills literally transfer here. Nobody else on the team will be as fluent.
- Resource: Datadog's LLM observability product docs. Plus Grafana's LLM dashboards.
- Done when: Your LLM project has dashboards in your team's existing observability stack.
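A sketch using `prometheus_client` to expose token and latency signals as ordinary Prometheus metrics, so the Grafana dashboards and alert rules your team already has can consume them; metric and label names are illustrative.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; follow whatever naming convention your team already uses.
LLM_TOKENS = Counter(
    "llm_tokens_total", "LLM tokens consumed", ["model", "feature", "direction"]
)
LLM_LATENCY = Histogram(
    "llm_request_latency_seconds", "LLM call latency", ["model", "feature"]
)

def observe_call(model: str, feature: str, input_tokens: int, output_tokens: int, latency_s: float) -> None:
    LLM_TOKENS.labels(model=model, feature=feature, direction="input").inc(input_tokens)
    LLM_TOKENS.labels(model=model, feature=feature, direction="output").inc(output_tokens)
    LLM_LATENCY.labels(model=model, feature=feature).observe(latency_s)

if __name__ == "__main__":
    start_http_server(9200)  # scrape target for your existing Prometheus
    observe_call("gpt-4o", "summarize", 1200, 300, 1.8)
```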
Minimum required to leave this sequence¶
- Project emits traces with prompt, response, latency, and token counts.
- Cost dashboard with cache hit rate and per-feature breakdown.
- Online quality eval running on a sampled subset of traffic.
- OTel GenAI conventions adopted.
- User feedback linked to traces.
- PII redaction in place.
Going further¶
- OpenTelemetry GenAI working group discussions and PRs-contribute.
- Datadog's LLM Observability product-study the design.
- Eugene Yan's "ML monitoring" archive-the foundations are excellent.
How this sequence connects to the year¶
- Month 6: Rungs 01–04 are core to the eval and observability work that makes month 6's anchor real.
- Q3: Bridge sequence for any track. Your moat sequence.
- Q4 blog post: "LLM observability for engineers who already know observability"-most leveraged piece you'll write.