
13-LLM Observability

Why this matters in the journey

This is your bridge sequence. Everything you know about distributed-systems observability (traces, metrics, logs, SLOs, alerting) extends here, but with a twist: the "spans" are LLM calls, the "errors" are hallucinations, the "latency budget" includes token economics, and the "drift" is model behavior change. Few engineers come from observability into AI; you'll be unusually credible here.

The rungs

Rung 01-Why LLM observability is different

  • What: LLM systems are nondeterministic, expensive per call, latency-sensitive, and quality-sensitive. Traditional APM doesn't capture quality.
  • Why it earns its place: Frame the gap before reaching for tools.
  • Resource: Langfuse / LangSmith blogs on "what LLM observability is." Plus your own SLI/SLO instincts as a reference frame.
  • Done when: You can list 3 things APM tools miss for LLM systems.

Rung 02-Tracing LLM calls

  • What: Each LLM call is a span: model, prompt, response, tokens, latency. Multi-step pipelines (RAG, agents) are nested traces.
  • Why it earns its place: A trace is the atomic unit of LLM debugging. Without it, you're flying blind.
  • Resource: Langfuse docs (langfuse.com). LangSmith docs (smith.langchain.com).
  • Done when: Your project emits traces with prompt, response, latency, and token counts visible per span.
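To make the span shape concrete, here is a minimal stdlib-only sketch. The Langfuse and LangSmith SDKs do this capture for you; `LLMSpan` and `record_call` are illustrative names, and `call_fn` stands in for whatever provider client you use, assumed here to return `(text, input_tokens, output_tokens)`.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


@dataclass
class LLMSpan:
    # One LLM call = one span: model, prompt, response, tokens, latency.
    name: str
    model: str
    prompt: str
    trace_id: str
    response: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    parent_id: Optional[str] = None  # set for nested RAG/agent steps
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def record_call(trace_id: str, name: str, model: str, prompt: str,
                call_fn: Callable[[str], Tuple[str, int, int]],
                parent_id: Optional[str] = None) -> LLMSpan:
    """Wrap a provider call and capture the fields a span needs."""
    start = time.perf_counter()
    response, in_tok, out_tok = call_fn(prompt)
    return LLMSpan(name=name, model=model, prompt=prompt, trace_id=trace_id,
                   parent_id=parent_id, response=response,
                   input_tokens=in_tok, output_tokens=out_tok,
                   latency_ms=(time.perf_counter() - start) * 1000)


# Usage with a stubbed provider; a multi-step pipeline would share one
# trace_id and chain parent_id across spans.
fake_llm = lambda prompt: ("ok", len(prompt.split()), 1)
span = record_call("trace-1", "summarize", "some-model", "hello world", fake_llm)
```

The point of the wrapper shape: every call goes through one choke point, so no code path can make an uninstrumented LLM call.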

Rung 03-Cost and token observability

  • What: Per-call, per-feature, per-user, per-tenant token consumption. Cache hit rate. Cost projections.
  • Why it earns its place: AI products live or die on unit economics. Token observability is the SLI of cost.
  • Resource: LiteLLM's spend tracking. Plus Datadog / Grafana dashboards for tokens (you can build these directly).
  • Done when: You have a dashboard showing tokens/day, cost/day, cache hit rate, and per-feature breakdown.
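The aggregation behind that dashboard can be sketched in a few lines. The prices and model name below are placeholders, not real rates; substitute your provider's rate card. LiteLLM's spend tracking gives you this per-key/per-user out of the box.

```python
from collections import defaultdict

# Hypothetical per-million-token prices; substitute your provider's rate card.
PRICES = {"some-model": (0.15, 0.60)}  # (input $/1M tokens, output $/1M tokens)


def new_ledger():
    return defaultdict(lambda: {"tokens": 0, "cost": 0.0, "calls": 0, "cache_hits": 0})


def add_call(ledger, feature, model, input_tokens, output_tokens, cache_hit):
    in_price, out_price = PRICES[model]
    rec = ledger[feature]
    rec["tokens"] += input_tokens + output_tokens
    rec["cost"] += (input_tokens * in_price + output_tokens * out_price) / 1e6
    rec["calls"] += 1
    rec["cache_hits"] += int(cache_hit)


def summarize(ledger):
    # Per-feature rollup: total tokens, dollar cost, and cache hit rate.
    return {f: {**r, "cache_hit_rate": r["cache_hits"] / r["calls"]}
            for f, r in ledger.items()}


ledger = new_ledger()
add_call(ledger, "search", "some-model", 1000, 200, cache_hit=True)
add_call(ledger, "search", "some-model", 1000, 200, cache_hit=False)
report = summarize(ledger)
```

Swap `feature` for `user` or `tenant` keys and the same shape gives you the per-user and per-tenant cuts.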

Rung 04-Quality observability (online evals)

  • What: Score a sampled fraction of production calls automatically (deterministic checks + judge). Track score over time.
  • Why it earns its place: Latency and cost are easy. Quality is hard. A quality SLI is a Staff-level move.
  • Resource: Langfuse evaluations docs. Plus Hamel's posts on online eval sampling.
  • Done when: You have a quality SLI computed on production traffic with alerting on drops.
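A sketch of the two cheap pieces, under stated assumptions: hash-based sampling (so the same trace always gets the same decision, reproducible across workers) and a deterministic pre-check whose rules here are invented examples; a judge model would add a second score on the same sample.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.05) -> bool:
    # Hash the trace id into a stable bucket so the sampling decision
    # is deterministic and evenly spread, unlike random.random().
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000


def deterministic_score(response: str) -> float:
    # Cheap checks that can run on every sampled call. These three rules
    # are illustrative; yours come from your product's failure modes.
    checks = [
        len(response.strip()) > 0,           # non-empty
        len(response) < 4000,                # within length budget
        "as an ai" not in response.lower(),  # crude refusal/boilerplate check
    ]
    return sum(checks) / len(checks)
```

The rolling mean of these scores over a time window is your quality SLI; alert when it drops below the floor you've set.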

Rung 05-OpenTelemetry GenAI semantic conventions

  • What: OTel is standardizing semantic conventions for AI workloads-gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, etc.
  • Why it earns its place: Vendor-neutral tracing for LLMs. Your observability moat shows up here. Adopt early.
  • Resource: OpenTelemetry docs (opentelemetry.io)-search for "GenAI semantic conventions." Plus the active spec discussions on GitHub.
  • Done when: Your project emits OTel traces using GenAI conventions.
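In practice this means setting `gen_ai.*` attributes on your spans. A sketch of the attribute dict, with a placeholder model name; the conventions are still stabilizing, so verify the exact names against the current spec version before shipping. In real code you'd set each key on an OTel span via `span.set_attribute`.

```python
def genai_attributes(system: str, model: str,
                     input_tokens: int, output_tokens: int) -> dict:
    # Attribute names follow the OTel GenAI semantic conventions; names
    # have changed across spec revisions, so check the current draft.
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


attrs = genai_attributes("openai", "some-model", 120, 40)
```

The payoff of the standard names: any OTel-compatible backend (Datadog, Grafana Tempo, Jaeger) can group and aggregate your LLM spans without vendor-specific glue.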

Rung 06-Agent observability

  • What: Multi-step traces, tool-call success rates, trajectory analysis, replayability of failed runs.
  • Why it earns its place: Agents are the hardest LLM workloads to debug; observability multiplies your debugging speed.
  • Resource: LangSmith / Langfuse agent tracing tutorials.
  • Done when: You can replay a failed agent run from traces alone.

Rung 07-RAG observability

  • What: Retrieval-specific signals: top-k results per query, retrieval scores, faithfulness, query patterns.
  • Why it earns its place: RAG quality drift is invisible without retrieval observability.
  • Resource: Same tools as rung 02 + RAGAS for online faithfulness.
  • Done when: You can see, per query in production, what was retrieved and what was generated.
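The retrieval side of that trace is one structured event per query. A sketch, assuming your retriever returns `(doc_id, score)` pairs (names illustrative):

```python
def retrieval_event(query, results, k=5):
    # results: (doc_id, score) pairs from your retriever.
    top = sorted(results, key=lambda r: r[1], reverse=True)[:k]
    return {
        "query": query,
        "top_k": [{"doc_id": d, "score": round(s, 4)} for d, s in top],
        # A persistently low max_score is the "nothing relevant was
        # retrieved" smell that generation-side metrics never show.
        "max_score": top[0][1] if top else None,
    }


event = retrieval_event(
    "refund policy",
    [("doc-7", 0.81), ("doc-2", 0.42), ("doc-9", 0.88)],
    k=2,
)
```

Attach this event to the same trace as the generation span, and the "what was retrieved vs. what was generated" question becomes a single trace view.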

Rung 08-User feedback and signal collection

  • What: Thumbs up/down, free-text feedback, implicit signals (regenerate, copy, abandon). Pipe to traces.
  • Why it earns its place: Real user signal closes the loop; without it, you optimize against your own opinion.
  • Resource: Langfuse feedback docs. Plus Eugene Yan's posts on feedback systems.
  • Done when: Your traces are linkable to user feedback.
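The linking mechanism is just keying feedback by trace id. A minimal in-memory sketch (Langfuse's `score` API plays this role in practice; the store and field names below are illustrative):

```python
feedback_store: dict = {}  # trace_id -> list of feedback events


def record_feedback(trace_id: str, kind: str, value, comment: str = None):
    # kind: "thumbs" (explicit), "implicit" (regenerate/copy/abandon),
    # or "text" (free-form comment).
    feedback_store.setdefault(trace_id, []).append(
        {"kind": kind, "value": value, "comment": comment})


record_feedback("trace-1", "thumbs", -1)
record_feedback("trace-1", "implicit", "regenerate")
```

The design point: the frontend must carry the trace id through to the feedback widget, otherwise signals arrive orphaned and you can never join them back to the prompt that caused them.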

Rung 09-Drift detection

  • What: Distribution shift in inputs (new query patterns), outputs (new failure shapes), or quality scores. Alert on shifts.
  • Why it earns its place: Models, prompts, and providers change underneath you. Drift detection is the safety net.
  • Resource: Arize, WhyLabs, or Fiddler ML monitoring tools-pick one to study, even if you don't use it.
  • Done when: You can articulate a drift detection scheme for your project's outputs.

Rung 10-Privacy, redaction, and PII handling

  • What: Prompts and responses often contain PII. Redact before logging or use a redaction-aware tracer.
  • Why it earns its place: Compliance failures kill products. This is non-optional in regulated industries.
  • Resource: Langfuse's redaction features. Plus Microsoft Presidio for PII detection.
  • Done when: Your traces redact PII automatically.
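To show where redaction sits in the pipeline, here is a deliberately minimal regex sketch covering only emails and US-style phone numbers; production redaction (Presidio, or Langfuse's masking hooks) covers far more entity types and locales, and regexes alone are not a compliance story.

```python
import re

# Minimal illustrative patterns; real PII detection needs many more
# entity types (names, addresses, IDs) and locale-aware rules.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    # Apply redaction BEFORE the text reaches the tracer; once PII is in
    # the trace store, deleting it is an incident, not a config change.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

The placement matters more than the patterns: redaction belongs in the instrumentation layer, not as a cleanup job over stored traces.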

Rung 11-Connecting LLM observability to your existing stack

  • What: Datadog, Grafana, Prometheus already exist at most companies. Bridging LLM signals into them is the practical move.
  • Why it earns its place: Your observability skills transfer directly here. Nobody else on the team will be as fluent.
  • Resource: Datadog's LLM observability product docs. Plus Grafana's LLM dashboards.
  • Done when: Your LLM project has dashboards in your team's existing observability stack.
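The simplest bridge is emitting your token counters in the Prometheus text exposition format, so the scrape pipeline your team already runs picks them up unchanged. A sketch, with an illustrative metric name:

```python
def prometheus_lines(token_totals):
    # token_totals: {(feature, model): cumulative_tokens}. Rendered in the
    # Prometheus text exposition format for an existing scrape endpoint.
    lines = ["# TYPE llm_tokens_total counter"]
    for (feature, model), tokens in sorted(token_totals.items()):
        lines.append(
            f'llm_tokens_total{{feature="{feature}",model="{model}"}} {tokens}')
    return "\n".join(lines)


text = prometheus_lines({("search", "some-model"): 2400})
```

In practice you'd use the official `prometheus_client` library rather than hand-rendering, but the exercise makes the point: LLM signals are just labeled counters and histograms once you name them, and every existing dashboard and alert rule applies.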

Minimum required to leave this sequence

  • Project emits traces with prompt, response, latency, and token counts.
  • Cost dashboard with cache hit rate and per-feature breakdown.
  • Online quality eval running on a sampled subset of traffic.
  • OTel GenAI conventions adopted.
  • User feedback linked to traces.
  • PII redaction in place.

Going further

  • OpenTelemetry GenAI working group discussions and PRs-contribute.
  • Datadog's LLM Observability product-study the design.
  • Eugene Yan's "ML monitoring" archive-the foundations are excellent.

How this sequence connects to the year

  • Month 6: Rungs 01–04 are core to the eval and observability work that makes month 6's anchor real.
  • Q3: Bridge sequence for any track. Your moat sequence.
  • Q4 blog post: "LLM observability for engineers who already know observability"-most leveraged piece you'll write.
