
Week 22 - Observability, Cost, Eval Pipelines, MLOps

22.1 Conceptual Core

  • ML observability adds three layers on top of the system observability you already learned:
      • Model observability: prediction-distribution drift, input-feature drift, output quality (toxicity, refusals, hallucinations). Tools: Arize, Fiddler, WhyLabs, Langfuse.
      • Eval pipelines: continuous evaluation on benchmarks (MMLU, HumanEval, internal eval sets). Tools: lm-evaluation-harness, OpenAI Evals, Inspect AI, internal bespoke harnesses.
      • Cost observability: per-team / per-product / per-feature attribution. GPU-hours × $/hour, plus fixed-cost amortization (a minimal attribution sketch follows this list). OpenCost (week 22 of the K8s curriculum) plus model-aware tagging.
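
  A minimal sketch of the attribution arithmetic above, assuming a flat $/GPU-hour rate and fixed costs amortized in proportion to usage share; the rate, fixed-cost figure, and team names are made up for illustration:

    # Cost attribution sketch: variable cost = GPU-hours x $/GPU-hour,
    # fixed costs (reserved capacity, storage) amortized by usage share.
    # All numbers and team names are hypothetical.
    GPU_HOURLY_RATE = 2.50          # assumed $/GPU-hour for the node pool
    MONTHLY_FIXED_COST = 12_000.0   # assumed reserved capacity + storage per month

    # GPU-hours per team, e.g. aggregated from OpenCost usage data.
    gpu_hours = {"search": 1_800, "assistant": 4_200, "batch-embeddings": 900}

    total_hours = sum(gpu_hours.values())
    for team, hours in gpu_hours.items():
        variable = hours * GPU_HOURLY_RATE
        amortized = MONTHLY_FIXED_COST * (hours / total_hours)
        print(f"{team:>18}: variable=${variable:>9,.2f}  "
              f"amortized=${amortized:>8,.2f}  total=${variable + amortized:>9,.2f}")

  In practice the GPU-hour totals come from OpenCost or your metering pipeline, keyed by the same team/product/feature tags used for attribution.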

22.2 Mechanical Detail

  • Eval-as-CI: every model checkpoint runs through a fixed eval suite. Regressions block promotion. The "tests" of ML.
  • Tracing for LLM applications (vs traditional traces): a single user request fans out to multiple LLM calls, embeddings, retrievals, and tool uses. OTel + Langfuse / LangSmith capture the call tree with prompts, responses, latencies, and costs (a minimal fan-out sketch follows this list).
  • Drift detection (a PSI/KS sketch follows this list):
      • Input drift: PSI (Population Stability Index), KS test on feature distributions.
      • Output drift: change in the label/output distribution. For LLMs: monitor refusal rate, response length, toxicity scores.
      • Concept drift: the relationship between inputs and labels changes. Hardest to detect.
  • A/B and canary: traffic-split with measured metrics. KServe's canary support handles the routing; the metric aggregation is yours to build.
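
  A minimal OpenTelemetry sketch of the fan-out described above: one parent span per user request with child spans for retrieval and the LLM call. The attribute names are illustrative rather than a fixed semantic convention, and the retrieval and model calls are stubbed out:

    # Fan-out tracing sketch with the OpenTelemetry Python SDK.
    # Langfuse / LangSmith layer prompt- and cost-aware views on this kind of call tree.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    tracer = trace.get_tracer("llm-app")

    def handle_request(question: str) -> str:
        with tracer.start_as_current_span("user_request") as root:
            root.set_attribute("request.question_chars", len(question))

            with tracer.start_as_current_span("retrieval") as span:
                docs = ["doc-1", "doc-2"]            # stand-in for a vector-store query
                span.set_attribute("retrieval.num_docs", len(docs))

            with tracer.start_as_current_span("llm_call") as span:
                answer = "stub answer"               # stand-in for the model call
                span.set_attribute("llm.prompt_chars", len(question))
                span.set_attribute("llm.completion_chars", len(answer))
                span.set_attribute("llm.cost_usd", 0.0021)  # derived from token counts in practice

            return answer

    handle_request("What changed in the last deploy?")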
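
  And a sketch of the input-drift checks: PSI with baseline-derived bins plus a two-sample KS test, applied here to prompt length. The distributions are synthetic and the PSI thresholds (0.1 / 0.25) are common rules of thumb, not standards:

    # Input-drift sketch: PSI and a two-sample KS test on one numeric feature.
    # Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    import numpy as np
    from scipy import stats

    def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
        """Population Stability Index between a baseline and a current sample."""
        edges = np.histogram_bin_edges(expected, bins=bins)
        actual = np.clip(actual, edges[0], edges[-1])        # keep every sample in range
        e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
        a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
        e_frac = np.clip(e_frac, 1e-6, None)                 # avoid log(0) in empty bins
        a_frac = np.clip(a_frac, 1e-6, None)
        return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

    rng = np.random.default_rng(0)
    baseline_lengths = rng.lognormal(mean=5.0, sigma=0.5, size=5_000)  # training-time prompt lengths
    current_lengths = rng.lognormal(mean=5.4, sigma=0.5, size=5_000)   # drifted: longer prompts

    print(f"PSI = {psi(baseline_lengths, current_lengths):.3f}")
    ks = stats.ks_2samp(baseline_lengths, current_lengths)
    print(f"KS statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.2e}")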

22.3 Lab: "Eval and Drift Pipeline"

  1. Build a CI pipeline: on every model push, run lm-evaluation-harness on a fixed subset (a 500-question MMLU slice, HumanEval pass@1).
  2. Compare against a stored baseline; fail the pipeline on a >2% regression (a gate sketch follows this list).
  3. Wire production traffic samples into a drift dashboard: input length distribution, output length distribution, refusal rate, fraction of failed JSON-mode outputs.
  4. Synthetic drift: shift the input distribution (longer prompts) and verify the dashboard catches it.
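
  A sketch of the promotion gate in step 2, assuming the harness writes a results JSON of the shape {"results": {task: {metric: value}}}. The exact metric keys vary by lm-evaluation-harness version, so the task/metric names below are placeholders:

    # Promotion gate sketch: compare candidate eval scores against a stored
    # baseline and exit non-zero on a >2% relative regression (blocks the CI job).
    import json
    import sys

    MAX_RELATIVE_REGRESSION = 0.02
    TRACKED_METRICS = [            # (task, metric) pairs; names are placeholders
        ("mmlu", "acc"),
        ("humaneval", "pass@1"),
    ]

    def load_scores(path: str) -> dict:
        with open(path) as f:
            results = json.load(f)["results"]
        return {(task, metric): results[task][metric] for task, metric in TRACKED_METRICS}

    def main(baseline_path: str, candidate_path: str) -> int:
        baseline, candidate = load_scores(baseline_path), load_scores(candidate_path)
        failed = False
        for key, base in baseline.items():
            regression = (base - candidate[key]) / base
            status = "FAIL" if regression > MAX_RELATIVE_REGRESSION else "ok"
            failed = failed or status == "FAIL"
            print(f"{key}: baseline={base:.4f} candidate={candidate[key]:.4f} "
                  f"regression={regression:+.2%} [{status}]")
        return 1 if failed else 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1], sys.argv[2]))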

22.4 Idiomatic & Diagnostic Drill

  • Cost/quality Pareto: every eval run captures both quality scores and inference cost. The dashboard plots cost vs quality per model: the unit of decision-making for production model selection (a small frontier sketch follows).
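
  A small frontier sketch, assuming one cost and one quality number per model: keep only the models not dominated by a cheaper-and-at-least-as-good alternative. Model names and numbers are made up:

    # Cost/quality Pareto frontier sketch; all models and numbers are hypothetical.
    models = {
        "model-small":  {"cost_per_1k": 0.40, "quality": 0.71},
        "model-medium": {"cost_per_1k": 1.10, "quality": 0.78},
        "model-large":  {"cost_per_1k": 4.80, "quality": 0.77},  # dominated by model-medium
        "model-xl":     {"cost_per_1k": 9.50, "quality": 0.86},
    }

    def pareto_frontier(candidates: dict) -> list[str]:
        """Keep models with no cheaper-and-at-least-as-good alternative."""
        frontier = []
        for name, m in candidates.items():
            dominated = any(
                other is not m
                and other["cost_per_1k"] <= m["cost_per_1k"]
                and other["quality"] >= m["quality"]
                for other in candidates.values()
            )
            if not dominated:
                frontier.append(name)
        return frontier

    print("Pareto-efficient models:", pareto_frontier(models))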

22.5 Production Slice

  • Document an "incident response for model regressions" runbook: detection → roll back via traffic split → investigate → fix → re-promote. The same shape as a software incident, with model-specific detection signals and rollback mechanics.
