
Week 22 - Observability, Cost, Eval Pipelines, MLOps

22.1 Conceptual Core

  • ML observability adds three layers on top of the system observability you already learned:
      • Model observability: prediction-distribution drift, input-feature drift, output quality (toxicity, refusals, hallucinations). Tools: Arize, Fiddler, WhyLabs, Langfuse.
      • Eval pipelines: continuous evaluation on benchmarks (MMLU, HumanEval, internal eval sets). Tools: lm-evaluation-harness, OpenAI Evals, Inspect AI, internal bespoke harnesses.
      • Cost observability: per-team / per-product / per-feature attribution. GPU-hours × $/hour, plus fixed-cost amortization (a minimal attribution sketch follows this list). OpenCost (week 22 of the K8s curriculum) plus model-aware tagging.
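
  A minimal sketch of the attribution arithmetic above, assuming a flat $/GPU-hour rate and fixed costs amortized in proportion to usage share; the rate, fixed-cost figure, and team names are made up for illustration:

    # Cost attribution sketch: variable cost = GPU-hours x $/GPU-hour,
    # fixed costs (reserved capacity, storage) amortized by usage share.
    # All numbers and team names are hypothetical.
    GPU_HOURLY_RATE = 2.50          # assumed $/GPU-hour for the node pool
    MONTHLY_FIXED_COST = 12_000.0   # assumed reserved capacity + storage per month

    # GPU-hours per team, e.g. aggregated from OpenCost usage data.
    gpu_hours = {"search": 1_800, "assistant": 4_200, "batch-embeddings": 900}

    total_hours = sum(gpu_hours.values())
    for team, hours in gpu_hours.items():
        variable = hours * GPU_HOURLY_RATE
        amortized = MONTHLY_FIXED_COST * (hours / total_hours)
        print(f"{team:>18}: variable=${variable:>9,.2f}  "
              f"amortized=${amortized:>8,.2f}  total=${variable + amortized:>9,.2f}")

  In practice the GPU-hour totals come from OpenCost or your metering pipeline, keyed by the same team/product/feature tags used for attribution.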

22.2 Mechanical Detail

  • Eval-as-CI: every model checkpoint runs through a fixed eval suite. Regressions block promotion. The "tests" of ML.
  • Tracing for LLM applications (vs traditional traces): a single user request fans out to multiple LLM calls, embeddings, retrievals, and tool uses. OTel + Langfuse / LangSmith capture the call tree with prompts, responses, latencies, and costs (a minimal fan-out sketch follows this list).
  • Drift detection (a PSI/KS sketch follows this list):
      • Input drift: PSI (Population Stability Index), KS test on feature distributions.
      • Output drift: change in the label/output distribution. For LLMs: monitor refusal rate, response length, toxicity scores.
      • Concept drift: the relationship between inputs and labels changes. Hardest to detect.
  • A/B and canary: traffic-split with measured metrics. KServe's canary support handles the routing; the metric aggregation is yours to build.
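
  A minimal OpenTelemetry sketch of the fan-out described above: one parent span per user request with child spans for retrieval and the LLM call. The attribute names are illustrative rather than a fixed semantic convention, and the retrieval and model calls are stubbed out:

    # Fan-out tracing sketch with the OpenTelemetry Python SDK.
    # Langfuse / LangSmith layer prompt- and cost-aware views on this kind of call tree.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    tracer = trace.get_tracer("llm-app")

    def handle_request(question: str) -> str:
        with tracer.start_as_current_span("user_request") as root:
            root.set_attribute("request.question_chars", len(question))

            with tracer.start_as_current_span("retrieval") as span:
                docs = ["doc-1", "doc-2"]            # stand-in for a vector-store query
                span.set_attribute("retrieval.num_docs", len(docs))

            with tracer.start_as_current_span("llm_call") as span:
                answer = "stub answer"               # stand-in for the model call
                span.set_attribute("llm.prompt_chars", len(question))
                span.set_attribute("llm.completion_chars", len(answer))
                span.set_attribute("llm.cost_usd", 0.0021)  # derived from token counts in practice

            return answer

    handle_request("What changed in the last deploy?")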
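
  And a sketch of the input-drift checks: PSI with baseline-derived bins plus a two-sample KS test, applied here to prompt length. The distributions are synthetic and the PSI thresholds (0.1 / 0.25) are common rules of thumb, not standards:

    # Input-drift sketch: PSI and a two-sample KS test on one numeric feature.
    # Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    import numpy as np
    from scipy import stats

    def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
        """Population Stability Index between a baseline and a current sample."""
        edges = np.histogram_bin_edges(expected, bins=bins)
        actual = np.clip(actual, edges[0], edges[-1])        # keep every sample in range
        e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
        a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
        e_frac = np.clip(e_frac, 1e-6, None)                 # avoid log(0) in empty bins
        a_frac = np.clip(a_frac, 1e-6, None)
        return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

    rng = np.random.default_rng(0)
    baseline_lengths = rng.lognormal(mean=5.0, sigma=0.5, size=5_000)  # training-time prompt lengths
    current_lengths = rng.lognormal(mean=5.4, sigma=0.5, size=5_000)   # drifted: longer prompts

    print(f"PSI = {psi(baseline_lengths, current_lengths):.3f}")
    ks = stats.ks_2samp(baseline_lengths, current_lengths)
    print(f"KS statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.2e}")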

22.3 Lab: "Eval and Drift Pipeline"

  1. Build a CI pipeline: on every model push, run lm-evaluation-harness on a fixed subset (a 500-question MMLU slice, HumanEval pass@1).
  2. Compare against a stored baseline; fail the pipeline on a >2% regression (a gate sketch follows this list).
  3. Wire production traffic samples into a drift dashboard: input length distribution, output length distribution, refusal rate, fraction of failed JSON-mode outputs.
  4. Synthetic drift: shift the input distribution (longer prompts) and verify the dashboard catches it.
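
  A sketch of the promotion gate in step 2, assuming the harness writes a results JSON of the shape {"results": {task: {metric: value}}}. The exact metric keys vary by lm-evaluation-harness version, so the task/metric names below are placeholders:

    # Promotion gate sketch: compare candidate eval scores against a stored
    # baseline and exit non-zero on a >2% relative regression (blocks the CI job).
    import json
    import sys

    MAX_RELATIVE_REGRESSION = 0.02
    TRACKED_METRICS = [            # (task, metric) pairs; names are placeholders
        ("mmlu", "acc"),
        ("humaneval", "pass@1"),
    ]

    def load_scores(path: str) -> dict:
        with open(path) as f:
            results = json.load(f)["results"]
        return {(task, metric): results[task][metric] for task, metric in TRACKED_METRICS}

    def main(baseline_path: str, candidate_path: str) -> int:
        baseline, candidate = load_scores(baseline_path), load_scores(candidate_path)
        failed = False
        for key, base in baseline.items():
            regression = (base - candidate[key]) / base
            status = "FAIL" if regression > MAX_RELATIVE_REGRESSION else "ok"
            failed = failed or status == "FAIL"
            print(f"{key}: baseline={base:.4f} candidate={candidate[key]:.4f} "
                  f"regression={regression:+.2%} [{status}]")
        return 1 if failed else 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1], sys.argv[2]))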

22.4 Idiomatic & Diagnostic Drill

  • Cost/quality Pareto: every eval run captures both quality scores and inference cost. The dashboard plots cost vs quality per model: the unit of decision-making for production model selection (a small frontier sketch follows).
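
  A small frontier sketch, assuming one cost and one quality number per model: keep only the models not dominated by a cheaper-and-at-least-as-good alternative. Model names and numbers are made up:

    # Cost/quality Pareto frontier sketch; all models and numbers are hypothetical.
    models = {
        "model-small":  {"cost_per_1k": 0.40, "quality": 0.71},
        "model-medium": {"cost_per_1k": 1.10, "quality": 0.78},
        "model-large":  {"cost_per_1k": 4.80, "quality": 0.77},  # dominated by model-medium
        "model-xl":     {"cost_per_1k": 9.50, "quality": 0.86},
    }

    def pareto_frontier(candidates: dict) -> list[str]:
        """Keep models with no cheaper-and-at-least-as-good alternative."""
        frontier = []
        for name, m in candidates.items():
            dominated = any(
                other is not m
                and other["cost_per_1k"] <= m["cost_per_1k"]
                and other["quality"] >= m["quality"]
                for other in candidates.values()
            )
            if not dominated:
                frontier.append(name)
        return frontier

    print("Pareto-efficient models:", pareto_frontier(models))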

22.5 Production Slice

  • Document an "incident response for model regressions" runbook: detection → roll back via traffic split → investigate → fix → re-promote. The same shape as a software incident, with model-specific detection signals and rollback mechanics.
