Week 22 - Observability, Cost, Eval Pipelines, MLOps
22.1 Conceptual Core
- ML observability adds three layers on top of the system observability you already learned:
  - Model observability: prediction-distribution drift, input feature drift, output quality (toxicity, refusals, hallucinations). Tools: Arize, Fiddler, WhyLabs, Langfuse.
  - Eval pipelines: continuous evaluation on benchmarks (MMLU, HumanEval, internal eval sets). Tools: lm-evaluation-harness, OpenAI Evals, Inspect AI, internal bespoke harnesses.
  - Cost observability: per-team / per-product / per-feature attribution. GPU-hours × $/hour, plus fixed-cost amortization. OpenCost (week 22 of the K8s curriculum) plus model-aware tagging.
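The GPU-hours × $/hour attribution with fixed-cost amortization can be sketched in a few lines. All rates, team names, and the proportional amortization rule below are illustrative assumptions, not real numbers or any tool's API:

```python
from dataclasses import dataclass

@dataclass
class GpuUsage:
    team: str
    gpu_hours: float

GPU_RATE_PER_HOUR = 2.50        # $/GPU-hour (assumed on-demand rate)
FIXED_MONTHLY_COST = 12_000.0   # amortized cluster overhead (assumed)

def attribute_costs(usages):
    """Per-team cost: direct GPU spend plus fixed cost amortized by usage share."""
    total_hours = sum(u.gpu_hours for u in usages)
    report = {}
    for u in usages:
        direct = u.gpu_hours * GPU_RATE_PER_HOUR
        amortized = FIXED_MONTHLY_COST * (u.gpu_hours / total_hours)
        report[u.team] = round(direct + amortized, 2)
    return report

report = attribute_costs([GpuUsage("search", 1_000), GpuUsage("assistant", 3_000)])
print(report)  # {'search': 5500.0, 'assistant': 16500.0}
```

A real system would pull the usage numbers from OpenCost and model-aware request tags rather than hard-coding them.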
22.2 Mechanical Detail
- Eval-as-CI: every model checkpoint runs through a fixed eval suite. Regressions block promotion. The "tests" of ML.
- Tracing for LLM applications (vs traditional traces): a single user request fans out to multiple LLM calls, embeddings, retrievals, tool uses. OTel + Langfuse / LangSmith capture the call tree with prompts, responses, latencies, costs.
- Drift detection:
  - Input drift: PSI (Population Stability Index), KS test on feature distributions.
  - Output drift: change in the label/output distribution. For LLMs: monitor refusal rate, response length, toxicity scores.
  - Concept drift: the relationship between input and label changes. Hardest to detect.
- A/B and canary: traffic-split with measured metrics. KServe's canary support handles the routing; the metric aggregation is yours to build.
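The PSI check above is small enough to sketch with plain NumPy. A minimal version, comparing a baseline sample against a live sample (the bin count is a conventional choice; common rules of thumb treat PSI above roughly 0.2 as meaningful drift, but thresholds vary by team):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample."""
    # Bin edges come from the baseline (expected) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert to proportions; add a small epsilon to avoid log(0).
    eps = 1e-6
    e_pct = e_counts / e_counts.sum() + eps
    a_pct = a_counts / a_counts.sum() + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
shifted = rng.normal(0.5, 1, 10_000)   # simulated input drift

psi_stable = psi(baseline, baseline)   # 0: identical distributions
psi_drifted = psi(baseline, shifted)   # noticeably larger
print(psi_stable, psi_drifted)
```

The same function drives the lab's synthetic-drift check: shift the input distribution and confirm the PSI crosses your alerting threshold.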
22.3 Lab: "Eval and Drift Pipeline"
- Build a CI pipeline: on every model push, run lm-evaluation-harness on a fixed subset (a 500-question MMLU slice, HumanEval pass@1).
- Compare against a baseline; fail the pipeline on >2% regression.
- Wire production traffic samples into a drift dashboard: input length distribution, output length distribution, refusal rate, fraction of failed JSON-mode outputs.
- Synthetic drift: shift the input distribution (longer prompts) and verify the dashboard catches it.
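The ">2% regression fails the pipeline" gate reduces to a small comparison script. A sketch, where the benchmark names and scores are made up for illustration and the real scores would come from the harness's result files:

```python
def gate(baseline: dict, candidate: dict, max_regression: float = 0.02) -> list:
    """Return the evals whose score regressed by more than the threshold."""
    failures = []
    for name, base_score in baseline.items():
        cand_score = candidate.get(name, 0.0)
        if base_score - cand_score > max_regression:
            failures.append((name, base_score, cand_score))
    return failures

baseline = {"mmlu_subset": 0.712, "humaneval_pass@1": 0.480}
candidate = {"mmlu_subset": 0.705, "humaneval_pass@1": 0.442}

failures = gate(baseline, candidate)
for name, b, c in failures:
    print(f"REGRESSION {name}: {b:.3f} -> {c:.3f}")
# In CI, a non-empty failure list would exit non-zero to block promotion.
```

Here the MMLU dip (0.7 points) is within tolerance, while the HumanEval drop (3.8 points) trips the gate.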
22.4 Idiomatic & Diagnostic Drill
- Cost/quality Pareto: every eval run captures both quality scores and inference cost. The dashboard is cost vs. quality per model, the unit of decision-making for production model selection.
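The Pareto frontier behind that dashboard can be computed directly: keep every model that no other model beats on both axes. Model names and cost/quality numbers below are hypothetical:

```python
def pareto_frontier(models):
    """models: list of (name, cost, quality). Keep models not dominated:
    no other model is at least as cheap AND at least as good, with one strict."""
    frontier = []
    for name, cost, quality in models:
        dominated = any(
            c <= cost and q >= quality and (c < cost or q > quality)
            for n, c, q in models if n != name
        )
        if not dominated:
            frontier.append((name, cost, quality))
    return sorted(frontier, key=lambda m: m[1])

models = [
    ("small",  0.5, 0.62),   # hypothetical $/1k tokens, eval score
    ("medium", 2.0, 0.71),
    ("large",  8.0, 0.74),
    ("legacy", 3.0, 0.65),   # dominated: costs more than "medium", scores lower
]
frontier = pareto_frontier(models)
print(frontier)  # [('small', 0.5, 0.62), ('medium', 2.0, 0.71), ('large', 8.0, 0.74)]
```

Models off the frontier (like "legacy" here) are never the right production choice at any cost/quality trade-off point.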
22.5 Production Slice
- Document an "incident response for model regressions" runbook: detection → roll-back via traffic-split → investigate → fix → re-promote. The same shape as software incidents, with model-specific steps at each stage.