Week 23 - Safety, Red-Teaming, Alignment Infrastructure¶
23.1 Conceptual Core¶
- Production AI systems require a safety layer above the model: input filters, output filters, refusal handling, abuse detection. The model is one component of a safety-enforcing system.
- The dominant patterns:
- Input classification: detect prompt-injection, jailbreak attempts, content-policy violations before invoking the model. Cheap classifier or small LLM.
- Output classification: detect policy-violating output before returning. Same pattern.
- Constrained decoding: structural constraints during generation (JSON schema, regex, grammar). Reduces "the model said something invalid" failure modes.
- Refusal handling: tasteful refusals. The hardest part-bad refusals (over-refuses) are nearly as bad as harmful outputs (under-refuses).
- Red-teaming infrastructure: continuous adversarial probing of deployed models. Scaled red-teaming is itself an LLM workload.
23.2 Mechanical Detail¶
- Constrained decoding tools: outlines (CFG-based), guidance, jsonformer, vLLM's
guided_decoding(uses xgrammar / outlines under the hood). Performance: constraints add ~10-30% latency overhead; usually worth it. - Safety classifiers: Llama Guard, ShieldGemma, NVIDIA NeMo Guardrails, Anthropic's content moderation API patterns. Latency is the constraint; usually deploy as a small parallel model.
- Audit logging for AI: every inference request logged with input, output, classifier decisions, model version, request ID. Required for:
- Compliance (regulators increasingly want this).
- Debugging.
- Eval-from-production (resampling production traffic for offline eval).
- Red-teaming harnesses: PyRIT (Microsoft), Garak, internal bespoke. Run nightly; failures are P1 issues.
23.3 Lab-"A Safety Layer"¶
Take your week 21 vLLM deployment. Add: 1. Input classifier (Llama Guard or a small custom classifier)-block obvious prompt injections. 2. Output classifier-block policy-violating outputs. 3. Constrained-decoding mode for any structured-output endpoint. 4. Audit logging to a separate, append-only store. 5. A nightly red-teaming job that fires 1000 adversarial prompts; measures failure rate; alerts on regression.
23.4 Idiomatic & Diagnostic Drill¶
- The cost of safety: measure latency overhead and quality impact (does the safety layer cause false-positive refusals on benign prompts?). Track both.
23.5 Production Slice¶
- Safety infrastructure is itself a software system. It needs versioning, eval, regression testing, on-call. Treat the safety classifier with the same MLOps rigor as the main model.