Week 23 - Safety, Red-Teaming, Alignment Infrastructure¶

23.1 Conceptual Core¶

Production AI systems require a safety layer above the model: input filters, output filters, refusal handling, abuse detection. The model is one component of a safety-enforcing system.
The dominant patterns:
Input classification: detect prompt-injection, jailbreak attempts, content-policy violations before invoking the model. Cheap classifier or small LLM.
Output classification: detect policy-violating output before returning. Same pattern.
Constrained decoding: structural constraints during generation (JSON schema, regex, grammar). Reduces "the model said something invalid" failure modes.
Refusal handling: tasteful refusals. The hardest part-bad refusals (over-refuses) are nearly as bad as harmful outputs (under-refuses).
Red-teaming infrastructure: continuous adversarial probing of deployed models. Scaled red-teaming is itself an LLM workload.

23.2 Mechanical Detail¶

Constrained decoding tools: outlines (CFG-based), guidance, jsonformer, vLLM's guided_decoding (uses xgrammar / outlines under the hood). Performance: constraints add ~10-30% latency overhead; usually worth it.
Safety classifiers: Llama Guard, ShieldGemma, NVIDIA NeMo Guardrails, Anthropic's content moderation API patterns. Latency is the constraint; usually deploy as a small parallel model.
Audit logging for AI: every inference request logged with input, output, classifier decisions, model version, request ID. Required for:
Compliance (regulators increasingly want this).
Debugging.
Eval-from-production (resampling production traffic for offline eval).
Red-teaming harnesses: PyRIT (Microsoft), Garak, internal bespoke. Run nightly; failures are P1 issues.

23.3 Lab-"A Safety Layer"¶

Take your week 21 vLLM deployment. Add: 1. Input classifier (Llama Guard or a small custom classifier)-block obvious prompt injections. 2. Output classifier-block policy-violating outputs. 3. Constrained-decoding mode for any structured-output endpoint. 4. Audit logging to a separate, append-only store. 5. A nightly red-teaming job that fires 1000 adversarial prompts; measures failure rate; alerts on regression.

23.4 Idiomatic & Diagnostic Drill¶

The cost of safety: measure latency overhead and quality impact (does the safety layer cause false-positive refusals on benign prompts?). Track both.

23.5 Production Slice¶

Safety infrastructure is itself a software system. It needs versioning, eval, regression testing, on-call. Treat the safety classifier with the same MLOps rigor as the main model.