Skip to content

Week 23 - Safety, Red-Teaming, Alignment Infrastructure

23.1 Conceptual Core

  • Production AI systems require a safety layer above the model: input filters, output filters, refusal handling, abuse detection. The model is one component of a safety-enforcing system.
  • The dominant patterns:
  • Input classification: detect prompt-injection, jailbreak attempts, content-policy violations before invoking the model. Cheap classifier or small LLM.
  • Output classification: detect policy-violating output before returning. Same pattern.
  • Constrained decoding: structural constraints during generation (JSON schema, regex, grammar). Reduces "the model said something invalid" failure modes.
  • Refusal handling: tasteful refusals. The hardest part-bad refusals (over-refuses) are nearly as bad as harmful outputs (under-refuses).
  • Red-teaming infrastructure: continuous adversarial probing of deployed models. Scaled red-teaming is itself an LLM workload.

23.2 Mechanical Detail

  • Constrained decoding tools: outlines (CFG-based), guidance, jsonformer, vLLM's guided_decoding (uses xgrammar / outlines under the hood). Performance: constraints add ~10-30% latency overhead; usually worth it.
  • Safety classifiers: Llama Guard, ShieldGemma, NVIDIA NeMo Guardrails, Anthropic's content moderation API patterns. Latency is the constraint; usually deploy as a small parallel model.
  • Audit logging for AI: every inference request logged with input, output, classifier decisions, model version, request ID. Required for:
  • Compliance (regulators increasingly want this).
  • Debugging.
  • Eval-from-production (resampling production traffic for offline eval).
  • Red-teaming harnesses: PyRIT (Microsoft), Garak, internal bespoke. Run nightly; failures are P1 issues.

23.3 Lab-"A Safety Layer"

Take your week 21 vLLM deployment. Add: 1. Input classifier (Llama Guard or a small custom classifier)-block obvious prompt injections. 2. Output classifier-block policy-violating outputs. 3. Constrained-decoding mode for any structured-output endpoint. 4. Audit logging to a separate, append-only store. 5. A nightly red-teaming job that fires 1000 adversarial prompts; measures failure rate; alerts on regression.

23.4 Idiomatic & Diagnostic Drill

  • The cost of safety: measure latency overhead and quality impact (does the safety layer cause false-positive refusals on benign prompts?). Track both.

23.5 Production Slice

  • Safety infrastructure is itself a software system. It needs versioning, eval, regression testing, on-call. Treat the safety classifier with the same MLOps rigor as the main model.

Comments