
Workshop - Structured output with grammar-constrained decoding

Difficulty: Deep | Time: 90 min
Needs: Python 3.11+, Anthropic API key, OpenAI API key, outlines + transformers + torch, ~2 GB disk for a tiny local model

Before you start:

  • Comfortable with Python and Pydantic
  • Have called an LLM API and used json.loads on a response at least once
  • Familiar with regex at a basic level (the workshop shows JSON Schema compiled to regex)

Launch in Killercoda: free browser-based environment - no install required to follow along.

Companion to AI Systems -> Month 05 -> Week 17: Serving Systems, and the sixth in the AI implementations workshop series. Every production AI feature that touches a database, an API, or another system needs the LLM to emit data the next stage can parse. The naive approach - "return JSON" in a prompt and json.loads the result - fails 5-15% of the time on real workloads, which is the difference between a feature that works and a feature that pages you. This workshop walks four approaches in order of strength, ending with grammar-constrained decoding - a mechanism where the token sampler refuses any token that would lead to invalid output. By the end you'll have built four pipelines (prompt-only, Anthropic tool use, OpenAI strict mode, and local grammar-constrained), measured the failure rates of each, and seen the token-level constraint in action against a local model.

~90 minutes. Needs: Python 3.11+, Anthropic API key, OpenAI API key, the outlines package, ~2 GB of disk for a small local model (Qwen 2.5 0.5B works). No GPU required - CPU inference on a tiny model is fast enough to see the mechanics.

What you'll build, and the idea it makes concrete

You'll build the same extraction pipeline four times: extract {name, age, email, occupation} from a paragraph of natural-language biography. First the prompt-only version that fails ~10% of the time, then the Anthropic tool-use version that almost never fails, then OpenAI's response_format with strict: true, then a true grammar-constrained version using Outlines against a local model where you'll watch the logit mask suppress invalid tokens in real time.

The idea this makes concrete:

There are two fundamentally different mechanisms for "make the model output match a schema." The first is prompt engineering + validation + retry: ask nicely, parse, and if parsing fails, try again. The model can still emit anything; you just hope it emits the right thing, and the model is the layer making the decision. The second is constrained decoding: at every token step, the sampler is given a mask of which tokens are valid given the current parse state, and tokens that would break the schema are assigned probability zero. The model cannot emit invalid output because the invalid tokens are not even on the menu. The schema is the layer making the decision; the model just picks the best valid token. Once you've seen the second mechanism, the first one's failure modes become predictable and the production pattern becomes obvious.

A second idea, just as important:

Pick the strongest mechanism your serving stack supports, then move on. If you control the inference (vLLM, llama.cpp, SGLang), use grammar-constrained decoding - it is strictly the strongest option. If you're calling a hosted API (Anthropic, OpenAI, Google), use that API's native structured-output mode (tool use for Anthropic, response_format with json_schema for OpenAI). Only fall back to "prompt-and-retry" when you have neither, and treat that fallback as the operational debt it is.
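For reference, the prompt-and-retry fallback looks roughly like the sketch below. It is a minimal illustration rather than a recommended design: generate is a hypothetical callable standing in for any model call, and the canned replies replace a real API so the loop can run offline.

```python
import json
from typing import Callable, Optional

def extract_with_retry(
    generate: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> tuple[Optional[dict], int]:
    """Call generate until the reply parses as a JSON object.

    Returns (parsed_object_or_None, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        raw = generate(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Feed the error back so the next attempt has something to fix.
            prompt += f"\n\nYour last reply was not valid JSON ({exc.msg}). Return only JSON."
            continue
        if isinstance(data, dict):
            return data, attempt
    return None, max_attempts

# Canned replies standing in for a model (hypothetical data):
# chatty preamble first, clean JSON second.
replies = iter([
    'Here is the extracted data: {"name": "Bob Lee"}',
    '{"name": "Bob Lee", "age": 42}',
])
result, attempts = extract_with_retry(lambda _prompt: next(replies), "Extract ...")
print(result, attempts)
```

Every failed attempt is a full model call you paid for, and the loop can still run out of attempts. That is the operational debt.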

Step 0: the architecture you're about to build twice

PROMPT-AND-RETRY                          GRAMMAR-CONSTRAINED
+--------+                                 +--------+
| prompt |                                 | prompt |
+---+----+                                 +---+----+
    v                                          v
+--------+                                 +--------+
| model  |                                 | model  |
+---+----+                                 +---+----+
    | (full vocab, 100k tokens)                 | logits over 100k tokens
    v                                          v
+----------+                              +-----------+
|  text    |                              |  GRAMMAR  | <- (compiled from your
| (maybe   |                              |  FILTER   |     JSON Schema or
|  JSON?)  |                              +-----+-----+     Pydantic model)
+----+-----+                                    |
    v                                            | mask: 0 for invalid tokens,
+----------+                                    | 1 for valid tokens
| json.    |                                    v
| loads()  |                              +-----------+
+----+-----+                              | masked    |
    | maybe ValueError                    | softmax   |
    v                                     +-----+-----+
+----------+                                    v
| validate |                              +-----------+
+----+-----+                              | sample    |
    | maybe ValidationError               +-----+-----+
    v                                          |
+----------+                                    v
| retry,   |                              +-----------+
| or give  |                              | next token|
| up       |                              | GUARANTEED|
+----------+                              | valid     |
                                          +-----------+

The crucial difference: in prompt-and-retry, the validation happens after sampling, so the model can spend tokens on bad output and you eat the cost. In constrained decoding, the validation happens before sampling, so every token spent is on valid output. There is no retry loop for malformed output because the sampler cannot produce it.
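The right-hand column can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the five-token vocabulary and the logits are invented for illustration, and a real grammar filter would compute the valid list from the parse state rather than hard-code it.

```python
import math

def masked_softmax(logits: list[float], valid: list[bool]) -> list[float]:
    """Softmax over logits after setting invalid positions to -inf."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits; this "model" prefers chatty text.
vocab = ["Here", "{", "Sure", '"', "```"]
logits = [3.1, 1.2, 2.7, 0.4, 2.9]
# At the start of a JSON object only "{" is allowed.
valid = [tok == "{" for tok in vocab]

probs = masked_softmax(logits, valid)
for tok, p in zip(vocab, probs):
    print(f"{tok!r:8} {p:.3f}")
```

Even though "{" has one of the lowest raw logits, it ends up with probability 1.0 because every competitor was removed before the softmax.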

Step 1: the naive prompt baseline (and watch it fail)

Start with the obvious approach. Define the schema with Pydantic:

# extract.py
# EmailStr needs the optional email-validator dependency:
#   pip install "pydantic[email]"
from pydantic import BaseModel, EmailStr, Field

class Person(BaseModel):
    name: str = Field(description="Full name")
    age: int = Field(description="Age in years")
    email: EmailStr = Field(description="Primary email address")
    occupation: str = Field(description="Job title or role")

The corpus to test against - 100 short biographies (you can grab any source; for the workshop, generate them with the LLM itself or use a public dataset):

BIOGRAPHIES = [
    "Alice Johnson is a 34-year-old software engineer at GlobalCorp. "
    "She lives in Lagos and her work email is alice.j@globalcorp.example.",
    "Bob Lee, age 42, works as a data scientist. Contact: bob@example.org.",
    # ... 98 more
]

The naive extractor:

import json
from anthropic import Anthropic
from pydantic import ValidationError

claude = Anthropic()

PROMPT = """Extract the person's information from the text below as JSON with
keys: name, age, email, occupation. Return ONLY the JSON, no preamble.

Text: {text}
"""

def extract_naive(text: str) -> Person | None:
    resp = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    raw = "".join(b.text for b in resp.content if b.type == "text")
    # json.loads raises json.JSONDecodeError on malformed JSON. Let it
    # propagate so the benchmark loop can count it as a parse failure.
    data = json.loads(raw)
    try:
        return Person.model_validate(data)
    except ValidationError:
        # Well-formed JSON that does not match the Person schema.
        return None

Measure the failure rate over 100 biographies:

ok = 0
parse_fail = 0
schema_fail = 0
for bio in BIOGRAPHIES:
    try:
        person = extract_naive(bio)
        if person is None:
            schema_fail += 1
        else:
            ok += 1
    except json.JSONDecodeError:
        parse_fail += 1

print(f"ok: {ok}  parse_fail: {parse_fail}  schema_fail: {schema_fail}")

Typical result on a strong model:

ok: 92  parse_fail: 3  schema_fail: 5

92% success rate. On a casual reading this looks fine; on a production budget it's a disaster. Assuming failures are independent, a pipeline of three such calls in series has a 0.92³ = 78% end-to-end success rate, and a pipeline of ten has a 43% success rate. A 5-15% per-call failure rate compounds catastrophically, and the failures aren't always loud - sometimes a field is silently dropped, and downstream code receives a malformed record without noticing.
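The compounding arithmetic is easy to reproduce. The sketch below assumes each call fails independently with the same probability, which is a simplification of real workloads.

```python
def end_to_end_success(per_call: float, calls: int) -> float:
    """Probability that every call in a serial chain succeeds (independent failures)."""
    return per_call ** calls

for n in (1, 3, 10):
    print(f"{n:2d} calls: {end_to_end_success(0.92, n):.0%}")
```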

The failures break down into three categories you'll see in the logs:

  • Parse failures: model wrote ```json {...} ``` with code fences instead of bare JSON. Or added "Here is the extracted data:" before the JSON. Or left a trailing comma. ~30% of failures.
  • Schema failures: model returned "age": "34" (string) instead of 34 (int). Or "email": "alice.j at globalcorp.example" because it was being "helpful." Or invented a "location": "Lagos" field you didn't ask for. ~60% of failures. Note that Pydantic's defaults are lenient here: lax mode coerces "34" to 34 and unknown fields are ignored, so the string age and the extra field are only caught if you validate strictly (strict=True and extra="forbid" in the model config).
  • Hallucinated values: model filled in an email field for a biography that didn't mention email. This passes Pydantic validation, so the benchmark above counts it as ok even though it is wrong. The most dangerous category because it looks fine. ~10% of failures.

You cannot prompt your way out of this reliably. You can drive the parse-failure rate down with better prompting; you cannot drive the schema and hallucination failures to zero by asking nicely. The model is the layer making the call, and the model is fallible.
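To make the first two categories concrete, here is a standard-library sketch that sorts raw responses into ok, parse_fail, and schema_fail. It is deliberately stricter than Pydantic's default behaviour (no type coercion, no extra keys), the "@" check is only a crude stand-in for EmailStr, and the sample responses are invented. Hallucinated values are not detectable at this layer.

```python
import json

REQUIRED = {"name": str, "age": int, "email": str, "occupation": str}

def classify(raw: str) -> str:
    """Return 'ok', 'parse_fail', or 'schema_fail' for one raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "parse_fail"
    if not isinstance(data, dict) or set(data) != set(REQUIRED):
        return "schema_fail"  # missing or extra keys
    for key, expected in REQUIRED.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            return "schema_fail"
    if "@" not in data["email"]:
        return "schema_fail"  # crude stand-in for EmailStr
    return "ok"

# Invented sample responses, one per failure shape described above.
good = '{"name": "Alice Johnson", "age": 34, "email": "alice.j@globalcorp.example", "occupation": "software engineer"}'
fenced = "```json\n" + good + "\n```"
string_age = good.replace("34", '"34"')
helpful_email = good.replace("alice.j@globalcorp.example", "alice.j at globalcorp.example")

for label, raw in [("good", good), ("fenced", fenced),
                   ("string_age", string_age), ("helpful_email", helpful_email)]:
    print(f"{label:14} {classify(raw)}")
```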

Step 2: Anthropic tool use (the right way for Claude)

Anthropic's tools parameter lets you express the schema as a "tool" the model can call. The model is trained to fill the tool_use block's input field according to the tool's input_schema, and in practice the result almost always complies. Anthropic doesn't publish whether plain tool use applies full constrained decoding, so treat compliance as an empirical observation rather than a guarantee and keep the Pydantic validation in place.

from pydantic import ValidationError

def extract_anthropic_tool(text: str) -> Person | None:
    resp = claude.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        tools=[{
            "name": "save_person",
            "description": "Save the extracted person data.",
            "input_schema": Person.model_json_schema(),
        }],
        tool_choice={"type": "tool", "name": "save_person"},
        messages=[{"role": "user", "content": f"Extract person info from: {text}"}],
    )
    for block in resp.content:
        if block.type == "tool_use" and block.name == "save_person":
            try:
                # block.input is already a dict, so no json.loads is needed.
                return Person.model_validate(block.input)
            except ValidationError:
                return None  # counted as a schema failure by the benchmark
    return None

Two important details:

  • tool_choice={"type": "tool", "name": "save_person"} forces the model to call this specific tool. Without it, the model might just respond with text. With it, the response contains a tool_use block for save_person whose input is filled in against your schema.
  • Person.model_json_schema() generates the JSON Schema from your Pydantic model. This is your contract with the API; Anthropic passes it to the model as the shape the tool input should take.
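One way to keep these two details testable without network access is to build the request arguments in a pure function. The sketch below is an assumption-laden illustration: forced_tool_request is a hypothetical helper, and PERSON_SCHEMA is a hand-written approximation of what Person.model_json_schema() produces (the real output also carries titles and descriptions).

```python
def forced_tool_request(tool_name: str, input_schema: dict, text: str) -> dict:
    """Build keyword arguments for messages.create that force one tool call."""
    return {
        "max_tokens": 500,
        "tools": [{
            "name": tool_name,
            "description": "Save the extracted person data.",
            "input_schema": input_schema,
        }],
        # {"type": "auto"} here would let the model answer in plain text instead.
        "tool_choice": {"type": "tool", "name": tool_name},
        "messages": [{"role": "user", "content": f"Extract person info from: {text}"}],
    }

# Hand-written approximation of the generated schema (assumption, see lead-in).
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string", "format": "email"},
        "occupation": {"type": "string"},
    },
    "required": ["name", "age", "email", "occupation"],
}

request = forced_tool_request("save_person", PERSON_SCHEMA, "Bob Lee, age 42 ...")
print(request["tool_choice"])
```

You would then pass the result along with a model name, for example claude.messages.create(model=..., **request).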

Run the same 100-biography benchmark:

ok: 99  parse_fail: 0  schema_fail: 1

99% success rate. The single failure was a hallucinated email - the schema was satisfied but the value was wrong. Parse and schema failures are essentially zero. For most production work, this is "good enough" - one in 100 hallucinations is recoverable with downstream validation, and you've eliminated 90% of the failure surface.

Step 3: OpenAI's response_format with strict JSON Schema

OpenAI introduced response_format: { type: "json_schema", json_schema: { strict: true, ... } } in August 2024. With strict: true, OpenAI applies grammar-constrained decoding on their servers, so a response that completes normally is guaranteed to match your schema. The exceptions are refusals and responses cut off by the token limit, both of which the API reports explicitly. No retries for malformed output, no validation failures, just valid output.

from openai import OpenAI

# Named openai_client so the instance does not shadow the openai package.
openai_client = OpenAI()

def extract_openai_strict(text: str) -> Person | None:
    resp = openai_client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        response_format=Person,  # Pydantic model accepted directly
        messages=[{"role": "user", "content": f"Extract person info: {text}"}],
    )
    # parsed is None when the model refuses instead of answering.
    return resp.choices[0].message.parsed

The .parse helper handles the schema translation and result parsing. Under the hood, OpenAI converts your schema into a context-free grammar and uses it to mask logits at every decoding step. Run the benchmark:

ok: 100  parse_fail: 0  schema_fail: 0

100% success rate on schema compliance. The remaining failure mode is hallucinated values, which no decoding strategy can prevent - the schema doesn't know what's true about the world, only what's well-formed.

This is the production sweet spot for OpenAI-served workloads. The performance overhead is small (a few percent of tokens-per-second) and the operational complexity is zero.
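If you build the response_format payload by hand instead of using .parse, strict mode puts requirements on the schema itself. Two documented ones are that every object sets additionalProperties to false and lists every property in required. The sketch below checks only those two rules; strict_mode_problems is a hypothetical helper, and it ignores $ref, anyOf, and the other restrictions in OpenAI's documentation.

```python
def strict_mode_problems(schema: dict, path: str = "$") -> list[str]:
    """List places where a schema breaks the two strict-mode rules checked here."""
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {missing}")
        for name, sub in props.items():
            problems += strict_mode_problems(sub, f"{path}.{name}")
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems += strict_mode_problems(schema["items"], f"{path}[]")
    return problems

# Illustrative schemas (not generated by any library).
LOOSE = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name"],
}
STRICT = {**LOOSE, "required": ["name", "age"], "additionalProperties": False}

print(strict_mode_problems(LOOSE))
print(strict_mode_problems(STRICT))
```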

Step 4: build your own grammar-constrained decoding with Outlines

So far we've used hosted APIs whose internals are opaque. To see the constraint mechanism, run it against a local model with the outlines library:

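$ pip install outlines transformers torch

# The outlines.generate API below comes from pre-1.0 Outlines releases; if
# your installed version lacks it, pin an older release.
import outlines
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# A small model so this runs on CPU. Pick anything; the mechanism is universal.
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
# models.transformers(...) expects a model name string; the Transformers class
# wraps a model and tokenizer that are already loaded.
out_model = outlines.models.Transformers(model, tokenizer)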

# Compile the Pydantic schema into a grammar.
generator = outlines.generate.json(out_model, Person)

def extract_outlines(text: str) -> Person:
    prompt = f"Extract person info as JSON: {text}\n\n"
    return generator(prompt, max_tokens=200)

Run it on a biography. You'll get a valid Person instance. The interesting thing is what's happening at every token step.

To see the mechanism, drop into Outlines' lower-level API and inspect the mask at one decoding step:

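import json
import outlines.fsm.json_schema as fsmjson

# build_regex_from_schema takes the schema as a JSON string, not a dict,
# and returns a regex string (the FSM is built from it in a later step).
schema_json = json.dumps(Person.model_json_schema())
regex_str = fsmjson.build_regex_from_schema(schema_json)
print("Compiled regex (first 300 chars):")
print(regex_str[:300])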

Output (lightly cleaned):

\{[ ]?"name"[ ]?:[ ]?"(?:[^"\\\x00-\x1f]|\\["\\bfnrt/]|\\u[0-9a-fA-F]{4})*"[ ]?,[ ]?
"age"[ ]?:[ ]?(0|[1-9][0-9]*)[ ]?,[ ]?
"email"[ ]?:[ ]?"...email regex...",[ ]?
"occupation"[ ]?:[ ]?"...string regex..."[ ]?\}

Your Pydantic schema has been compiled into a regular expression. That regex defines the set of all valid outputs. Outlines converts the regex to a finite-state automaton; at each token-generation step, the FSM tells the sampler which next tokens would keep the output in a valid state and which would break it. Invalid tokens get their logits set to -inf, which is 0 after softmax, which means they cannot be sampled.
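To see the regex-to-automaton-to-mask chain without a model, here is a hand-written automaton for just the age sub-pattern, (0|[1-9][0-9]*), with a token filter on top. It is a sketch of the idea, not Outlines' implementation, and the seven-token vocabulary is invented.

```python
# States of a DFA for the regex (0|[1-9][0-9]*).
START, ZERO, DIGITS, DEAD = "start", "zero", "digits", "dead"
ACCEPTING = {ZERO, DIGITS}

def step(state: str, ch: str) -> str:
    """Advance the automaton by one character."""
    if state == START:
        if ch == "0":
            return ZERO
        if ch in "123456789":
            return DIGITS
    elif state == DIGITS and ch in "0123456789":
        return DIGITS
    return DEAD  # ZERO has no outgoing edges, so "05" is rejected

def advance(state: str, token: str) -> str:
    """Advance the automaton by every character of a (multi-character) token."""
    for ch in token:
        state = step(state, ch)
        if state == DEAD:
            break
    return state

def valid_tokens(state: str, vocab: list[str]) -> list[str]:
    """The mask: tokens that keep the automaton out of the dead state."""
    return [tok for tok in vocab if advance(state, tok) != DEAD]

vocab = ["0", "3", "34", "05", "4a", "-", " "]  # toy vocabulary (assumption)
print(valid_tokens(START, vocab))                # ['0', '3', '34']
print(valid_tokens(advance(START, "3"), vocab))  # ['0', '3', '34', '05']
print(valid_tokens(ZERO, vocab))                 # []
```

Notice that "05" is invalid at the start but valid after a leading "3": the mask depends on the current state, which is why it has to be recomputed at every step. The empty list after "0" means the sub-pattern is complete; in the full schema regex the next valid characters would come from whatever follows the integer.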

To see the masking in action, intercept the logits:

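# Manual step-by-step generation to inspect the mask.
from outlines.processors import JSONLogitsProcessor

# The processor expects the Outlines tokenizer wrapper, not the raw
# Hugging Face tokenizer, so reuse the one attached to out_model.
processor = JSONLogitsProcessor(schema=Person, tokenizer=out_model.tokenizer)

input_ids = tokenizer("Extract from: Alice is 34.\n\n", return_tensors="pt").input_ids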
for step in range(5):
    with torch.no_grad():
        logits = model(input_ids).logits[:, -1, :]   # logits for next token
    masked = processor(input_ids, logits.clone())
    # Count how many tokens went from finite to -inf
    suppressed = (logits.isfinite() & ~masked.isfinite()).sum().item()
    total_vocab = logits.shape[-1]
    valid = (~masked.isinf()).sum().item()
    print(f"step {step}: vocab={total_vocab} valid={valid} suppressed={suppressed}")
    # Greedy sample for demo.
    next_token = masked.argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    print(f"  -> token {next_token.item()}: {tokenizer.decode(next_token[0])!r}")

Typical output:

step 0: vocab=151936 valid=1 suppressed=151935
  -> token 90: '{'
step 1: vocab=151936 valid=1 suppressed=151935
  -> token 1: '"'
step 2: vocab=151936 valid=4 suppressed=151932
  -> token 606: 'name'
step 3: vocab=151936 valid=1 suppressed=151935
  -> token 1: '"'
step 4: vocab=151936 valid=2 suppressed=151934
  -> token 25: ':'

Look at the numbers. At the first step, of 151,936 possible tokens in the vocab, exactly one is valid - the opening {. At step 2, four tokens are valid. The compiled regex above fixes the key order, so the first key has to be name; the survivors are tokens that spell a prefix of it (for example n, na, nam, and name), not four alternative keys. The grammar is making the choice; the model is just picking the most likely valid token.

This is the mechanism. The model's "decision" to emit valid JSON is no longer a decision the model makes; the grammar makes the decision for it, by removing every invalid token from the menu. There is no schema failure mode because there is no choice to fail at.

Step 5: complex schemas - discriminated unions and nested structures

The pattern scales to richer schemas. Suppose your extraction needs to handle two different document types:

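from typing import Literal, Union

from pydantic import BaseModel, Field

# Note: this Person replaces the four-field Person from Step 1.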

class Person(BaseModel):
    kind: Literal["person"] = "person"
    name: str
    age: int

class Organization(BaseModel):
    kind: Literal["organization"] = "organization"
    name: str
    founded: int

class Document(BaseModel):
    summary: str = Field(max_length=200)
    entities: list[Union[Person, Organization]] = Field(max_length=10)

Outlines (and OpenAI's strict mode, within the subset of JSON Schema it supports) compile this into a grammar where the entities field is a list of objects, each of which is either a Person or an Organization distinguished by the kind field. The model can emit:

{
  "summary": "Two engineers founded Acme in 2020.",
  "entities": [
    {"kind": "person", "name": "Alice", "age": 34},
    {"kind": "person", "name": "Bob", "age": 41},
    {"kind": "organization", "name": "Acme", "founded": 2020}
  ]
}

The grammar enforces all of: the kind field's literal values, the type of age vs founded, the max_length on summary, and the max_length=10 on the list. The model cannot emit a malformed version.

This is a real production pattern - one schema that handles multiple shapes, with discriminated unions selecting between them. Without grammar-constrained decoding it requires hand-rolled validation and high-rate retries; with it, the schema is the validation.
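For contrast, the hand-rolled side of that comparison looks something like this standard-library sketch. The class names PersonEntity and OrganizationEntity are hypothetical stand-ins for the Pydantic models, and dataclasses do not check field types, so this covers only the dispatch on kind and the field names.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class PersonEntity:
    name: str
    age: int

@dataclass
class OrganizationEntity:
    name: str
    founded: int

# The discriminator value selects which shape the rest of the object must have.
ENTITY_TYPES = {"person": PersonEntity, "organization": OrganizationEntity}

def parse_entity(raw: dict) -> Union[PersonEntity, OrganizationEntity]:
    fields = dict(raw)  # copy so the caller's dict is not mutated
    kind = fields.pop("kind", None)
    if kind not in ENTITY_TYPES:
        raise ValueError(f"unknown kind: {kind!r}")
    # Raises TypeError when a field is missing or belongs to the other shape.
    return ENTITY_TYPES[kind](**fields)

print(parse_entity({"kind": "person", "name": "Alice", "age": 34}))
print(parse_entity({"kind": "organization", "name": "Acme", "founded": 2020}))
```

With a grammar in place, the object with kind "organization" and an age field can never be generated, so the TypeError branch stops being a runtime concern.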

Step 6: regex-constrained extraction for specific formats

A common production need: extract a phone number, but only in a specific format. Outlines supports raw regex constraints:

phone_generator = outlines.generate.regex(
    out_model,
    r"\+1-\d{3}-\d{3}-\d{4}",   # US E.164-ish
)

result = phone_generator("Extract the phone: Call us at 415-555-1234 anytime.")
# result: "+1-415-555-1234"

The grammar guarantees the output matches the regex exactly. Same mechanism, simpler schema. Useful for: phone numbers, dates in a specific format, SKU codes, currency amounts with mandatory units, anything where the format is regular.
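The same pattern doubles as a cheap check on the consuming side, which is handy when you want to assert the guarantee in tests or validate values that arrive from a less strict source. A minimal sketch using only the standard library:

```python
import re

# Same pattern as the generator; re.ASCII keeps \d to the digits 0-9.
PHONE = re.compile(r"\+1-\d{3}-\d{3}-\d{4}", re.ASCII)

def is_valid_phone(text: str) -> bool:
    # fullmatch anchors both ends, so surrounding prose makes the check fail.
    return PHONE.fullmatch(text) is not None

print(is_valid_phone("+1-415-555-1234"))          # True
print(is_valid_phone("415-555-1234"))             # False
print(is_valid_phone("Call +1-415-555-1234 now")) # False
```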

Step 7: streaming structured output

A practical issue with constrained decoding: when the schema is large, the user waits for the full structure to generate before seeing anything. For chat-like UX, you want to stream. The trick: stream the underlying tokens through a JSON parser that emits events as fields complete.

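import outlines

# Any extraction prompt works here; this one is only an example.
prompt = "Extract the entities as JSON: Two engineers founded Acme in 2020.\n\n"
stream = outlines.generate.json(out_model, Document).stream(prompt)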
buffer = ""
for token in stream:
    buffer += token
    # Try a partial parse on the buffer. ijson, json-stream, or hand-rolled
    # tolerant parsers exist for exactly this case.
    if buffer.endswith('",') or buffer.endswith('"]'):
        # A field or list element just completed; emit it to the UI.
        ...

Anthropic's and OpenAI's streaming APIs deliver the partial JSON for you: Anthropic sends input_json_delta events for tool_use blocks, and OpenAI streams the structured content or tool-call arguments as incremental deltas. Workshop 8 covers streaming as its own topic; for now, know that streaming + structured output compose, and the structure is preserved throughout.
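The endswith heuristic in the loop above breaks as soon as a string value contains a comma. A slightly sturdier hand-rolled approach tracks string state and nesting depth, and emits a top-level field whenever one closes. This is a sketch assuming the stream is a single JSON object; iter_completed_fields is a hypothetical helper, and it re-parses the buffer on each completed field, which is fine for small payloads but not tuned for large ones.

```python
import json
from typing import Iterable, Iterator

def iter_completed_fields(chunks: Iterable[str]) -> Iterator[tuple[str, object]]:
    """Yield (key, value) for each top-level field as soon as it is complete."""
    buffer = ""
    depth = 0
    in_string = False
    escaped = False
    seen: set[str] = set()
    for chunk in chunks:
        for ch in chunk:
            buffer += ch
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
                continue
            if ch == '"':
                in_string = True
            elif ch in "{[":
                depth += 1
            elif ch in "}]":
                depth -= 1
            # A top-level field ends at a depth-1 comma or at the final brace.
            if (ch == "," and depth == 1) or (ch == "}" and depth == 0):
                closed = buffer[:-1] + "}" if ch == "," else buffer
                for key, value in json.loads(closed).items():
                    if key not in seen:
                        seen.add(key)
                        yield key, value

# Chunks split mid-string on purpose; the comma inside the summary is ignored.
chunks = [
    '{"summary": "Two, eng',
    'ineers", "entities": [{"kind": "person", "name": "Alice"}',
    "]}",
]
events = list(iter_completed_fields(chunks))
for key, value in events:
    print(key, "->", value)
```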

Step 8: break it (the failure modes of constrained decoding)

Grammar-constrained decoding eliminates schema failures but introduces its own failure modes worth knowing.

8.1 The "saturated" constraint

If your schema is too restrictive - e.g., a regex that forces a specific number with no alternative - the grammar can force the model to emit something it doesn't believe. Example: Literal["yes", "no"] constrains the model to one of two answers. If the input is "maybe" or "I'm not sure," the model is forced to pick "yes" or "no" regardless of what would be appropriate. The grammar guarantees syntactic validity, not semantic correctness.

Fix: include "I don't know" / "unknown" / Optional fields in your schema for cases the model legitimately can't answer. Don't force false certainty.
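A minimal sketch of that fix, assuming a yes/no classification task: the schema gains an explicit "unknown" choice and a nullable reason, and the consumer routes uncertain answers to review. The schema, the route function, and its labels are all illustrative, not part of any library.

```python
import json

# Illustrative schema: "unknown" gives the model a valid way to decline.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"enum": ["yes", "no", "unknown"]},
        "reason": {"type": ["string", "null"]},
    },
    "required": ["answer", "reason"],
    "additionalProperties": False,
}

def route(raw_json: str) -> str:
    """Send uncertain answers to review instead of treating them as decisions."""
    answer = json.loads(raw_json)["answer"]
    return "needs_review" if answer == "unknown" else "auto_accept"

print(route('{"answer": "yes", "reason": "The text says so."}'))
print(route('{"answer": "unknown", "reason": null}'))
```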

8.2 Performance overhead

Building the grammar costs CPU on first use (compiling a JSON Schema into an FSM can take 100ms-1s for complex schemas). After compilation, the per-token overhead is small (~2-5%) but non-zero. Cache the compiled grammar; don't recompile per request.

8.3 Tokenizer alignment

Constrained decoding masks at the token level, but your grammar is defined at the character level. The library has to align them, and some token boundaries don't fit cleanly. With unusual vocabularies or character-level constraints, you can hit edge cases where a valid output is unreachable because no token sequence happens to align with it. Modern libraries (Outlines, XGrammar) handle the common cases; if you hit one, simplify the grammar or switch tokenizers.
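A toy illustration of the alignment problem, using an invented vocabulary and a grammar that requires the literal string name. Several different token sequences can spell the same characters, and with an awkward vocabulary there may be none at all. Real libraries solve this over the whole automaton; this sketch only handles a fixed literal.

```python
from functools import lru_cache

def valid_next(remaining: str, vocab: tuple[str, ...]) -> list[str]:
    """Tokens that consume a non-empty prefix of the characters still required."""
    return [tok for tok in vocab if tok and remaining.startswith(tok)]

def count_paths(target: str, vocab: tuple[str, ...]) -> int:
    """Number of distinct token sequences that spell target exactly."""
    @lru_cache(maxsize=None)
    def go(rest: str) -> int:
        if not rest:
            return 1
        return sum(go(rest[len(tok):]) for tok in valid_next(rest, vocab))
    return go(target)

vocab = ("n", "na", "nam", "name", "a", "m", "e", "me", "x")  # toy vocabulary
print(valid_next("name", vocab))          # ['n', 'na', 'nam', 'name']
print(count_paths("name", vocab))         # 6
print(count_paths("name", ("na", "x")))   # 0: the target is unreachable
```

The last line is the edge case described above: the grammar allows the string, but no sequence of available tokens lands on it.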

8.4 The "weak model" problem

Constraining a model to a schema cannot make it smarter. A model that doesn't understand a domain will emit confidently-wrong but schema-valid output. Don't confuse "the JSON validated" with "the answer is correct." Schema validation is a syntax check; semantic correctness needs separate validation (cross-checks against a database, range checks, secondary LLM verification, human review).
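A few semantic checks are cheap enough to run on every record. The sketch below assumes the extraction task from Step 1; the age range is an arbitrary assumption, and the substring tests are crude (they would miss an age written out in words), but they catch the invented-email case described earlier.

```python
def semantic_problems(record: dict, source_text: str) -> list[str]:
    """Checks a schema cannot express: is each value supported by the source?"""
    problems = []
    if not 0 < record["age"] < 130:  # plausibility bound (assumption)
        problems.append(f"age out of range: {record['age']}")
    if record["email"].lower() not in source_text.lower():
        problems.append("email does not appear in the source text")
    if str(record["age"]) not in source_text:
        problems.append("age does not appear in the source text")
    return problems

bio = "Bob Lee, age 42, works as a data scientist. Contact: bob@example.org."
good = {"name": "Bob Lee", "age": 42, "email": "bob@example.org",
        "occupation": "data scientist"}
invented = {**good, "email": "bob.lee@globalcorp.example"}

print(semantic_problems(good, bio))
print(semantic_problems(invented, bio))
```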

Step 9: production patterns

What you actually ship in 2026:

  • For hosted-API workloads: use the API's native structured output. Anthropic tools with tool_choice forcing the call; OpenAI response_format with strict: true; Gemini's JSON mode. Each is a 5-line change from prompt-and-parse and eliminates the parse-failure surface.
  • For self-hosted inference: use Outlines or XGrammar (XGrammar is faster on large models in 2026). Wire them into your vLLM or SGLang serving stack. Both have first-class support.
  • For the Pydantic-first developer experience: use instructor. It wraps both Anthropic and OpenAI with a unified Pydantic interface and falls back gracefully when grammar constraints aren't available. The default abstraction layer for new production code.
  • Schema versioning: name your schemas (e.g., Person_v2) and version them like API contracts. When you change a schema, you change downstream consumers too. Pydantic's model_config = ConfigDict(extra="forbid") catches accidental drift.
  • Semantic validation: always layer on a post-LLM check. Schema-valid != correct. Run the email through a real validator, check the date range, verify the entity exists in your database. The schema is a syntactic guarantee, not a truth guarantee.

Now extend it

  1. Add a JSON Schema validation step at runtime. Even when using strict mode, validate again with jsonschema or Pydantic's model_validate. Defense in depth costs almost nothing.
  2. Benchmark XGrammar vs Outlines on a real schema. XGrammar (from MLC-AI, released 2024) is claimed by its authors to be up to 100x faster than earlier constrained-decoding engines. Run both on your schemas and pick the winner for your latency budget.
  3. Build a "schema as code" extraction pipeline. Take a corpus of unstructured documents (resumes, invoices, papers) and extract structured records into Postgres. Compare extraction accuracy across the four mechanisms (prompt, tool use, response_format strict, Outlines local).
  4. Add a semantic-verification step. After extraction, ask a different LLM call: "given this source text and this extraction, is the extraction faithful?" Refuse extractions the verifier rejects. Catches the hallucinated-value failure mode that grammar can't.
  5. Read the Outlines source. It's a few thousand lines and the constraint-compilation logic is genuinely interesting. The FSM construction in outlines.fsm is the canonical reference for how to do this.

What you might wonder

"Why doesn't every API do constrained decoding?" They mostly do now, but the first wave (OpenAI's JSON mode, late 2023) did not apply a schema constraint - the output was steered toward syntactically valid JSON but not toward your schema. OpenAI's strict: true (2024) brought schema-level constraint to a hosted API, and Anthropic's tool use (generally available in 2024) made schema-shaped output the default path for Claude. Constrained decoding adds modest inference cost and some implementation complexity; the trade-off shifted in favor of correctness as more production code started depending on the output shape.

"What about Pydantic + retry as the strategy?" It works for low-stakes use cases. You will pay 2-5x the tokens for the retries; you'll still get occasional failures that exceed your retry budget; debugging the failures is harder than not having them. Use it as a fallback when the API doesn't support strict mode; don't choose it when strict mode is available.

"Can I constrain the model to ALWAYS emit a citation?" Yes. Define a schema like {"answer": str, "citation": str} with both required. The model cannot emit a response without a citation. This is one of the most effective hallucination defenses for RAG: the model must point at something in the source, which makes the source more salient during generation. The grammar guarantees only that the citation field is present, not that the cited text exists, so check the citation against the source before trusting it.

"How does grammar-constrained decoding interact with chain-of-thought?" Two patterns work. The CoT-first pattern: ask the model to think freely (no grammar), then in a second call constrain it to the schema using the free-form output as context. The "reasoning then JSON" pattern in OpenAI's o1/o3 series: the reasoning happens off-the-record (hidden) and the final response is constrained. Don't try to grammar-constrain the thinking itself - you'll lose the value of CoT.

"What about tool-use loops + structured output?" Each tool_use block's input is already structured; that's the same mechanism. You're getting structured output on every step of your agent for free. The agent's final answer can also be constrained by making it a tool call (e.g., final_answer is a tool whose input is the response schema).

"Is there an XML-output equivalent?" Yes, though less common. Outlines supports CFG (context-free grammar) constraints; you can write a grammar for any well-defined format. CSV, YAML, SQL, ProtoBuf textual representations - all expressible. JSON dominates because every consumer can parse it.

What this gave you

  • You measured a real schema failure rate (~5-15%) on the prompt-and-parse baseline.
  • You watched two hosted APIs (Anthropic tool-use, OpenAI strict mode) eliminate the failure surface.
  • You ran true grammar-constrained decoding locally with Outlines and inspected the per-token mask - 151,935 of 151,936 tokens suppressed at the first step.
  • You saw schemas with discriminated unions, nested structures, length constraints, and regex compile to FSMs that the sampler enforces.
  • You can articulate when constrained decoding helps (every production data path), when it doesn't (semantic correctness, hallucinated values), and the failure modes specific to it (saturation, performance, tokenizer alignment).
  • You have a concrete production rubric: hosted API native modes first, Outlines/XGrammar for self-hosted, instructor for the Pythonic abstraction.

The bigger transfer: picking the strongest mechanism is a one-time decision per project, and the strongest mechanism eliminates a class of bugs entirely. Most production AI failures trace back to "we used the weaker mechanism because it was familiar." Don't.

Next: Workshop 7 - Multi-agent supervisor pattern, where a coordinator agent dispatches to specialists and merges their results - and you watch the orchestration graph fire on a real task.

Submit your build

When you finish this workshop, share what you built so others can see and learn from your work. Include:

  • Public repo with the four extraction pipelines (naive prompt, Anthropic tools, OpenAI strict, Outlines local)
  • A measured failure-rate table across all four on a 100-document benchmark
  • A screenshot or log of the logit mask suppressing 151,935 of 151,936 tokens at the first decode step
  • Short note (3 to 5 sentences) on which mechanism you'd ship for your team's most schema-sensitive feature and why
