
Month 8-Week 3: DPO and preference data

Week summary

  • Goal: Read the DPO paper. Build a preference dataset. DPO-train your week-2 SFT model. Eval base vs SFT vs SFT+DPO.
  • Time: ~10 h over 3 sessions.
  • Output: Preference dataset; DPO-trained adapter; 3-way eval comparison.
  • Sequences relied on: 15-fine-tuning rungs 06, 07, 08; 03-probability-statistics rung 07 (KL).

Why this week matters

DPO is the dominant alignment method post-2023. Knowing it (including the math) separates "applied AI engineer" from "engineer who calls fine-tune endpoints." The DPO derivation is a beautiful piece of math-the implicit-reward trick that lets you skip the separate reward model. GRPO (next month) builds on this lineage.

Prerequisites

  • M08-W02 complete with SFT model.

Sessions

  • Session A-Tue/Wed evening (~3 h): DPO paper + RLHF context
  • Session B-Sat morning (~4 h): preference data + DPO training
  • Session C-Sun afternoon (~3 h): 3-way eval + reflection

Session A-DPO paper + RLHF context

Goal: Read DPO. Understand the math. Compare to PPO-based RLHF.

Part 1-DPO paper (90 min)

Read: DPO (arxiv.org/abs/2305.18290). All sections.

The setup:

  • Standard RLHF: train a reward model on preference data; use PPO to optimize the policy against it; add a KL penalty against a reference model to prevent drift.
  • DPO observation: the optimal policy under that objective has a closed form. We can train directly on preference data without an explicit reward model.
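
Why the closed form helps (the paper's derivation in one step): the policy that maximizes expected reward minus the β-weighted KL penalty is

π*(y|x) = (1/Z(x)) π_ref(y|x) exp(r(x,y)/β)

Inverting gives r(x,y) = β log(π*(y|x)/π_ref(y|x)) + β log Z(x). Substitute this into the Bradley-Terry preference model and the intractable log Z(x) cancels, because both responses share the same prompt x. What remains is the loss below.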

The DPO loss:

L_DPO = -log σ( β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)) )
where y_w is the preferred response, y_l is the rejected, π_θ is your policy, π_ref is the frozen reference, and β is a hyperparameter.

Read it slowly. β controls the anchoring strength: higher β penalizes drift from the reference more heavily, so the preference data moves the policy less; lower β trusts the preference data more and allows larger drift.
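
To make the loss concrete, here is a minimal sketch in PyTorch. It assumes you have already summed per-token log-probs over each full response under both the policy and the frozen reference; the function name and signature are illustrative, not TRL's API.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """All inputs are shape-(batch,) tensors of summed log-probs."""
    # Implicit reward for each response: beta * log(pi_theta / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid of the reward margin; small when chosen beats rejected.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

Note there is no sampling loop and no reward model call-just forward passes on logged pairs, which is why DPO trains as simply as supervised learning.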

Part 2-RLHF context (60 min)

Read: InstructGPT (arxiv.org/abs/2203.02155) sections on reward modeling and PPO.

DPO removes the reward model. Why is that good?

  • Less infrastructure (one less model to train and maintain).
  • Less hyperparameter sensitivity.
  • More stable training (PPO is famously finicky).

DPO's tradeoff: less expressive than full RLHF for very complex reward landscapes. For most applied teams, DPO is the default in 2026.

Part 3-Constitutional AI (30 min)

Read abstract + introduction of Constitutional AI (arxiv.org/abs/2212.08073). Anthropic's approach combines RLHF with self-generated critiques.

You don't need depth here-just awareness that alignment methods vary by lab.

Output of Session A

  • DPO paper notes (~600 words).
  • InstructGPT notes.
  • Mental model of where DPO fits in the alignment landscape.

Session B-Preference data + DPO training

Goal: Build preference data. Train DPO on top of SFT model.

Part 1-Preference dataset (90 min)

A preference triple: (prompt, chosen_response, rejected_response).

Two approaches:

Synthetic via judge model:

  1. Take 200 prompts from your domain.
  2. Generate two responses per prompt with your SFT model (different temperatures, or different few-shot examples).
  3. Use a stronger judge (Claude Opus 4.7) to pick the preferred response.
  4. Save as triples.

Public dataset:

  • `HuggingFaceH4/ultrafeedback_binarized` - ~60K triples.
  • Use this if your domain doesn't let you generate synthetic data quickly.

For learning purposes, build ~500 synthetic triples; even small datasets show DPO's effect. A minimal loop:

def build_preference_dataset(prompts, sft_model):
    """Build (prompt, chosen, rejected) triples: sample the SFT model twice
    per prompt, then let a judge model pick the winner."""
    triples = []
    for p in prompts:
        # Sample at two temperatures so the pair actually differs.
        a = generate(sft_model, p, temp=0.7)
        b = generate(sft_model, p, temp=1.2)
        chosen, rejected = judge_pick(p, a, b)  # judge picks preferred; sketch below
        triples.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return triples
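
generate is whatever sampling helper your harness uses. For judge_pick, here is a hedged sketch using the Anthropic Python SDK; the model id is a placeholder, and you would want sturdier output parsing in practice:

import random
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_pick(prompt, a, b, model="claude-judge-placeholder"):
    # Randomize presentation order to reduce the judge's position bias.
    first, second = (a, b) if random.random() < 0.5 else (b, a)
    msg = client.messages.create(
        model=model,
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                f"Prompt:\n{prompt}\n\nResponse 1:\n{first}\n\n"
                f"Response 2:\n{second}\n\n"
                "Which response is better? Answer with exactly '1' or '2'."
            ),
        }],
    )
    winner = first if msg.content[0].text.strip().startswith("1") else second
    loser = second if winner is first else first
    return winner, loser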

Part 2-DPO training (105 min)

from trl import DPOConfig, DPOTrainer

dpo_cfg = DPOConfig(
    output_dir="dpo-out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,  # much lower than SFT
    beta=0.1,            # KL strength
    bf16=True,
    report_to="wandb",
)

trainer = DPOTrainer(
    model=sft_model,             # SFT model with the LoRA adapter attached
    ref_model=None,              # for a PEFT model, TRL uses the adapter-disabled base as reference
    args=dpo_cfg,
    train_dataset=pref_dataset,  # expects "prompt", "chosen", "rejected" columns
    tokenizer=tokenizer,
)
trainer.train()

Monitor:

  • Reward gap (rewards/chosen - rewards/rejected): should grow.
  • KL from reference: should stay bounded; spikes indicate instability.
  • Validation reward gap: should also grow if training is generalizing.
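
One way to eyeball the reward gap after training-a sketch assuming TRL's default metric names (it logs rewards/margins for chosen minus rejected):

# Pull the reward-margin curve out of the trainer's log history.
margins = [
    (e.get("step"), e["rewards/margins"])
    for e in trainer.state.log_history
    if "rewards/margins" in e
]
for step, margin in margins:
    print(f"step {step}: reward gap {margin:.4f}")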

Part 3-Save + verify (45 min)

Save the new adapter. Generate samples on a held-out prompt; compare base, SFT, and SFT+DPO outputs side by side (see the sketch below). Look for:

  • DPO outputs more closely aligned with the preferences encoded in your data.
  • Format / tone / safety improvements (depending on what your judge preferred).
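
A hedged sketch of the side-by-side comparison, assuming both adapters were saved with PEFT; base_model, the adapter paths, and generate are placeholders from your setup:

from peft import PeftModel

# Attach both adapters to one base model and switch between them.
model = PeftModel.from_pretrained(base_model, "sft-out", adapter_name="sft")
model.load_adapter("dpo-out", adapter_name="dpo")

prompt = heldout_prompts[0]
with model.disable_adapter():  # plain base model
    print("base:", generate(model, prompt, temp=0.7))
for name in ("sft", "dpo"):
    model.set_adapter(name)
    print(f"{name}:", generate(model, prompt, temp=0.7))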

Output of Session B

  • 500-triple preference dataset.
  • DPO-trained adapter.

Session C-3-way eval + reflection

Goal: Compare base vs SFT vs SFT+DPO on your eval set.

Part 1-Run all 3 (75 min)

Use your eval harness from M06-W03. For each of 30 prompts:

  • Generate with the base model.
  • Generate with the SFT model.
  • Generate with the SFT+DPO model.

Score each with heuristic + judge, as in the sketch below.
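
A minimal sketch of the loop; generate, heuristic_score, and judge_score stand in for your M06-W03 harness functions:

def run_three_way_eval(prompts, models):
    # models: {"base": ..., "sft": ..., "sft_dpo": ...}
    rows = []
    for p in prompts:
        row = {"prompt": p}
        for name, m in models.items():
            out = generate(m, p, temp=0.7)
            row[name] = {
                "output": out,
                "heuristic": heuristic_score(p, out),
                "judge": judge_score(p, out),
            }
        rows.append(row)
    return rows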

Part 2-Aggregate (60 min)

Build the table:

| Metric | Base | SFT | SFT+DPO | Δ DPO over SFT |
|---|---|---|---|---|
| Format-conformance | 0.66 | 0.92 | 0.94 | +0.02 |
| Severity match | 0.71 | 0.78 | 0.81 | +0.03 |
| Faithfulness (judge) | 4.0 | 3.9 | 4.2 | +0.3 |
| Preferred (judge pairwise vs SFT) | - | - | 62% | +12pp |

Pairwise preference (DPO vs SFT) is a particularly informative metric here, since it directly measures what DPO was trained to optimize. If DPO doesn't win pairwise, the preference data was likely off.
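
Computing the win rate is trivial once you have one verdict per prompt (randomize answer order when judging, as in the judge_pick sketch, to control position bias):

def pairwise_win_rate(verdicts):
    # verdicts: list of "dpo", "sft", or "tie" judge calls, one per prompt
    wins = sum(v == "dpo" for v in verdicts)
    decided = sum(v in ("dpo", "sft") for v in verdicts)
    return wins / decided if decided else float("nan")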

Part 3-Reflection + push (45 min)

Write 300 words: "What DPO did to my model-and what it didn't."

Common observations:

  • DPO often helps format/style, less often factual accuracy.
  • Reward gap growth is necessary but not sufficient-must also see eval improvement.
  • β too high → no preference learning; β too low → KL drift.

Push v0.X to your fine-tuning repo.

Output of Session C

  • 3-way eval comparison.
  • Reflection on DPO's effect.

End-of-week artifact

  • DPO paper notes
  • 500+ preference triples
  • DPO-trained adapter
  • 3-way eval comparison
  • Reflection note

End-of-week self-assessment

  • I can derive (or follow the derivation of) the DPO loss.
  • I can build preference data from scratch.
  • I can interpret reward gap, KL divergence, and pairwise preference together.

Common failure modes for this week

  • β too high or too low. Defaults (0.1) are usually OK; iterate if KL diverges.
  • Pref data low quality. Garbage in, garbage out-judge quality matters.
  • Single-metric eval. DPO can win on one metric and lose on another. Look at multiple.

What's next (preview of M08-W04)

Self-host vs API economics blog post + GRPO preview (DeepSeek-R1 lineage).
