Month 8-Week 3: DPO and preference data¶
Week summary¶
- Goal: Read the DPO paper. Build a preference dataset. DPO-train your week-2 SFT model. Eval base vs SFT vs SFT+DPO.
- Time: ~10 h over 3 sessions.
- Output: Preference dataset; DPO-trained adapter; 3-way eval comparison.
- Sequences relied on: 15-fine-tuning rungs 06, 07, 08; 03-probability-statistics rung 07 (KL).
Why this week matters¶
DPO is the dominant alignment method post-2023. Knowing it (including the math) separates the "applied AI engineer" from the "engineer who calls fine-tune endpoints." The DPO derivation is a beautiful piece of math: the implicit-reward trick lets you skip training a separate reward model. GRPO (next month) builds on this lineage.
Prerequisites¶
- M08-W02 complete with SFT model.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): DPO paper + RLHF context
- Session B-Sat morning (~4 h): preference data + DPO training
- Session C-Sun afternoon (~3 h): 3-way eval + reflection
Session A-DPO paper + RLHF context¶
Goal: Read DPO. Understand the math. Compare to PPO-based RLHF.
Part 1-DPO paper (90 min)¶
Read: DPO (arxiv.org/abs/2305.18290). All sections.
The setup:

- Standard RLHF: train a reward model on preference data; use PPO to optimize the policy against it; add a KL penalty against a reference model to prevent drift.
- DPO's observation: the optimal policy under that objective has a closed form. We can train directly on preference data without an explicit reward model.
The DPO loss:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where y_w is the preferred response, y_l is the rejected one, π_θ is your policy, π_ref is the frozen reference, and β is a hyperparameter.
Read it slowly. The β factor controls how much we trust the preference data vs how much we anchor to the reference.
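To make the loss concrete, here is a minimal numerical sketch of the per-example DPO loss on toy summed log-probabilities. The function name and inputs are illustrative, not from any library; σ is the logistic sigmoid.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from summed sequence log-probs.

    logp_w / logp_l     : log pi_theta(y_w|x) and log pi_theta(y_l|x) (policy)
    ref_logp_w / ref_logp_l : same quantities under the frozen reference
    """
    # Implicit rewards: beta times the log-ratio against the reference.
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # -log sigmoid(margin): shrinks as the chosen-minus-rejected margin grows.
    margin = r_w - r_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy == reference, both implicit rewards are 0 and the loss is log 2.
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - math.log(2.0)) < 1e-9
```

Note the anchoring effect: the policy only lowers the loss by moving its log-probs *relative to the reference*, which is exactly the KL tether described above.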
Part 2-RLHF context (60 min)¶
Read: InstructGPT (arxiv.org/abs/2203.02155) sections on reward modeling and PPO.
DPO removes the reward model. Why is that good?

- Less infrastructure (one fewer model to train and maintain).
- Less hyperparameter sensitivity.
- More stable training (PPO is famously finicky).
DPO's tradeoff: less expressive than full RLHF for very complex reward landscapes. For most applied teams, DPO is the default in 2026.
Part 3-Constitutional AI (30 min)¶
Read abstract + introduction of Constitutional AI (arxiv.org/abs/2212.08073). Anthropic's approach combines RLHF with self-generated critiques.
You don't need depth here-just awareness that alignment methods vary by lab.
Output of Session A¶
- DPO paper notes (~600 words).
- InstructGPT notes.
- Mental model of where DPO fits in the alignment landscape.
Session B-Preference data + DPO training¶
Goal: Build preference data. Train DPO on top of SFT model.
Part 1-Preference dataset (90 min)¶
A preference triple: (prompt, chosen_response, rejected_response).
Two approaches:
Synthetic via judge model:

1. Take 200 prompts from your domain.
2. Generate two responses per prompt with your SFT model (different temperatures, or different few-shot examples).
3. Use a stronger judge (Claude Opus 4.7) to pick the preferred response.
4. Save as triples.
Public dataset:

- `HuggingFaceH4/ultrafeedback_binarized` - ~60K triples.
- Use this if you can't generate synthetic data for your domain quickly.
For learning purposes, build ~500 synthetic triples. Even small datasets show DPO's effect.
```python
def build_preference_dataset(prompts, sft_model):
    """Build (prompt, chosen, rejected) triples from two sampled responses."""
    triples = []
    for p in prompts:
        # Two candidates at different temperatures for response diversity.
        a = generate(sft_model, p, temp=0.7)
        b = generate(sft_model, p, temp=1.2)
        # judge_pick asks the judge model (Claude Opus) which response is
        # preferred and returns (preferred, rejected).
        chosen, rejected = judge_pick(p, a, b)
        triples.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return triples
```
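Before training on the triples, it is worth a quick sanity pass: DPO silently learns nothing from malformed or degenerate pairs. A small illustrative checker (hypothetical helper, not part of any library):

```python
def validate_triples(triples):
    """Return indices of triples that are missing fields or degenerate
    (chosen == rejected), which contribute no preference signal."""
    required = {"prompt", "chosen", "rejected"}
    bad = []
    for i, t in enumerate(triples):
        if not required <= set(t) or t["chosen"] == t["rejected"]:
            bad.append(i)
    return bad

ok = {"prompt": "p", "chosen": "good answer", "rejected": "worse answer"}
dup = {"prompt": "p", "chosen": "same", "rejected": "same"}
print(validate_triples([ok, dup]))  # flags only the degenerate pair
```

Drop (or regenerate) flagged triples before handing the dataset to the trainer.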
Part 2-DPO training (105 min)¶
```python
from trl import DPOConfig, DPOTrainer

dpo_cfg = DPOConfig(
    output_dir="dpo-out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,  # much lower than SFT
    beta=0.1,            # KL strength
    bf16=True,
    report_to="wandb",
)

trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,  # uses the model with the adapter disabled as the reference
    args=dpo_cfg,
    train_dataset=pref_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```
Monitor:
- Reward gap (rewards/chosen - rewards/rejected): should grow.
- KL from reference: should stay bounded; spikes indicate instability.
- Validation reward gap: should also grow if training is generalizing.
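The reward-gap metric is just the β-scaled log-ratio from the loss above, so you can reproduce it by hand from per-sequence log-probs. A sketch with toy numbers (the function and values are illustrative):

```python
def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """DPO's implicit reward per sequence: beta * (log pi_theta - log pi_ref)."""
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

# Toy batch: chosen responses drifted up in probability, rejected drifted down.
chosen = implicit_rewards([-9.0, -8.5], [-10.0, -10.0])
rejected = implicit_rewards([-12.5, -13.0], [-12.0, -12.0])

# The reward gap: mean(chosen reward - rejected reward) over the batch.
gap = sum(c - r for c, r in zip(chosen, rejected)) / len(chosen)
print(gap)  # in healthy training, this margin grows over steps
```

A growing gap only says the model separates your chosen/rejected pairs; pair it with the held-out eval to check that the separation reflects real quality.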
Part 3-Save + verify (45 min)¶
Save the new adapter. Generate samples on held-out prompts; compare base, SFT, and SFT+DPO outputs side by side. Look for:

- DPO outputs more aligned with the preferences encoded in your data.
- Format / tone / safety improvements (depending on what your judge preferred).
Output of Session B¶
- 500-triple preference dataset.
- DPO-trained adapter.
Session C-3-way eval + reflection¶
Goal: Compare base vs SFT vs SFT+DPO on your eval set.
Part 1-Run all 3 (75 min)¶
Use your eval harness from M06-W03. For each of 30 prompts:

- Generate with the base model.
- Generate with the SFT model.
- Generate with the SFT+DPO model.
Score each with heuristic + judge.
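The 3-way pass is a simple loop over model variants; a sketch, where `generate` and `score` stand in for hooks from your M06-W03 harness (both hypothetical here):

```python
def run_three_way(prompts, models, generate, score):
    """Generate and score each prompt under every model variant.

    models   : dict mapping a name to a model,
               e.g. {"base": ..., "sft": ..., "sft_dpo": ...}
    generate : hypothetical hook (model, prompt) -> output text
    score    : hypothetical hook (prompt, output) -> heuristic/judge score
    """
    rows = []
    for p in prompts:
        row = {"prompt": p}
        for name, model in models.items():
            out = generate(model, p)
            row[name] = {"output": out, "score": score(p, out)}
        rows.append(row)
    return rows
```

Keeping all three variants in one row per prompt makes the aggregation in Part 2 (and spot-checking individual prompts) straightforward.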
Part 2-Aggregate (60 min)¶
Build the table:

| Metric | Base | SFT | SFT+DPO | Δ DPO over SFT |
|---|---|---|---|---|
| Format conformance | 0.66 | 0.92 | 0.94 | +0.02 |
| Severity match | 0.71 | 0.78 | 0.81 | +0.03 |
| Faithfulness (judge) | 4.0 | 3.9 | 4.2 | +0.3 |
| Preferred (judge pairwise vs SFT) | - | - | 62% | +12pp |
Pairwise preference (DPO vs SFT) is a particularly informative metric for DPO. If DPO doesn't win pairwise, the preference data was off.
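When computing the pairwise number, exclude ties from the denominator so a hedging judge doesn't dilute the win rate. A minimal sketch (illustrative helper, assuming the judge emits one of three verdict strings):

```python
def pairwise_win_rate(judge_calls):
    """judge_calls: list of 'dpo' / 'sft' / 'tie' verdicts from the judge.

    Returns DPO's share of decisive (non-tie) comparisons, or None if
    every comparison was a tie."""
    decisive = [v for v in judge_calls if v != "tie"]
    if not decisive:
        return None
    return sum(v == "dpo" for v in decisive) / len(decisive)

verdicts = ["dpo"] * 18 + ["sft"] * 10 + ["tie"] * 2
print(pairwise_win_rate(verdicts))  # fraction of decisive comparisons won by DPO
```

With 30 prompts the confidence interval is wide, so treat anything near 50% as a tie rather than a win.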
Part 3-Reflection + push (45 min)¶
Write 300 words: "What DPO did to my model-and what it didn't."
Common observations:

- DPO often helps format/style, less often factual accuracy.
- Reward-gap growth is necessary but not sufficient; you must also see eval improvement.
- β too high → no preference learning; β too low → KL drift.
Push v0.X to your fine-tuning repo.
Output of Session C¶
- 3-way eval comparison.
- Reflection on DPO's effect.
End-of-week artifact¶
- DPO paper notes
- 500+ preference triples
- DPO-trained adapter
- 3-way eval comparison
- Reflection note
End-of-week self-assessment¶
- I can derive (or follow the derivation of) the DPO loss.
- I can build preference data from scratch.
- I can interpret reward gap, KL divergence, and pairwise preference together.
Common failure modes for this week¶
- β too high or too low. Defaults (0.1) are usually OK; iterate if KL diverges.
- Pref data low quality. Garbage in, garbage out-judge quality matters.
- Single-metric eval. DPO can win on one metric and lose on another. Look at multiple.
What's next (preview of M08-W04)¶
Self-host vs API economics blog post + GRPO preview (DeepSeek-R1 lineage).