
15-Fine-tuning

Why this matters in the journey

Fine-tuning is the bridge from "user of frontier models" to "shaper of model behavior." You don't need to fine-tune in every project, but knowing how, and especially when not to, is core literacy. Modern fine-tuning (LoRA, QLoRA, DPO, GRPO) is approachable on a single GPU and produces real, measurable behavior change.

The rungs

Rung 01-When (not) to fine-tune

  • What: Fine-tuning is the wrong tool for adding knowledge (use RAG) and the right tool for changing behavior, format, or tone, or for specializing on a narrow task.
  • Why it earns its place: Most fine-tuning attempts fail because the wrong tool was picked.
  • Resource: OpenAI's fine-tuning best-practices doc. Plus AI Engineering (Huyen) chapter on fine-tuning vs RAG.
  • Done when: You can argue, in your own words, when fine-tuning is the right move.

Rung 02-Supervised fine-tuning (SFT)

  • What: Continue training a pre-trained model on (prompt, ideal_response) pairs. Cross-entropy loss, lower learning rate than pretraining. A minimal sketch follows this list.
  • Why it earns its place: SFT is the foundation. Every modern alignment recipe starts with SFT.
  • Resource: Hugging Face TRL library SFTTrainer docs. Plus the InstructGPT paper's SFT section (arxiv.org/abs/2203.02155).
  • Done when: You've SFT'd a small model on a task and observed behavior change.
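
A minimal sketch of that loop using TRL's SFTTrainer. The model and dataset names are placeholders, and TRL's argument names shift between versions, so treat this as the shape of the thing, not a recipe:

```python
# Minimal SFT sketch with Hugging Face TRL. Placeholders throughout; check
# the SFTTrainer docs for your installed TRL version.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # (prompt, response)-style data

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",          # any small causal LM works for a first run
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="sft-out",
        learning_rate=2e-5,             # far lower than pretraining learning rates
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
)
trainer.train()
```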

Rung 03-Parameter-efficient fine-tuning: LoRA

  • What: Freeze the base model. Add small low-rank update matrices. Train only those. Trainable parameters drop by orders of magnitude, which slashes optimizer-state memory (see the sketch below).
  • Why it earns its place: LoRA makes single-GPU fine-tuning of multi-billion-parameter models feasible.
  • Resource: LoRA paper (arxiv.org/abs/2106.09685). Plus Hugging Face PEFT library docs (huggingface.co/docs/peft).
  • Done when: You've LoRA fine-tuned a model and can explain rank, alpha, and target modules.
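
A minimal PEFT sketch showing the three knobs the "done when" asks about. The base model is a placeholder, and target-module names depend on the architecture:

```python
# Wrap a frozen base model with trainable low-rank adapters (LoRA).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder

config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor; updates are scaled by alpha / r
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of the base model
```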

Rung 04-QLoRA

  • What: LoRA but with the base model quantized to 4-bit. Fits a 65B-parameter model on a single 48GB GPU (see the sketch below).
  • Why it earns its place: This is what makes fine-tuning of large models accessible.
  • Resource: QLoRA paper (arxiv.org/abs/2305.14314). Plus the bitsandbytes integration docs.
  • Done when: You've QLoRA fine-tuned a 7B model on a single GPU.
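
The same idea with a 4-bit base. A minimal sketch combining a bitsandbytes 4-bit load with LoRA adapters; the model name is a placeholder:

```python
# QLoRA sketch: 4-bit quantized base weights plus LoRA adapters trained on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, introduced in the QLoRA paper
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",          # placeholder 7B model
    quantization_config=bnb,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)
model = get_peft_model(
    base,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
```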

Rung 05-Data curation for fine-tuning

  • What: Quality > quantity. Diverse, clean, well-formatted, deduplicated. Synthetic data generation is a real technique. A toy dedup pass is sketched below.
  • Why it earns its place: Most fine-tuning failures are data failures. Curation is the hidden bottleneck.
  • Resource: LIMA: Less Is More for Alignment paper (arxiv.org/abs/2305.11206). Plus Hugging Face's data filtering / cleaning docs.
  • Done when: You can articulate a curation pipeline and explain why dedup matters.
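
A toy curation pass, assuming simple (prompt, response) dicts. Real pipelines add near-duplicate detection (e.g. MinHash) and task-specific quality filters; this shows only the shape:

```python
# Toy curation: normalize, filter, exact-dedup.
import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())     # lowercase, collapse whitespace

def curate(examples):
    seen, kept = set(), []
    for ex in examples:
        if len(ex["response"]) < 20:          # drop trivially short responses
            continue
        key = hashlib.sha256(
            normalize(ex["prompt"] + ex["response"]).encode()
        ).hexdigest()
        if key in seen:                       # exact duplicate of something kept
            continue
        seen.add(key)
        kept.append(ex)
    return kept
```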

Rung 06-Direct Preference Optimization (DPO)

  • What: Train on (prompt, preferred_response, rejected_response) triples. The loss is derived from the RLHF objective, but no separate reward model is needed; see the sketch below.
  • Why it earns its place: DPO is the dominant alignment method post-2023. Simpler than PPO, and often competitive or better.
  • Resource: DPO paper (arxiv.org/abs/2305.18290). HF TRL DPOTrainer docs. Plus a clear blog: search "DPO explained".
  • Done when: You've DPO'd a small model and can derive the loss function.
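
The loss itself fits in a few lines. A sketch, assuming you already have the summed log-probabilities of each response under the policy and the frozen reference model (in practice, TRL's DPOTrainer computes all of this for you):

```python
# DPO loss: -log sigmoid(beta * (log-ratio(chosen) - log-ratio(rejected))).
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of a response: beta * log(pi_theta / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the log-odds that the preferred response scores higher.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```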

Rung 07-GRPO and modern RL fine-tuning

  • What: Group Relative Policy Optimization (DeepSeek). Sample multiple completions per prompt; each one's advantage is computed relative to the group's own reward statistics, so no separate value model is needed (sketched below).
  • Why it earns its place: GRPO is what powers reasoning models like DeepSeek-R1. The 2024–2026 frontier of post-training.
  • Resource: The DeepSeekMath paper that introduced GRPO (arxiv.org/abs/2402.03300). Plus the DeepSeek-V3 / R1 technical reports and the GRPO discussion in TRL docs.
  • Done when: You can explain GRPO's advantages over PPO at a conceptual level.
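
The core trick in miniature: the group's own reward statistics replace a learned value model as the baseline. This is a conceptual sketch of the advantage computation only, not the full objective (which also has a PPO-style clipped ratio and a KL penalty against a reference model):

```python
# Group-relative advantages: standardize rewards within one prompt's group.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_completions,) scores for one prompt's sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Completions above the group mean get positive advantage, below get negative.
print(group_relative_advantages(torch.tensor([1.0, 0.0, 0.5, 1.0])))
```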

Rung 08-Reward modeling

  • What: Train a separate model to predict "which response is better." Used in classical RLHF; its pairwise loss is sketched below.
  • Why it earns its place: Even DPO replaces this with a math trick; you should know what the trick replaced.
  • Resource: InstructGPT paper sections on reward modeling. Plus the Anthropic Constitutional AI paper (arxiv.org/abs/2212.08073).
  • Done when: You can explain why a reward model is needed in PPO-style RLHF and how DPO sidesteps it.
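
The training loss is the standard pairwise (Bradley-Terry-style) objective: push the scalar score of the preferred response above the rejected one. A sketch:

```python
# Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    # score_*: scalar outputs of the reward model for each response in the pair.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

Compare with the DPO sketch in Rung 06: DPO's "trick" is folding this same preference objective into the policy's own log-probabilities, so no separate scorer is trained.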

Rung 09-Eval for fine-tuned models

  • What: Pre-fine-tune eval on a frozen set. Post-fine-tune eval on the same set. Held-out generalization eval. Catastrophic forgetting check. A minimal harness is sketched below.
  • Why it earns its place: Without eval, you don't know if your fine-tune helped or hurt.
  • Resource: Sequence 12 evals + OpenAI's eval examples for fine-tuning.
  • Done when: Your fine-tune project has before/after evals on three task types.
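
A harness in miniature; `generate` and `score` stand in for your model call and metric, and the eval sets should include the target task, a held-out split, and an unrelated task for the forgetting check:

```python
# Run the same frozen eval sets against both checkpoints so any delta is
# attributable to the fine-tune.
def run_evals(generate, eval_sets, score):
    """generate: prompt -> completion; eval_sets: {name: [(prompt, reference), ...]}."""
    return {
        name: sum(score(generate(p), ref) for p, ref in examples) / len(examples)
        for name, examples in eval_sets.items()
    }

# Usage sketch (names hypothetical):
# before = run_evals(base_generate, eval_sets, exact_match)
# after  = run_evals(tuned_generate, eval_sets, exact_match)
# Compare per-set: the target task should improve; the held-out and
# unrelated sets should not regress.
```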

Rung 10-Catastrophic forgetting and continual learning

  • What: Fine-tuning on task A often degrades performance on task B. Mitigations: replay buffers, EWC (elastic weight consolidation), mixture training; one is sketched below.
  • Why it earns its place: A common production failure. Worth knowing the failure mode and the standard mitigations.
  • Resource: Continual Learning for Foundation Models survey (search arxiv).
  • Done when: You can describe one mitigation strategy.
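
One mitigation sketched with the Hugging Face datasets library: mixture/replay training, where a slice of general-purpose data keeps flowing through the fine-tuning stream. The dataset names are hypothetical and the replay ratio is empirical:

```python
# Replay / mixture training: rehearse general data while fine-tuning on the task.
from datasets import interleave_datasets, load_dataset

task_data = load_dataset("your-org/task-a", split="train")          # hypothetical
general_data = load_dataset("your-org/general-mix", split="train")  # hypothetical

# ~20% replay of general data; tune the ratio against your forgetting evals.
train_data = interleave_datasets(
    [task_data, general_data], probabilities=[0.8, 0.2], seed=42
)
```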

Rung 11-Open-source fine-tuning ecosystem

  • What: Axolotl (config-driven), Unsloth (speed- and memory-optimized kernels), TRL (Hugging Face's official trainers), LLaMA-Factory (config-driven with a web UI), Torchtune (PyTorch-native). Each has a niche.
  • Why it earns its place: Knowing the ecosystem accelerates picking the right tool.
  • Resource: Each project's GitHub README. Pick one to use deeply.
  • Done when: You've completed a fine-tuning run with at least one of these and can articulate when you'd use each.

Minimum required to leave this sequence

  • Articulate when to fine-tune vs RAG vs prompt-tune.
  • SFT a small model end-to-end.
  • LoRA fine-tune with PEFT.
  • QLoRA fine-tune a 7B model on a single GPU.
  • DPO fine-tune with TRL.
  • Before/after evals showing what changed.

Going further

  • Sebastian Raschka's blog posts on fine-tuning (sebastianraschka.com).
  • Hugging Face TRL examples-read every example script.
  • Axolotl docs and example configs-learn one config-driven workflow well.

How this sequence connects to the year

  • Month 8: This sequence IS the bulk of month 8.
  • Q3 (any track): Fine-tuning literacy is universal.
  • Capstone: A fine-tune + eval is a strong public artifact.
