15-Fine-tuning¶
Why this matters in the journey¶
Fine-tuning is the bridge from "user of frontier models" to "shaper of model behavior." You don't need to fine-tune in every project, but knowing how, and especially when not to, is core literacy. Modern fine-tuning (LoRA, QLoRA, DPO, GRPO) is approachable on a single GPU and produces real, measurable behavior change.
The rungs¶
Rung 01-When (not) to fine-tune¶
- What: Fine-tuning is the wrong tool for adding knowledge (use RAG) and the right tool for changing behavior, format, or tone, or for specializing on a narrow task.
- Why it earns its place: Most fine-tuning attempts fail because the wrong tool was picked.
- Resource: OpenAI's fine-tuning best-practices doc. Plus AI Engineering (Huyen) chapter on fine-tuning vs RAG.
- Done when: You can argue, in your own words, when fine-tuning is the right move.
Rung 02-Supervised fine-tuning (SFT)¶
- What: Continue training a pre-trained model on (prompt, ideal_response) pairs. Cross-entropy loss, lower learning rate than pretraining. A sketch follows this rung.
- Why it earns its place: SFT is the foundation. Every modern alignment recipe starts with SFT.
- Resource: Hugging Face TRL library SFTTrainer docs. Plus the InstructGPT paper's SFT section (arxiv.org/abs/2203.02155).
- Done when: You've SFT'd a small model on a task and observed behavior change.
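A minimal sketch of an SFT run with TRL. The model name, toy dataset, and hyperparameters are placeholders, and the exact SFTTrainer / SFTConfig arguments vary across TRL releases, so treat this as an outline rather than copy-paste-ready code:

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy (prompt, ideal_response) pairs formatted into a single "text" field.
# Real runs use hundreds to thousands of curated examples.
pairs = [
    {"prompt": "Summarize: The cat sat on the mat.", "ideal_response": "A cat sat on a mat."},
]
train_dataset = Dataset.from_list(
    [{"text": f"### Prompt:\n{p['prompt']}\n### Response:\n{p['ideal_response']}"} for p in pairs]
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",           # any small causal LM works for a first run
    args=SFTConfig(
        output_dir="sft-demo",
        learning_rate=2e-5,              # much lower than pretraining learning rates
        num_train_epochs=1,
        per_device_train_batch_size=1,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```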
Rung 03-Parameter-efficient fine-tuning: LoRA¶
- What: Freeze the base model. Add small low-rank update matrices. Train only those, cutting trainable parameters, and with them optimizer memory, by orders of magnitude. See the sketch after this rung.
- Why it earns its place: LoRA makes single-GPU fine-tuning of multi-billion-parameter models feasible.
- Resource: LoRA paper (arxiv.org/abs/2106.09685). Plus Hugging Face PEFT library docs (huggingface.co/docs/peft).
- Done when: You've LoRA fine-tuned a model and can explain rank, alpha, and target modules.
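A minimal LoRA setup with Hugging Face PEFT. The model name and hyperparameter values are illustrative, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling: the update is applied as (alpha / r) * BA
    target_modules=["q_proj", "v_proj"],   # which weight matrices get adapters (names are model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # base weights are frozen; only the adapters train
model.print_trainable_parameters()          # typically well under 1% of total parameters
```

The trained adapter can later be merged back into the base weights (PEFT's merge_and_unload) or shipped separately as a small standalone file.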
Rung 04-QLoRA¶
- What: LoRA but with the base model quantized to 4-bit. Fits a 65B-parameter model on a single 48GB GPU. See the sketch after this rung.
- Why it earns its place: This is what makes fine-tuning of large models accessible.
- Resource: QLoRA paper (arxiv.org/abs/2305.14314). Plus the bitsandbytes integration docs.
- Done when: You've QLoRA fine-tuned a 7B model on a single GPU.
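A minimal QLoRA loading sketch, assuming transformers, peft, and bitsandbytes are installed and a CUDA GPU is available; the model name and settings are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, introduced in the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # any ~7B causal LM
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
# From here, training proceeds like a plain LoRA run: only the adapters receive gradients.
```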
Rung 05-Data curation for fine-tuning¶
- What: Quality > quantity. Diverse, clean, well-formatted, deduplicated. Synthetic data generation is a legitimate technique. See the dedup sketch after this rung.
- Why it earns its place: Most fine-tuning failures are data failures. Curation is the hidden bottleneck.
- Resource: LIMA: Less Is More for Alignment paper (arxiv.org/abs/2305.11206). Plus Hugging Face's data filtering / cleaning docs.
- Done when: You can articulate a curation pipeline and explain why dedup matters.
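A minimal exact-match dedup sketch. The field names are hypothetical, and real pipelines usually add near-duplicate detection (e.g. MinHash) on top of this:

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't hide duplicates.
    return " ".join(text.lower().split())

def dedup(examples: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256(
            (normalize(ex["prompt"]) + "\x00" + normalize(ex["response"])).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

examples = [
    {"prompt": "What is LoRA?", "response": "A parameter-efficient fine-tuning method."},
    {"prompt": "What is  LoRA?", "response": "A parameter-efficient fine-tuning method."},  # formatting-only duplicate
]
print(len(dedup(examples)))  # 1
```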
Rung 06-Direct Preference Optimization (DPO)¶
- What: Train on (prompt, preferred_response, rejected_response) triples. The loss is derived from the RLHF objective, but no separate reward model is needed. See the sketch after this rung.
- Why it earns its place: DPO is the dominant alignment method post-2023. Simpler than PPO, often better.
- Resource: DPO paper (arxiv.org/abs/2305.18290). HF TRL DPOTrainer docs. Plus a clear blog post: search "DPO explained".
- Done when: You've DPO'd a small model and can derive the loss function.
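The DPO loss is short enough to write out. A sketch in plain PyTorch, assuming you already have per-sequence log-probabilities from the policy and the frozen reference model (in practice DPOTrainer computes these for you):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios between policy and reference for the preferred and rejected responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): push the preferred response's log-ratio above the rejected one's.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```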
Rung 07-GRPO and modern RL fine-tuning¶
- What: Group Relative Policy Optimization (from DeepSeek). Sample multiple completions per prompt; the reward signal compares completions within the group, so no separate value model is needed. See the sketch after this rung.
- Why it earns its place: GRPO is what powers reasoning models like DeepSeek-R1. The 2024–2026 frontier of post-training.
- Resource: DeepSeek-V3 / R1 technical reports. Plus the GRPO discussion in TRL docs.
- Done when: You can explain GRPO's advantages over PPO at a conceptual level.
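A sketch of GRPO's core idea, group-relative advantages, assuming you already have a scalar reward for each sampled completion. Standardizing rewards within each group is what replaces PPO's learned value model:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (num_prompts, group_size), one scalar per sampled completion."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Each completion is judged relative to its own group, not against an absolute baseline.
    return (rewards - mean) / (std + 1e-6)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],    # two of four completions solved the prompt
                        [0.0, 0.0, 0.0, 1.0]])   # only one did
print(grpo_advantages(rewards))
```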
Rung 08-Reward modeling¶
- What: Train a separate model to predict which of two responses is better. Used in classical RLHF. See the sketch after this rung.
- Why it earns its place: Even DPO only replaces this with a mathematical shortcut; you should know what trick was replaced.
- Resource: InstructGPT paper sections on reward modeling. Plus the Anthropic Constitutional AI paper (arxiv.org/abs/2212.08073).
- Done when: You can explain why a reward model is needed in PPO-style RLHF and how DPO sidesteps it.
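The classical reward-model objective is a pairwise (Bradley-Terry style) loss, as used in InstructGPT. A sketch assuming the model emits a scalar reward per response:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Maximize the margin r(chosen) - r(rejected); equivalently, minimize -log sigmoid of it.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with hypothetical scalar rewards for three preference pairs.
print(reward_model_loss(torch.tensor([1.2, 0.4, 0.9]), torch.tensor([0.3, 0.5, -0.2])))
```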
Rung 09-Eval for fine-tuned models¶
- What: Pre-fine-tune eval set. Post-fine-tune eval set. Held-out generalization eval. Catastrophic forgetting check. A minimal harness sketch follows this rung.
- Why it earns its place: Without eval, you don't know if your fine-tune helped or hurt.
- Resource: Sequence 12 evals + OpenAI's eval examples for fine-tuning.
- Done when: Your fine-tune project has before/after evals on three task types.
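A minimal before/after eval harness sketch; generate is any hypothetical callable wrapping your model, and the exact-match grading should be swapped for whatever grader fits the task:

```python
def run_eval(generate, eval_set):
    """generate: callable taking a prompt string and returning the model's output string."""
    correct = sum(
        generate(ex["prompt"]).strip() == ex["expected"].strip()
        for ex in eval_set
    )
    return correct / len(eval_set)

# Run identical held-out sets against the base and the fine-tuned model:
# a target-task set, a generalization set, and an unrelated-task set
# (the last one is the catastrophic forgetting check).
# before = {name: run_eval(base_generate, s) for name, s in eval_sets.items()}
# after  = {name: run_eval(tuned_generate, s) for name, s in eval_sets.items()}
```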
Rung 10-Catastrophic forgetting and continual learning¶
- What: Fine-tuning on task A often degrades performance on task B. Mitigations: replay buffers, EWC (elastic weight consolidation), and mixture training. See the sketch after this rung.
- Why it earns its place: A common production failure. Worth knowing the failure mode and the standard mitigations.
- Resource: Continual Learning for Foundation Models survey (search arxiv).
- Done when: You can describe one mitigation strategy.
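A sketch of the simplest mitigation, replay / mixture training: keep a slice of general-purpose data in the fine-tuning mix so the model keeps seeing the distribution it would otherwise drift away from. The function name and ratio are illustrative:

```python
import random

def mix_with_replay(task_data: list, general_data: list, replay_fraction: float = 0.2, seed: int = 0) -> list:
    """Blend a fraction (relative to the task data size) of general data into the fine-tuning mix."""
    rng = random.Random(seed)
    n_replay = min(int(len(task_data) * replay_fraction), len(general_data))
    mixed = list(task_data) + rng.sample(general_data, n_replay)
    rng.shuffle(mixed)  # interleave so replay examples appear throughout training
    return mixed
```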
Rung 11-Open-source fine-tuning ecosystem¶
- What: Axolotl (config-driven), Unsloth (fast), TRL (HF official), LLaMA-Factory, Torchtune. Each has a niche.
- Why it earns its place: Knowing the ecosystem accelerates picking the right tool.
- Resource: Each project's GitHub README. Pick one to use deeply.
- Done when: You've completed a fine-tuning run with at least one of these and can articulate when you'd use each.
Minimum required to leave this sequence¶
- Articulate when to fine-tune vs RAG vs prompt-tune.
- SFT a small model end-to-end.
- LoRA fine-tune with PEFT.
- QLoRA fine-tune a 7B model on a single GPU.
- DPO fine-tune with TRL.
- Before/after evals showing what changed.
Going further¶
- Sebastian Raschka's blog posts on fine-tuning (sebastianraschka.com).
- Hugging Face TRL examples-read every example script.
- Axolotl docs and example configs-learn one config-driven workflow well.
How this sequence connects to the year¶
- Month 8: This sequence IS the bulk of month 8.
- Q3 (any track): Fine-tuning literacy is universal.
- Capstone: A fine-tune + eval is a strong public artifact.