
15-Fine-tuning

Why this matters in the journey

Fine-tuning is the bridge from "user of frontier models" to "shaper of model behavior." You don't need to fine-tune in every project, but knowing how, and especially when not to, is core literacy. Modern fine-tuning (LoRA, QLoRA, DPO, GRPO) is approachable on a single GPU and produces real, measurable behavior change.

The rungs

Rung 01-When (not) to fine-tune

  • What: Fine-tuning is the wrong tool for adding knowledge (use RAG) and the right tool for changing behavior, format, or tone, or for specializing on a narrow task.
  • Why it earns its place: Most fine-tuning attempts fail because the wrong tool was picked.
  • Resource: OpenAI's fine-tuning best-practices doc. Plus AI Engineering (Huyen) chapter on fine-tuning vs RAG.
  • Done when: You can argue, in your own words, when fine-tuning is the right move.

Rung 02-Supervised fine-tuning (SFT)

  • What: Continue training a pre-trained model on (prompt, ideal_response) pairs. Cross-entropy loss, lower learning rate than pretraining. A minimal sketch follows this list.
  • Why it earns its place: SFT is the foundation. Every modern alignment recipe starts with SFT.
  • Resource: Hugging Face TRL library SFTTrainer docs. Plus the InstructGPT paper's SFT section (arxiv.org/abs/2203.02155).
  • Done when: You've SFT'd a small model on a task and observed behavior change.
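
A minimal sketch of that loop using TRL's SFTTrainer. The model and dataset names are placeholders, and TRL's argument names shift between versions, so treat this as the shape of the thing, not a recipe:

```python
# Minimal SFT sketch with Hugging Face TRL. Placeholders throughout; check
# the SFTTrainer docs for your installed TRL version.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # (prompt, response)-style data

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",          # any small causal LM works for a first run
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="sft-out",
        learning_rate=2e-5,             # far lower than pretraining learning rates
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
)
trainer.train()
```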

Rung 03-Parameter-efficient fine-tuning: LoRA

  • What: Freeze the base model. Add small low-rank update matrices. Train only those. Trainable parameters drop by orders of magnitude, which slashes optimizer-state memory (see the sketch below).
  • Why it earns its place: LoRA makes single-GPU fine-tuning of multi-billion-parameter models feasible.
  • Resource: LoRA paper (arxiv.org/abs/2106.09685). Plus Hugging Face PEFT library docs (huggingface.co/docs/peft).
  • Done when: You've LoRA fine-tuned a model and can explain rank, alpha, and target modules.
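
A minimal PEFT sketch showing the three knobs the "done when" asks about. The base model is a placeholder, and target-module names depend on the architecture:

```python
# Wrap a frozen base model with trainable low-rank adapters (LoRA).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder

config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor; updates are scaled by alpha / r
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of the base model
```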

Rung 04-QLoRA

  • What: LoRA but with the base model quantized to 4-bit. Fits a 65B-parameter model on a single 48GB GPU (see the sketch below).
  • Why it earns its place: This is what makes fine-tuning of large models accessible.
  • Resource: QLoRA paper (arxiv.org/abs/2305.14314). Plus the bitsandbytes integration docs.
  • Done when: You've QLoRA fine-tuned a 7B model on a single GPU.
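
The same idea with a 4-bit base. A minimal sketch combining a bitsandbytes 4-bit load with LoRA adapters; the model name is a placeholder:

```python
# QLoRA sketch: 4-bit quantized base weights plus LoRA adapters trained on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, introduced in the QLoRA paper
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",          # placeholder 7B model
    quantization_config=bnb,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)
model = get_peft_model(
    base,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
```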

Rung 05-Data curation for fine-tuning

  • What: Quality > quantity. Diverse, clean, well-formatted, deduplicated. Synthetic data generation is a real technique. A toy dedup pass is sketched below.
  • Why it earns its place: Most fine-tuning failures are data failures. Curation is the hidden bottleneck.
  • Resource: LIMA: Less Is More for Alignment paper (arxiv.org/abs/2305.11206). Plus Hugging Face's data filtering / cleaning docs.
  • Done when: You can articulate a curation pipeline and explain why dedup matters.
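
A toy curation pass, assuming simple (prompt, response) dicts. Real pipelines add near-duplicate detection (e.g. MinHash) and task-specific quality filters; this shows only the shape:

```python
# Toy curation: normalize, filter, exact-dedup.
import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())     # lowercase, collapse whitespace

def curate(examples):
    seen, kept = set(), []
    for ex in examples:
        if len(ex["response"]) < 20:          # drop trivially short responses
            continue
        key = hashlib.sha256(
            normalize(ex["prompt"] + ex["response"]).encode()
        ).hexdigest()
        if key in seen:                       # exact duplicate of something kept
            continue
        seen.add(key)
        kept.append(ex)
    return kept
```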

Rung 06-Direct Preference Optimization (DPO)

  • What: Train on (prompt, preferred_response, rejected_response) triples. The loss is derived from the RLHF objective, but no separate reward model is needed; see the sketch below.
  • Why it earns its place: DPO is the dominant alignment method post-2023. Simpler than PPO, and often competitive or better.
  • Resource: DPO paper (arxiv.org/abs/2305.18290). HF TRL DPOTrainer docs. Plus a clear blog: search "DPO explained".
  • Done when: You've DPO'd a small model and can derive the loss function.
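
The loss itself fits in a few lines. A sketch, assuming you already have the summed log-probabilities of each response under the policy and the frozen reference model (in practice, TRL's DPOTrainer computes all of this for you):

```python
# DPO loss: -log sigmoid(beta * (log-ratio(chosen) - log-ratio(rejected))).
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of a response: beta * log(pi_theta / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the log-odds that the preferred response scores higher.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```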

Rung 07-GRPO and modern RL fine-tuning

  • What: Group Relative Policy Optimization (DeepSeek). Sample multiple completions per prompt; each one's advantage is computed relative to the group's own reward statistics, so no separate value model is needed (sketched below).
  • Why it earns its place: GRPO is what powers reasoning models like DeepSeek-R1. The 2024–2026 frontier of post-training.
  • Resource: The DeepSeekMath paper that introduced GRPO (arxiv.org/abs/2402.03300). Plus the DeepSeek-V3 / R1 technical reports and the GRPO discussion in TRL docs.
  • Done when: You can explain GRPO's advantages over PPO at a conceptual level.
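
The core trick in miniature: the group's own reward statistics replace a learned value model as the baseline. This is a conceptual sketch of the advantage computation only, not the full objective (which also has a PPO-style clipped ratio and a KL penalty against a reference model):

```python
# Group-relative advantages: standardize rewards within one prompt's group.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_completions,) scores for one prompt's sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Completions above the group mean get positive advantage, below get negative.
print(group_relative_advantages(torch.tensor([1.0, 0.0, 0.5, 1.0])))
```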

Rung 08-Reward modeling

  • What: Train a separate model to predict "which response is better." Used in classical RLHF; its pairwise loss is sketched below.
  • Why it earns its place: Even DPO replaces this with a math trick; you should know what the trick replaced.
  • Resource: InstructGPT paper sections on reward modeling. Plus the Anthropic Constitutional AI paper (arxiv.org/abs/2212.08073).
  • Done when: You can explain why a reward model is needed in PPO-style RLHF and how DPO sidesteps it.
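
The training loss is the standard pairwise (Bradley-Terry-style) objective: push the scalar score of the preferred response above the rejected one. A sketch:

```python
# Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
import torch.nn.functional as F

def reward_model_loss(score_chosen, score_rejected):
    # score_*: scalar outputs of the reward model for each response in the pair.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

Compare with the DPO sketch in Rung 06: DPO's "trick" is folding this same preference objective into the policy's own log-probabilities, so no separate scorer is trained.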

Rung 09-Eval for fine-tuned models

  • What: Pre-fine-tune eval on a frozen set. Post-fine-tune eval on the same set. Held-out generalization eval. Catastrophic forgetting check. A minimal harness is sketched below.
  • Why it earns its place: Without eval, you don't know if your fine-tune helped or hurt.
  • Resource: Sequence 12 evals + OpenAI's eval examples for fine-tuning.
  • Done when: Your fine-tune project has before/after evals on three task types.
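
A harness in miniature; `generate` and `score` stand in for your model call and metric, and the eval sets should include the target task, a held-out split, and an unrelated task for the forgetting check:

```python
# Run the same frozen eval sets against both checkpoints so any delta is
# attributable to the fine-tune.
def run_evals(generate, eval_sets, score):
    """generate: prompt -> completion; eval_sets: {name: [(prompt, reference), ...]}."""
    return {
        name: sum(score(generate(p), ref) for p, ref in examples) / len(examples)
        for name, examples in eval_sets.items()
    }

# Usage sketch (names hypothetical):
# before = run_evals(base_generate, eval_sets, exact_match)
# after  = run_evals(tuned_generate, eval_sets, exact_match)
# Compare per-set: the target task should improve; the held-out and
# unrelated sets should not regress.
```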

Rung 10-Catastrophic forgetting and continual learning

  • What: Fine-tuning on task A often degrades performance on task B. Mitigations: replay buffers, EWC (elastic weight consolidation), mixture training; one is sketched below.
  • Why it earns its place: A common production failure. Worth knowing the failure mode and the standard mitigations.
  • Resource: Continual Learning for Foundation Models survey (search arxiv).
  • Done when: You can describe one mitigation strategy.
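
One mitigation sketched with the Hugging Face datasets library: mixture/replay training, where a slice of general-purpose data keeps flowing through the fine-tuning stream. The dataset names are hypothetical and the replay ratio is empirical:

```python
# Replay / mixture training: rehearse general data while fine-tuning on the task.
from datasets import interleave_datasets, load_dataset

task_data = load_dataset("your-org/task-a", split="train")          # hypothetical
general_data = load_dataset("your-org/general-mix", split="train")  # hypothetical

# ~20% replay of general data; tune the ratio against your forgetting evals.
train_data = interleave_datasets(
    [task_data, general_data], probabilities=[0.8, 0.2], seed=42
)
```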

Rung 11-Open-source fine-tuning ecosystem

  • What: Axolotl (config-driven), Unsloth (speed- and memory-optimized kernels), TRL (Hugging Face's official trainers), LLaMA-Factory (config-driven with a web UI), Torchtune (PyTorch-native). Each has a niche.
  • Why it earns its place: Knowing the ecosystem accelerates picking the right tool.
  • Resource: Each project's GitHub README. Pick one to use deeply.
  • Done when: You've completed a fine-tuning run with at least one of these and can articulate when you'd use each.

Minimum required to leave this sequence

  • Articulate when to fine-tune vs RAG vs prompt-tune.
  • SFT a small model end-to-end.
  • LoRA fine-tune with PEFT.
  • QLoRA fine-tune a 7B model on a single GPU.
  • DPO fine-tune with TRL.
  • Before/after evals showing what changed.

Going further

  • Sebastian Raschka's blog posts on fine-tuning (sebastianraschka.com).
  • Hugging Face TRL examples-read every example script.
  • Axolotl docs and example configs-learn one config-driven workflow well.

How this sequence connects to the year

  • Month 8: This sequence IS the bulk of month 8.
  • Q3 (any track): Fine-tuning literacy is universal.
  • Capstone: A fine-tune + eval is a strong public artifact.
