09 - Fine-Tuning¶
What this session is¶
About 90 minutes. Adapt a pretrained model to your data using modern parameter-efficient fine-tuning (LoRA), which is feasible on consumer GPUs. By the end you'll have fine-tuned a small LLM on a custom dataset.
This page benefits from a GPU. CPU works but is very slow.
The two modes¶
- Full fine-tuning - update all model weights. Best quality; needs massive memory (a 7B model in FP16 needs ~28GB just for weights and gradients, before optimizer states). Beyond most beginners' budgets.
- Parameter-efficient fine-tuning (PEFT) - freeze the original weights and update only a tiny set of new parameters. LoRA is the most widely used method. Near-identical quality while training well under 1% of the parameter count. Runs on a single consumer GPU.
We'll do LoRA.
What LoRA actually does¶
Each big linear layer in a transformer (the nn.Linear from page 04, scaled up) is a weight matrix W of shape (d, d). Instead of updating W directly, LoRA freezes W and learns a low-rank update:

W' = W + A·B

where A is (d, r) and B is (r, d), so A·B is (d, d). The rank r is small (typically 8-64). The original W stays frozen; only A and B train.
Memory savings: instead of d × d parameters per layer (millions), you train d × r + r × d = 2·d·r (tens of thousands). For a 7B model with rank-16 LoRA on the attention projections: on the order of 10M trainable parameters instead of 7B.
You don't implement this - peft library handles it.
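Still, a short sketch makes the mechanics concrete. This is illustrative only - not how peft actually implements it - and the class name and init scale are my own choices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: frozen base path plus a scaled low-rank A·B path."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W (and its bias) stay frozen
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)  # (d, r)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))  # (r, d); zero init = no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

# The parameter math from above, concretely:
layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 16 = 131,072 trainable, vs ~16.8M frozen
```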
Setup¶
- peft - parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning, etc.).
- trl - Transformers Reinforcement Learning. Includes SFTTrainer, the easiest fine-tuning loop wrapper.
- bitsandbytes - 4-bit quantization, used by QLoRA.
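A typical install, assuming a CUDA environment for bitsandbytes (pin versions in real projects - the trl/peft APIs move fast):

```bash
pip install transformers datasets peft trl bitsandbytes
```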
A complete LoRA fine-tuning¶
We'll fine-tune a small model on a tiny dataset to make it answer in a specific style.
```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

model_name = "microsoft/Phi-3-mini-4k-instruct"

# Quantize the base model to 4-bit (this is the "Q" in QLoRA).
# bfloat16 needs an Ampere-or-newer GPU; on a T4, use torch.float16 instead.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA config
# Phi-3 fuses the query/key/value projections into a single qkv_proj module;
# Llama-style models use separate q_proj / k_proj / v_proj / o_proj names.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["qkv_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: ~9.4M || all params: ~3.8B || trainable%: ~0.25

# A tiny training dataset (in production: load real data with `datasets`)
examples = [
    {"text": "<|user|>\nWhat's 2+2?<|end|>\n<|assistant|>\nIt's 4, mate.<|end|>"},
    {"text": "<|user|>\nHello!<|end|>\n<|assistant|>\nG'day!<|end|>"},
    {"text": "<|user|>\nWhat's your favorite color?<|end|>\n<|assistant|>\nProbably blue, mate.<|end|>"},
    # ... in a real run, hundreds to thousands of examples ...
] * 50
train_ds = Dataset.from_list(examples)

# Training config (argument names shift between trl versions; this follows
# SFTConfig as introduced in trl 0.9, where dataset_text_field and
# max_seq_length live in the config, not the trainer)
cfg = SFTConfig(
    output_dir="./lora-out",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,  # fp16=True on pre-Ampere GPUs like the T4
    logging_steps=10,
    save_strategy="epoch",
    dataset_text_field="text",
    max_seq_length=512,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tok,  # processing_class=tok in trl >= 0.12
    train_dataset=train_ds,
    args=cfg,
)
trainer.train()
trainer.save_model("./lora-out/final")
```
The whole thing - model load, LoRA setup, training loop - fits in roughly 60 lines. SFTTrainer from trl wraps Hugging Face's Trainer with sensible defaults.
Run time: ~5-15 minutes on a free Colab T4 GPU for the small dataset above (use the fp16 fallbacks noted in the code; the T4 lacks bf16 support).
Use the fine-tuned model¶
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = PeftModel.from_pretrained(base, "./lora-out/final")
model.eval()

inputs = tok("<|user|>\nHi there!<|end|>\n<|assistant|>\n", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
# Hopefully responds in the trained style ("G'day mate!")
```
The fine-tuned model = base model + LoRA adapter. The adapter is small (a few tens of MB for our config); the base is shared.
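You can check this on disk (a quick sketch; the path matches the training example above):

```python
import os

adapter_dir = "./lora-out/final"
total = sum(
    os.path.getsize(os.path.join(adapter_dir, f))
    for f in os.listdir(adapter_dir)
    if os.path.isfile(os.path.join(adapter_dir, f))
)
print(f"adapter size: {total / 1e6:.1f} MB")  # tens of MB; the multi-GB base is unchanged
```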
Merge LoRA into the base (for deployment)¶
For inference at scale, you may want a single merged model:
```python
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
tok.save_pretrained("./merged-model")
```
Result: a standalone model with LoRA's updates baked in. Drops the adapter layer overhead at inference time.
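One caveat: merging into a base that is still loaded in 4-bit (as in the training example) forces a dequantize step and can cost quality. A common recipe - sketched here with the paths from the example above - is to reload the base in half precision, attach the adapter, then merge:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the base in bf16 (not 4-bit) so the LoRA update merges cleanly
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "./lora-out/final").merge_and_unload()
merged.save_pretrained("./merged-model")
```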
Hyperparameter notes¶
- `r` (LoRA rank) - 8, 16, 32, 64. Higher = more capacity, more memory. Start at 16.
- `lora_alpha` - usually 2×r. Acts as a scaling factor on the LoRA update.
- `target_modules` - which linear layers to LoRA-fy. Common: `["q_proj", "v_proj"]` for cheap, `["q_proj", "k_proj", "v_proj", "o_proj"]` for fuller coverage. Naming is model-specific (Phi-3 fuses these into `qkv_proj`, as in the example above).
- `learning_rate` - much higher than full fine-tuning (you're training far fewer parameters). 1e-4 to 5e-4 is typical.
- `per_device_train_batch_size` + `gradient_accumulation_steps` - the effective batch size is their product (2 × 4 = 8 in the config above). A small batch fits in memory; accumulation simulates a bigger one.
These need experimentation. Start with the defaults above; adjust.
QLoRA - even smaller memory¶
The BitsAndBytesConfig(load_in_4bit=True, ...) we used is QLoRA - quantize the base model to 4-bit, train LoRA adapters on top in higher precision. Lets you fine-tune 7B models on a 12GB GPU. The standard approach for hobbyist fine-tuning.
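Rough arithmetic behind that 12GB claim (back-of-envelope only; real usage adds activations, adapter gradients, and CUDA overhead):

```python
params = 7e9
print(f"fp16 weights:  {params * 2.0 / 1e9:.1f} GB")  # ~14.0 GB - already too big for a 12GB card
print(f"4-bit weights: {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB - leaves room for everything else
```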
What you can / can't fine-tune¶
LoRA fine-tuning is great for:
- Style adaptation - "respond in our brand's voice."
- Domain-specific Q&A - train on your support docs.
- Output format - JSON conformance, structured outputs.
- Tool / function calling - train the model to emit specific function calls.
LoRA is bad for:
- Teaching the model NEW factual knowledge. That requires more data + full fine-tuning, and the model often half-learns and hallucinates the rest. For facts, use RAG (page 10).
- Reasoning skill upgrades. These generally require lots of data + more compute than LoRA gives you.
Dataset format¶
Most fine-tuning recipes want a list of conversation strings in the model's chat format. Building one:
- Collect 50-1000+ example interactions in the desired style.
- Format each as a single string using the model's chat template (see the sketch below).
- Wrap the result in a datasets.Dataset.
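Steps 2 and 3 sketched below, using transformers' apply_chat_template so the string matches the model's expected format exactly (the raw field names here are illustrative):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Raw examples in whatever shape you collected them (illustrative field names)
raw = [
    {"question": "What's 2+2?", "answer": "It's 4, mate."},
    # ... hundreds more ...
]

def to_text(ex):
    messages = [
        {"role": "user", "content": ex["question"]},
        {"role": "assistant", "content": ex["answer"]},
    ]
    # tokenize=False returns the formatted string rather than token ids
    return {"text": tok.apply_chat_template(messages, tokenize=False)}

train_ds = Dataset.from_list(raw).map(to_text)
```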
Real datasets often live on Hugging Face Hub - datasets.load_dataset("squad") etc. Filter / format as needed.
Exercise¶
You need a GPU (or Colab) for this exercise. CPU works but takes hours.
- Run the example above. Train for 1 epoch on the toy dataset. Confirm training loss decreases.
- Use the trained model. Run a few prompts. Notice the style.
- Increase the dataset size. Add 10 more diverse examples. Re-train. Compare outputs.
- Tweak `r`: try `r=8` vs `r=64`. Quality difference? Memory difference?
- (Stretch) Use a real dataset from Hugging Face: `datasets.load_dataset("squad", split="train[:1000]")`. Format the QA pairs into the chat template. Fine-tune. Evaluate by hand.
What you might wonder¶
"Why is my fine-tuned model worse than the base?" Common causes: dataset too small (under ~100 examples), learning rate too high (model overfits and forgets general knowledge), bad data formatting (model is learning your formatting bugs not your style). Start with a known-good recipe and iterate.
"What's 'catastrophic forgetting'?" The fine-tuned model loses knowledge from its base training. Severe with full fine-tuning; minimal with LoRA (the base weights are frozen). One reason LoRA is the default.
"How do I evaluate the fine-tuned model?" Page 11. Critical and the hardest part of ML.
"DPO? RLHF? PPO? GRPO?" Reinforcement-learning-from-feedback techniques used by frontier labs to align chat models. Beyond beginner; mentioned for awareness.
Done¶
- Distinguish full fine-tuning from LoRA / PEFT.
- Set up `peft` + `trl` for a real LoRA training run.
- Train and save a fine-tuned model.
- Load and use the trained adapter.
- Pick reasonable hyperparameters.
Next: Retrieval-Augmented Generation →