Month 8-Week 2: LoRA + QLoRA-first fine-tune¶
Week summary¶
- Goal: Read LoRA and QLoRA papers. SFT a small model with LoRA. Eval before / after. Internalize when fine-tuning is the right tool.
- Time: ~10 h over 3 sessions.
- Output: Fine-tuned LoRA adapter; before/after eval; notebook documenting the process.
- Sequences relied on: 15-fine-tuning rungs 01–05.
Why this week matters¶
Fine-tuning is the bridge from "user of frontier models" to "shaper of model behavior." LoRA (and its quantized variant, QLoRA) made it accessible on a single GPU. Knowing when fine-tuning is the right tool, and especially when it isn't, is core literacy.
Prerequisites¶
- M08-W01 complete.
- GPU access continued.
Recommended cadence¶
- Session A-Tue/Wed evening (~3.5 h): papers + when not to fine-tune
- Session B-Sat morning (~4 h): first SFT with LoRA
- Session C-Sun afternoon (~2.5 h): QLoRA on bigger model + eval
Session A-LoRA + QLoRA papers + when not to fine-tune¶
Goal: Read both papers. Internalize when fine-tuning is the right tool.
Part 1-When NOT to fine-tune (45 min)¶
Common mistakes:
- "Add knowledge": use RAG instead. Fine-tuning bakes facts into weights, which are hard to update later.
- "Improve at long-context tasks": usually a context-length / prompting issue, not a weights issue.
- "Make the model good at my niche domain": try few-shot prompting first; fine-tune only if few-shot is insufficient.
When fine-tuning IS right:
- Change behavior, format, tone-not knowledge.
- Specialize on a narrow output structure.
- Compress a working long prompt into a smaller, faster model.
- Distill a strong model's behavior into a cheaper deployment.
Read: OpenAI's fine-tuning guide, plus Sebastian Raschka's practical advice on fine-tuning (sebastianraschka.com).
Part 2-LoRA paper (60 min)¶
Read: LoRA (arxiv.org/abs/2106.09685). Sections 1, 4, 5.
Key idea: instead of fine-tuning all weights, freeze them and add small low-rank update matrices. The trainable parameter count drops by 100–1000× (the paper reports up to 10,000× for GPT-3).
Math: a pretrained weight matrix W (d × k) is kept frozen and augmented additively as W + B·A, where B is d × r and A is r × k, with r << d, k. Typical r is 8–32.
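A minimal sketch of the arithmetic in PyTorch (illustrative shapes only; initialization follows the paper: A small random, B zero so the update starts at zero):
import torch

d, k, r = 1024, 1024, 16           # W is d x k; rank r << d, k
W = torch.randn(d, k)              # frozen pretrained weight
A = torch.randn(r, k) * 0.01       # trainable, small random init
B = torch.zeros(d, r)              # trainable, zero init, so B @ A = 0 at the start

x = torch.randn(k)
h = W @ x + B @ (A @ x)            # LoRA forward pass: original output + low-rank update

full_params = d * k                # 1,048,576 if we fine-tuned W directly
lora_params = r * (d + k)          # 32,768 trainable with r = 16 (a 32x reduction here)
print(full_params, lora_params)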
Part 3-QLoRA paper (75 min)¶
Read: QLoRA (arxiv.org/abs/2305.14314). Sections 1, 3, 4.
Key contributions:
- Quantize the base model to 4-bit using the NF4 format (information-theoretically optimal for normally distributed weights).
- Adapter weights stay in fp16/bf16.
- Double quantization for further memory savings.
- Paged optimizers to handle memory spikes.
Result: the paper fine-tunes a 65B model on a single 48GB GPU; a 7B model fits on a 16GB GPU.
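A back-of-envelope memory estimate for the 7B case (rough numbers; ignores activations, KV cache, and framework overhead):
params = 7e9
print(params * 2 / 1e9)    # ~14 GB of weights in bf16
print(params * 0.5 / 1e9)  # ~3.5 GB of weights at 4 bits each (NF4)
# Adam optimizer states (~8 bytes/param) are kept only for the LoRA adapter
# (tens of millions of params, well under 1 GB) instead of all 7B (~56 GB).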
Output of Session A¶
- Notes on when (not) to fine-tune.
- LoRA + QLoRA paper notes.
Session B-First fine-tune with TRL + PEFT¶
Goal: SFT Qwen2.5-0.5B (or similar small model) on a domain dataset using LoRA.
Part 1-Setup (30 min)¶
uv pip install transformers trl peft datasets accelerate bitsandbytes wandb
huggingface-cli login # for any gated models
wandb login
Part 2-Pick a dataset and a small model (45 min)¶
Model: Qwen/Qwen2.5-0.5B-Instruct (small, fits even on Colab T4).
Dataset: one of:
- databricks/databricks-dolly-15k (general).
- A synthetic dataset for your domain (e.g., generate 500 incident-report-to-triage pairs with Claude; see the sketch after this list).
- HuggingFaceH4/no_robots.
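If you go the synthetic route, a minimal generation loop could look like the following (a sketch, assuming the anthropic Python SDK; the model name, prompt, and output file are placeholders to adapt to your domain):
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

rows = []
for i in range(500):
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use whichever model you have access to
        max_tokens=512,
        messages=[{"role": "user", "content": f"Write one realistic incident report and its triage decision as a JSON object with keys 'report' and 'triage'. Vary the scenario (example {i})."}],
    )
    rows.append(json.loads(resp.content[0].text))  # sketch assumes clean JSON; validate and retry in practice

with open("synthetic_triage.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")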
Format the data conversation-style, one list of chat messages per example.
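A sketch of what one row can look like (the content is illustrative; the "messages" structure is the conversational format TRL's SFTTrainer accepts):
example = {
    "messages": [
        {"role": "user", "content": "Triage this incident report: payments API returning 500s since 09:12 UTC."},
        {"role": "assistant", "content": "Severity: P1. Owner: payments on-call. Next step: roll back the 09:05 deploy."},
    ]
}
# Recent TRL versions apply the model's chat template to rows shaped like this,
# so no manual prompt formatting is needed.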
Part 3-Training script (165 min)¶
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig  # get_peft_model isn't needed here; SFTTrainer applies the config itself
from trl import SFTConfig, SFTTrainer
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
peft_config = LoraConfig(
    r=16, lora_alpha=32,                      # rank-16 adapters; scaling factor alpha/r = 2
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)
ds = load_dataset("HuggingFaceH4/no_robots", split="train_sft").select(range(500))
cfg = SFTConfig(
    output_dir="ft-out",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,            # effective batch size 16
    learning_rate=2e-4,
    warmup_ratio=0.03,
    logging_steps=10,
    bf16=True,                                # requires an Ampere+ GPU; on a Colab T4 use fp16=True instead
    report_to="wandb",
)
trainer = SFTTrainer(
    model=model, args=cfg, train_dataset=ds,
    peft_config=peft_config,                  # SFTTrainer wraps the model with the LoRA adapter for you
    tokenizer=tokenizer,                      # recent TRL versions take processing_class=tokenizer instead
)
trainer.train()
trainer.save_model("ft-out/final")            # saves only the adapter weights, not the full model
Watch the loss curve in W&B; it should decrease steadily over the run.
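A quick smoke test after training, continuing from the script above (so trainer and tokenizer are in scope; the prompt is illustrative):
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize this bug report in one line: checkout intermittently fails with a 502."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(trainer.model.device)
out = trainer.model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))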
Output of Session B¶
- Trained adapter at ft-out/final/.
- W&B run with loss curve.
Session C-QLoRA on a bigger model + eval¶
Goal: QLoRA-fine-tune a 7B model. Compare base vs fine-tuned on your eval set.
Part 1-QLoRA config (30 min)¶
from transformers import BitsAndBytesConfig
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                # the NF4 format from the QLoRA paper
    bnb_4bit_compute_dtype="bfloat16",        # matmuls run in bf16
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb,
    device_map="auto",                        # place the quantized weights on the GPU
)
Adjust the LoRA config for the larger model (same r, more target_modules, now including the FFN projections).
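For example (module names assume a Qwen2/LLaMA-style architecture; check model.named_modules() if you pick a different model):
peft_config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # FFN / MLP projections
    ],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)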
Part 2-Train + save (90 min)¶
Use the same SFTTrainer setup; typical config adjustments are sketched below. Train for one epoch (small datasets overfit quickly). Expect ~30–60 min on a single A10.
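A sketch of the config changes for the 7B run (exact values depend on your GPU):
cfg = SFTConfig(
    output_dir="ft-7b-out",
    num_train_epochs=1,               # one pass; small datasets overfit fast
    per_device_train_batch_size=1,    # a 4-bit 7B still needs small per-device batches
    gradient_accumulation_steps=16,   # keep the effective batch size at 16
    gradient_checkpointing=True,      # trade compute for memory
    learning_rate=2e-4,
    logging_steps=10,
    bf16=True,
    report_to="wandb",
)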
Part 3-Eval before / after (60 min)¶
Use your M04-W03 / M06-W03 eval setup. Run on 30 examples:
- Base model.
- Fine-tuned model (with adapter loaded).
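Loading the two variants for the comparison (a sketch; the adapter path assumes the Session C run saved to ft-7b-out/final):
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Base model first (reuse quantization_config=bnb from above if GPU memory is tight).
base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
# ... run the 30 eval examples against `base` and record scores ...

# Then attach the adapter; PeftModel injects the LoRA weights into `base` in place.
finetuned = PeftModel.from_pretrained(base, "ft-7b-out/final")
# ... run the same 30 examples against `finetuned` ...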
Compare:

| Metric | Base | Fine-tuned | Δ |
|---|---|---|---|
| Format-conformance | 0.66 | 0.92 | +0.26 |
| Severity match | 0.71 | 0.78 | +0.07 |
| Faithfulness (judge) | 4.0 | 3.9 | -0.1 |
Common pattern: fine-tuning helps format/structure dramatically; helps factual quality less; can hurt if dataset is too narrow ("catastrophic forgetting").
Honest write-up in repo.
Output of Session C¶
- 7B QLoRA adapter trained.
- Before/after eval committed.
End-of-week artifact¶
- LoRA + QLoRA paper notes
- Small-model LoRA fine-tune
- 7B-model QLoRA fine-tune
- Before/after eval with delta documented
End-of-week self-assessment¶
- I can articulate when to fine-tune vs RAG vs few-shot prompting.
- I can write a TRL SFTTrainer config from a blank file.
- I have measured my fine-tune's effect on a real eval set.
Common failure modes for this week¶
- Fine-tuning to "improve quality" without specifying what you're improving. Always specify.
- No before/after eval. Without it, you don't know if fine-tuning helped.
- Too small a dataset (< 100 examples): generally insufficient unless the target behavior is very narrow.
What's next (preview of M08-W03)¶
DPO (direct preference optimization): the simpler, more elegant alternative to the PPO stage of RLHF.