
09 - Fine-Tuning

What this session is

About 90 minutes. Adapt a pretrained model to your data. Modern parameter-efficient fine-tuning (LoRA) - feasible on consumer GPUs. By the end you'll have fine-tuned a small LLM on a custom dataset.

This page benefits from a GPU. CPU works but is very slow.

The two modes

  • Full fine-tuning - update all model weights. Best quality; needs massive memory (a 7B model in FP16 needs ~28GB just for weights and gradients, with Adam optimizer states on top of that). Beyond most beginners' budgets.
  • Parameter-efficient fine-tuning (PEFT) - update only a tiny subset of new parameters. LoRA is the most widely used. Often comparable quality while training well under 1% of the parameters. Runs on a single consumer GPU.

We'll do LoRA.

What LoRA actually does

Each big linear layer in a transformer (the nn.Linear from page 04, scaled up) is a matrix W. Instead of updating W directly, LoRA learns a low-rank update:

W_new = W + (alpha / r) * (A @ B)

Where A is (d, r) and B is (r, d). The rank r is small (typically 8-64), and alpha is a fixed scaling constant (the lora_alpha in the config below). The original W stays frozen; only A and B train.

Memory savings: instead of d × d parameters per layer (millions), you train d × r + r × d (tens of thousands). For a 7B model with rank-16 LoRA: ~10M trainable parameters instead of 7B.

You don't implement this yourself - the peft library handles it.
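Still, a minimal sketch makes the mechanics concrete (an illustration of the idea, not peft's actual implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # W stays frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01) # (d, r)
        self.B = nn.Parameter(torch.zeros(r, d_out))       # (r, d), zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(512, 512), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(trainable, frozen)  # 16384 trainable vs 262656 frozen (weights + bias)
```

Zero-initializing B means the model starts out behaving exactly like the base; training then nudges it away gradually.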

Setup

pip install transformers datasets accelerate peft trl bitsandbytes
  • peft - parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning, etc.).
  • trl - Transformers Reinforcement Learning. Includes SFTTrainer, the easiest fine-tuning loop wrapper.
  • bitsandbytes - 4-bit quantization, used by QLoRA.

A complete LoRA fine-tuning

We'll fine-tune a small model on a tiny dataset to make it answer in a specific style.

import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

model_name = "microsoft/Phi-3-mini-4k-instruct"
# Quantized to 4-bit
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA config
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: 6.3M || all params: 3.8B || trainable: 0.16%

# A tiny training dataset (in production: load real data with `datasets`)
examples = [
    {"text": "<|user|>\nWhat's 2+2?<|end|>\n<|assistant|>\nIt's 4, mate.<|end|>"},
    {"text": "<|user|>\nHello!<|end|>\n<|assistant|>\nG'day!<|end|>"},
    {"text": "<|user|>\nWhat's your favorite color?<|end|>\n<|assistant|>\nProbably blue, mate.<|end|>"},
    # ... in a real run, hundreds to thousands of examples ...
] * 50

train_ds = Dataset.from_list(examples)

# Training config
cfg = SFTConfig(
    output_dir="./lora-out",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=512,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    processing_class=tok,   # older trl versions: tokenizer=tok
    train_dataset=train_ds,
    args=cfg,
)

trainer.train()
trainer.save_model("./lora-out/final")

The whole thing - model load, LoRA setup, training loop - fits in ~50 lines. SFTTrainer from trl wraps Hugging Face's Trainer with sensible defaults.

Run time: ~5-15 minutes on a free Colab T4 GPU for the small dataset above.

Use the fine-tuned model

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", device_map="auto", torch_dtype=torch.bfloat16
)
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = PeftModel.from_pretrained(base, "./lora-out/final")
model.eval()

inputs = tok("<|user|>\nHi there!<|end|>\n<|assistant|>\n", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
# Hopefully responds in the trained style ("G'day mate!")

The fine-tuned model = base model + LoRA adapter. The adapter is small (~30MB for our config); the base is shared.

Merge LoRA into the base (for deployment)

For inference at scale, you may want a single merged model. Merge from a base loaded in full or half precision (as in the loading snippet above), not the 4-bit quantized one:

merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
tok.save_pretrained("./merged-model")

Result: a standalone model with LoRA's updates baked in. Drops the adapter layer overhead at inference time.

Hyperparameter notes

  • r (LoRA rank) - 8, 16, 32, 64. Higher = more capacity, more memory. Start at 16.
  • lora_alpha - usually 2× r. The adapter update is scaled by lora_alpha / r.
  • target_modules - which linear layers to LoRA-fy. Common: ["q_proj", "v_proj"] for cheap, ["q_proj", "k_proj", "v_proj", "o_proj"] for fuller coverage. Model-specific naming.
  • learning_rate - much higher than full fine-tuning (because you have fewer params). 1e-4 to 5e-4 typical.
  • per_device_train_batch_size + gradient_accumulation_steps - effective batch is the product. Small batch fits memory; accumulation simulates a bigger batch.

These need experimentation. Start with the defaults above; adjust.
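Two of these knobs are just arithmetic. A quick sanity check on the config above (hidden size 3072 is an assumption for Phi-3-mini; check the model config for other models):

```python
# Effective batch = per-device batch size x gradient accumulation steps
per_device_bs, grad_accum = 2, 4
effective_batch = per_device_bs * grad_accum
print(effective_batch)  # 8

# Trainable LoRA params per square (d, d) linear layer: d*r + r*d = 2*d*r
d = 3072  # assumed hidden size
for r in (8, 16, 32, 64):
    print(r, 2 * d * r)  # 49152, 98304, 196608, 393216
```

Doubling r doubles the trainable parameter count per layer, which is why rank is the first knob to reach for when trading capacity against memory.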

QLoRA - even smaller memory

The BitsAndBytesConfig(load_in_4bit=True, ...) we used is QLoRA - quantize the base model to 4-bit, train LoRA adapters on top in higher precision. Lets you fine-tune 7B models on a 12GB GPU. The standard approach for hobbyist fine-tuning.

What you can / can't fine-tune

LoRA fine-tuning is great for:

  • Style adaptation - "respond in our brand's voice."
  • Domain-specific Q&A - train on your support docs.
  • Output format - JSON conformance, structured outputs.
  • Tool / function calling - train the model to emit specific function calls.

LoRA is bad for:

  • Teaching the model NEW factual knowledge. That requires more data + full fine-tuning, and the model often half-learns and hallucinates the rest. For facts, use RAG (page 10).
  • Reasoning skill upgrades. Generally requires lots of data + more compute than LoRA gives.

Dataset format

Most fine-tuning recipes want a list of conversation strings in the model's chat format. Building one:

  1. Collect 50-1000+ example interactions in the desired style.
  2. Format each as a single string using the model's chat template.
  3. Wrap in a datasets.Dataset.

Real datasets often live on Hugging Face Hub - datasets.load_dataset("squad") etc. Filter / format as needed.
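Step 2 can be sketched with a hand-rolled formatter for Phi-3's tags, matching the strings in the toy dataset above (for other models, prefer the tokenizer's own apply_chat_template, which knows the right tags):

```python
def to_phi3_text(messages):
    """Join role-tagged turns into the Phi-3 chat format used earlier on this page."""
    return "\n".join(f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages)

conv = [
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "It's 4, mate."},
]
print(to_phi3_text(conv))
# <|user|>
# What's 2+2?<|end|>
# <|assistant|>
# It's 4, mate.<|end|>
```

Map this over your raw conversations into {"text": ...} dicts, then wrap them in Dataset.from_list as in the training example.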

Exercise

You need a GPU (or Colab) for this exercise. CPU works but takes hours.

  1. Run the example above. Train for 1 epoch on the toy dataset. Confirm training loss decreases.

  2. Use the trained model. Run a few prompts. Notice the style.

  3. Increase the dataset size. Add 10 more diverse examples. Re-train. Compare outputs.

  4. Tweak r: try r=8 vs r=64. Quality difference? Memory difference?

  5. (Stretch) Use a real dataset from Hugging Face: datasets.load_dataset("squad", split="train[:1000]"). Format the QA pairs into the chat template. Fine-tune. Evaluate by hand.

What you might wonder

"Why is my fine-tuned model worse than the base?" Common causes: dataset too small (under ~100 examples), learning rate too high (model overfits and forgets general knowledge), bad data formatting (model is learning your formatting bugs not your style). Start with a known-good recipe and iterate.

"What's 'catastrophic forgetting'?" The fine-tuned model loses knowledge from its base training. Severe with full fine-tuning; minimal with LoRA (the base weights are frozen). One reason LoRA is the default.

"How do I evaluate the fine-tuned model?" Page 11. Critical and the hardest part of ML.

"DPO? RLHF? PPO? GRPO?" Reinforcement-learning-from-feedback techniques used by frontier labs to align chat models. Beyond beginner; mentioned for awareness.

Done

  • Distinguish full fine-tuning from LoRA / PEFT.
  • Set up peft + trl for a real LoRA training run.
  • Train and save a fine-tuned model.
  • Load and use the trained adapter.
  • Pick reasonable hyperparameters.

Next: Retrieval-Augmented Generation →
