AI Systems From Scratch (Beginner)¶
Beginner path: heard-of-ChatGPT → training a small net, fine-tuning with LoRA, building RAG, serving locally, contributing to AI OSS.
AI Systems From Scratch - Beginner to OSS Contributor¶
From "I've heard of AI / LLMs" to "I can train a small model, fine-tune a transformer, build a small RAG app, evaluate it honestly, and submit a fix to an AI-adjacent OSS project."
Who this is for¶
- You've finished Python From Scratch (or you're comfortable enough in Python to write small programs).
- You've never trained a model, OR you've copy-pasted some PyTorch / Hugging Face code without really understanding what it does.
Soft prerequisite¶
Python comfort is mandatory - AI tooling lives in Python. If you can't write a function, walk a list, and read a stack trace, do Python From Scratch first.
You do not need a PhD in math. We use linear algebra at the level of "dot products and matrices"; we explain everything else as it appears.
What you'll need¶
- A computer. A GPU helps a lot but isn't required for the first 8 pages. Options for hands-on GPU work: Google Colab (free tier), Kaggle Notebooks (free), Lambda Labs (cheap, pay-per-minute), AWS / GCP (paid).
- Python ≥3.10 (you set this up in the Python beginner path).
- A text editor.
- About 5 hours/week. Path is sized for 4-6 months.
Why AI systems¶
- Biggest growth area in software. The job market and OSS activity around LLMs / ML infra is the most active it's ever been.
- OSS is the heart of the field. PyTorch, Hugging Face, vLLM, llama.cpp, Ollama, LangChain - all open-source and welcoming.
- The barrier is lower than it looks. Modern tooling lets you fine-tune real models with ~50 lines of code; serve them with another ~30.
How this path works¶
Same template as the other beginner paths: one concept per page, code first then walkthrough, exercise, Q&A, done recap.
We use PyTorch as the framework throughout - most popular, best ecosystem. Hugging Face Transformers for pre-trained models. vLLM / Ollama / llama.cpp for inference.
The pages¶
| # | Title | What you'll know after |
|---|---|---|
| 00 | Introduction | What we're doing and why |
| 01 | Setup | Python + PyTorch + CUDA (or CPU) working |
| 02 | Tensors | PyTorch's central data type |
| 03 | Linear algebra you actually need | Dot products, matmul, gradients (intuitive) |
| 04 | Your first neural network | A small MLP from scratch |
| 05 | Training loop | Loss, optimizer, gradient descent |
| 06 | Inference and saving | Loading a pretrained model, running it |
| 07 | Transformers and tokenization | What an LLM actually does |
| 08 | Hugging Face Transformers | Pre-trained models in 3 lines |
| 09 | Fine-tuning | Adapt a model to your data (LoRA-friendly) |
| 10 | Retrieval-Augmented Generation | Embeddings + vector DB + LLM |
| 11 | Evaluation | The hardest part of ML done seriously |
| 12 | Serving models | vLLM, Ollama, simple HTTP wrappers |
| 13 | Picking a project | AI-OSS candidates |
| 14 | Anatomy of an AI OSS project | Case study |
| 15 | Your first contribution | Workflow + PR |
Start with Introduction.
00 - Introduction¶
What this session is¶
A 10-minute read. No code. Sets expectations.
What you're going to be able to do, eventually¶
By the end:
- Manipulate tensors confidently with PyTorch.
- Build, train, and use a small neural network from scratch.
- Load a pre-trained transformer from Hugging Face and use it.
- Fine-tune that transformer on your own data (parameter-efficient with LoRA).
- Build a small Retrieval-Augmented Generation (RAG) app.
- Evaluate model quality the right way (most people get this wrong).
- Serve a model behind an HTTP API.
- Clone an AI OSS project, find a small fix, submit a PR.
The deal¶
- It's slow on purpose. One concept per page.
- Python fluency assumed. Read a stack trace, write a function, walk a list.
- No math PhD required. Linear algebra at the "dot product and matmul" level. We explain everything else inline.
- GPU is helpful but not mandatory. Pages 01-08 work on CPU. Page 09+ benefits from GPU; Google Colab's free tier suffices.
- You will be confused. Often. AI has more vocabulary than any other technical area on this site. Don't panic.
A note on hype vs honesty¶
The AI field has more hype than any other in software. To stay sane:
- Models are token predictors. They are not "intelligent" in the way the marketing implies. They are very good at pattern completion over enormous corpora. That's an extraordinary thing - and that's all it is.
- Most "AI products" are wrappers around APIs. The actual engineering: tokenization, retrieval, prompt design, evaluation. The "model" itself is often someone else's pre-trained checkpoint.
- Evaluation is the hard part. "Looks good" is not evaluation. We'll do this properly in page 11.
This path treats AI as a practical engineering domain - what works, how it's built, how to ship it. We don't speculate about AGI.
What you need¶
- A computer (any OS).
- Python ≥3.10 (set up in Python From Scratch path).
- A text editor.
- ~5 hours/week. Path is sized for 4-6 months.
- A GPU for pages 09+ (or use Google Colab / Kaggle for free).
What you do NOT need¶
- A PhD or MS.
- A formal math background beyond high school algebra + intuitive linear algebra (we cover what you need).
- A cloud account or paid API. Open-source models run locally; we use them.
- C++ / CUDA. Those are senior-path material (AI Systems senior reference).
How long this realistically takes¶
4-6 months at 5 hours/week to "submit a PR."
The slowest pages are 07 (transformers) and 09 (fine-tuning). Plan for one or two re-reads at each.
What success looks like¶
You'll be able to:
- Look at a model.py in any HF model and roughly understand what it does.
- Build a small project end-to-end: load data, train, evaluate, serve.
- Read a research paper's abstract + introduction + experiments section and predict what their code does.
- Submit a fix to a real AI OSS project.
You will not be able to:
- Train a frontier LLM. (Multi-million-dollar GPU farms; not in 6 months.)
- Tell people you're "an ML engineer." (Years of work past this.)
- Pass a FAANG ML interview. (Different focus - leetcode plus theory.)
What you'll have: the foundation to keep going. The AI Expert Roadmap is the natural follow-up - 12 months of structured study from here.
One last thing before we start¶
If a page feels too dense - stop, re-read. Still dense? Skip, come back.
The AI field uses jargon shamelessly. When a word appears you haven't seen, this path defines it inline. If a word slips through without definition, that's a bug - note it.
Ready? Next: Setup →
01 - Setup¶
What this session is¶
About 45 minutes. Install PyTorch (with GPU support if you have one), confirm it works, run a tiny tensor program.
Step 1: Pick your Python environment¶
You should have Python ≥3.10 from the Python beginner path. Create a fresh virtual environment for this path:
mkdir -p ~/code/ai-learning
cd ~/code/ai-learning
python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
.venv\Scripts\activate # Windows
You'll work inside this .venv for the whole path. Always activate it before working.
Step 2: Install PyTorch¶
Go to pytorch.org/get-started/locally. The site has a config-builder that gives you the exact pip install command for your OS / Python / CUDA.
Typical commands:
CPU only (any platform):
Linux + NVIDIA GPU (CUDA 12):
macOS (Apple Silicon - uses MPS, Apple's Metal-based backend):
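For reference, the commands for those three cases typically look like the following. Treat them as representative, not exact - in particular the cu12x index URL tracks your CUDA version, so prefer the command the pytorch.org selector gives you:

```shell
# CPU only (any platform)
pip install torch

# Linux + NVIDIA GPU (CUDA 12.x) - index URL from the pytorch.org selector
pip install torch --index-url https://download.pytorch.org/whl/cu121

# macOS (Apple Silicon) - the default wheel includes the MPS backend
pip install torch
```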
PyTorch on macOS automatically uses MPS when available. It works well for small models, though it's not as fast as CUDA. Verify the install:
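A minimal verification snippet:

```python
import torch

# Quick install check: version string and CUDA availability
print(torch.__version__)          # e.g. "2.4.0"
print(torch.cuda.is_available())  # True only if a CUDA build found a GPU
```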
Should print a version (e.g., 2.4.0) and True or False depending on whether CUDA is available.
If you have an NVIDIA GPU and cuda.is_available() is False: the install picked the CPU-only wheel. Reinstall with the CUDA URL.
If you have Apple Silicon:
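The check, using PyTorch's MPS backend flag:

```python
import torch

# Apple Silicon: check that the MPS (Metal) backend is usable
print(torch.backends.mps.is_available())
```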
Should be True.
Step 3: Install supporting libraries¶
The rest of the path uses these:
pip install jupyter ipykernel numpy pandas matplotlib
pip install transformers datasets accelerate
pip install sentence-transformers faiss-cpu # for RAG (page 10)
pip install scikit-learn # for utilities
pip install httpx # for serving (page 12)
That's a lot at once. Each tool has its own purpose:
- transformers - Hugging Face's library; loading and using pre-trained models.
- datasets - Hugging Face's data-loading library.
- accelerate - multi-GPU + mixed-precision helper.
- sentence-transformers + faiss - embeddings + vector search for RAG.
- scikit-learn - classical ML utilities (data splits, metrics).
If any installation fails, read the error. Often it's a missing system dependency (e.g., libstdc++); search the error message for guidance.
Step 4: First PyTorch program¶
Create tensor_hello.py:
import torch
# Create a tensor - PyTorch's central data type
x = torch.tensor([1.0, 2.0, 3.0])
print("tensor:", x)
print("shape:", x.shape)
print("dtype:", x.dtype)
# Some math
y = torch.tensor([4.0, 5.0, 6.0])
print("x + y:", x + y)
print("x · y:", torch.dot(x, y))
# A 2D tensor (matrix)
mat = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print("matrix:")
print(mat)
print("matrix shape:", mat.shape)
# Move to GPU if available
if torch.cuda.is_available():
    x_gpu = x.cuda()
    print("x on GPU:", x_gpu.device)
elif torch.backends.mps.is_available():
    x_mps = x.to("mps")
    print("x on MPS:", x_mps.device)
else:
    print("running on CPU")
Run:
python tensor_hello.py
You should see output like:
tensor: tensor([1., 2., 3.])
shape: torch.Size([3])
dtype: torch.float32
x + y: tensor([5., 7., 9.])
x · y: tensor(32.)
matrix:
tensor([[1., 2.],
[3., 4.]])
matrix shape: torch.Size([2, 2])
running on CPU
(32 because 1·4 + 2·5 + 3·6 = 32. We'll cover dot products properly in page 03.)
Step 5: Jupyter notebooks (optional but valuable)¶
ML work happens in notebooks more than scripts. They mix code, output, plots, and prose in one document.
Run jupyter notebook. It opens a browser. Click "New" → "Python 3". You get a cell-based editor. Each cell runs independently; output appears below it. Great for exploration.
Many tutorials are notebook files (.ipynb). You'll meet them. VS Code also has a built-in notebook UI.
Step 6: Google Colab as a backup¶
For pages 09+ you'll want a GPU. If you don't have one:
colab.research.google.com gives you a free Jupyter notebook with a GPU (~T4-class) for a few hours per session.
For more compute: Kaggle Notebooks (free), Lambda Labs (paid by the minute), or any cloud provider.
We'll note when a page benefits from GPU.
A note on the AI ecosystem's pace¶
AI libraries change fast. PyTorch APIs are stable across minor versions but the broader ecosystem (Hugging Face, vLLM, training optimizers) iterates monthly. When in doubt, read the library's current docs, not blog posts from 2023.
Exercise¶
- Verify the PyTorch installation (re-run the check from Step 2).
- Run tensor_hello.py above. Confirm the output.
- Modify it:
  - Create a tensor z = torch.arange(10). Print it. (Range from 0 to 9.)
  - Create a random 3×3 matrix with torch.randn(3, 3). Print it.
  - Compute the matrix's sum (m.sum()) and mean (m.mean()).
- (Optional) Set up Jupyter: create a new notebook and run the tensor code in cells.
What you might wonder¶
"Why PyTorch and not TensorFlow / JAX?" PyTorch is the dominant ML framework as of 2026 - research uses it, most OSS uses it, the job market wants it. JAX has its place (Google ecosystem, transformer research); TensorFlow is mostly legacy. Stick with PyTorch.
"What is CUDA?"
NVIDIA's parallel computing platform - the way GPUs run general-purpose code. PyTorch built with CUDA support uses your GPU automatically when you .cuda() tensors. AMD GPUs use ROCm; Apple Silicon uses MPS.
"My GPU isn't supported / I don't have one." Use CPU for pages 01-08. They run fine. For pages 09+, use Google Colab's free tier.
"How much disk space will this take?" ~3-5 GB for PyTorch and core deps. Pre-trained models you download in later pages can add 1-10 GB each. Plan for 30-50 GB free if you'll experiment broadly.
Done¶
You have:
- Python venv with PyTorch installed (CPU or GPU).
- Supporting libraries (transformers, datasets, sentence-transformers, faiss).
- Verified PyTorch can create and operate on tensors.
- (Optional) Jupyter set up; Colab as backup.
02 - Tensors¶
What this session is¶
About 45 minutes. Tensors are PyTorch's central data type - multi-dimensional arrays, with support for GPU acceleration and automatic differentiation. Almost every line of PyTorch code touches tensors.
What a tensor is¶
A tensor is a generalized array:
- A 0-dimensional tensor is a single number (scalar): 7.
- A 1-D tensor is a vector: [1, 2, 3].
- A 2-D tensor is a matrix: [[1, 2], [3, 4]].
- A 3-D tensor is a cube of numbers. (Often used for color images: [height, width, channels].)
- Higher dimensions: a batch of images, a batch of token sequences, etc.
Every tensor has a shape (its size per dimension) and a dtype (the type of each element).
Creating tensors¶
import torch
# From a Python list
a = torch.tensor([1, 2, 3])
print(a.shape, a.dtype) # torch.Size([3]) torch.int64
# As floats
b = torch.tensor([1.0, 2.0, 3.0])
print(b.shape, b.dtype) # torch.Size([3]) torch.float32
# Zeros, ones, random
z = torch.zeros(2, 3) # 2x3 of zeros
o = torch.ones(2, 3)
r = torch.randn(2, 3) # random normal (mean=0, std=1)
u = torch.rand(2, 3) # random uniform [0, 1)
i = torch.arange(0, 10) # 0, 1, 2, ..., 9
# An identity matrix
I = torch.eye(4)
Default float dtype is float32. Default int dtype is int64. You can specify:
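For example, passing dtype= at construction time overrides the default:

```python
import torch

f64 = torch.tensor([1, 2, 3], dtype=torch.float64)  # ints forced to doubles
print(f64.dtype)   # torch.float64

h = torch.zeros(2, 3, dtype=torch.float16)          # half precision
print(h.dtype)     # torch.float16
```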
Shape and reshape¶
a = torch.arange(12)
print(a.shape) # torch.Size([12])
b = a.reshape(3, 4) # 3x4 matrix
print(b.shape) # torch.Size([3, 4])
c = a.reshape(2, 2, 3) # 2x2x3 tensor
print(c.shape) # torch.Size([2, 2, 3])
reshape(...) doesn't copy data when it can avoid it - it just changes the "view" on the underlying buffer.
The -1 placeholder means "infer this dimension":
a = torch.arange(12)
b = a.reshape(-1, 4) # 3 rows of 4 (12/4)
c = a.reshape(2, -1) # 2 rows of 6 (12/2)
Useful in functions where you know all but one dimension.
Indexing¶
Like NumPy, like Python lists, but extended:
m = torch.arange(12).reshape(3, 4)
# m is:
# tensor([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
m[0] # first row: tensor([0, 1, 2, 3])
m[0, 0] # first element: tensor(0)
m[:, 0] # first column: tensor([0, 4, 8])
m[1:, 2:] # rows 1+, cols 2+: tensor([[6, 7], [10, 11]])
m[0:2, 0:2] # 2x2 top-left
Slicing returns a view (shares storage). Modifying the slice modifies the original. Use .clone() if you need an independent copy.
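A quick demonstration of the sharing, and the .clone() escape hatch:

```python
import torch

m = torch.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
row = m[0]            # a view: shares storage with m
row[0] = 99
print(m[0, 0])        # tensor(99) - the original changed too

safe = m[1].clone()   # independent copy
safe[0] = -1
print(m[1, 0])        # tensor(3) - original untouched
```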
Arithmetic¶
Element-wise:
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b) # [5, 7, 9]
print(a * b) # [4, 10, 18] element-wise multiply
print(a ** 2) # [1, 4, 9]
print(torch.exp(a)) # [e^1, e^2, e^3]
print(torch.sin(a))
Reductions (collapse a dimension):
m = torch.randn(3, 4)
m.sum() # scalar
m.sum(dim=0) # column sums (4 values)
m.sum(dim=1) # row sums (3 values)
m.mean()
m.max()
m.argmax() # index of maximum
dim= is the dimension to reduce over. dim=0 collapses the rows; dim=1 collapses the columns. Confusing the first time; you'll internalize it.
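A concrete example makes the direction easier to see:

```python
import torch

m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
print(m.sum(dim=0))   # tensor([5., 7., 9.]) - rows collapsed: one sum per column
print(m.sum(dim=1))   # tensor([ 6., 15.])   - columns collapsed: one sum per row
```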
Matrix multiplication¶
The most-used operation in ML. It is not the same as element-wise *.
A = torch.randn(2, 3) # 2x3
B = torch.randn(3, 4) # 3x4
C = A @ B # 2x4 - matrix multiply
# or: torch.matmul(A, B)
The @ operator is matrix multiply. Two requirements:
- Inner dimensions match: (2, 3) @ (3, 4) works because both have 3 in the middle.
- Result is the outer dimensions: (2, 3) @ (3, 4) → (2, 4).
If they don't match, you get an error. Get the dimensions right first; everything else follows.
Broadcasting¶
When you operate on tensors of different shapes, PyTorch tries to make them match by broadcasting the smaller one along the matching dimensions:
a = torch.tensor([[1, 2, 3], [4, 5, 6]]) # shape (2, 3)
b = torch.tensor([10, 20, 30]) # shape (3,)
print(a + b)
# tensor([[11, 22, 33],
# [14, 25, 36]])
b was broadcast across the rows of a. Equivalent to adding [10, 20, 30] to each row.
The rules are precise but the intuition is "align from the right; missing dimensions are filled in by repeating":
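Two common cases of that right-alignment rule:

```python
import torch

a = torch.ones(2, 3)
b = torch.tensor([10., 20., 30.])   # shape (3,) aligns with a's last dimension
print((a + b).shape)                # torch.Size([2, 3])

c = torch.tensor([[100.], [200.]])  # shape (2, 1): the size-1 dim is repeated
print((a + c).shape)                # torch.Size([2, 3])
```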
When in doubt, print shapes. Most "shape mismatch" errors come from this; once you see the shapes, the fix is usually obvious.
Move tensors to GPU¶
device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
print("using:", device)
a = torch.randn(1000, 1000).to(device)
b = torch.randn(1000, 1000).to(device)
c = a @ b # runs on GPU if device is cuda/mps
Tensors and operations have to be on the same device. Mixing CPU and GPU tensors raises an error.
Common idiom: define device once at the top of the script; .to(device) every tensor you create.
NumPy interop¶
PyTorch tensors and NumPy arrays interoperate:
import numpy as np
n = np.array([1, 2, 3])
t = torch.from_numpy(n) # tensor sharing memory with the array
back = t.numpy() # numpy array sharing memory with the tensor
If the tensor is on CPU, this is free (no copy). On GPU, you have to .cpu() first.
NumPy is the older sibling - PyTorch borrows most of its API conventions from NumPy. If you've used NumPy, PyTorch tensors will feel familiar.
Exercise¶
In a new script tensor_practice.py:
- Create a 5×3 tensor of random normal values. Print its shape and mean.
- Create the same tensor and add 1.0 to every element. (Hint: just tensor + 1.)
- Create a 3×3 identity matrix; create another 3×3 matrix with torch.arange(9).reshape(3, 3).float(). Multiply them with @. What's the result?
- Create a = torch.arange(20).reshape(4, 5). Get the third row. Get the second column. Get the bottom-right 2×2 submatrix.
- Broadcasting: create a = torch.zeros(3, 4) and b = torch.tensor([1, 2, 3, 4]). Compute a + b. What shape? What values?
- GPU (if available): create two 1000×1000 random matrices. Time how long a @ b takes on CPU vs your device. Use time.time() around the multiplications.
What you might wonder¶
"Why are tensors not just NumPy arrays?" PyTorch tensors add: GPU support, automatic differentiation (page 04), automatic device placement, and a richer API for ML-specific operations. They're NumPy++.
"What's float32 vs float16 vs bfloat16?"
Number formats with different precision/memory trade-offs. float32 (FP32) is the default - 4 bytes per number, lots of precision. float16 and bfloat16 are half-precision (2 bytes); used heavily in training large models for memory savings. Modern GPUs (Volta+) have tensor cores that specifically accelerate these.
"Why both reshape and view?"
view requires the data to be contiguous in memory. reshape may copy if needed. Prefer reshape; reach for view only when you've measured it matters.
"My tensors are on different devices and I'm confused."
Set a device = ... constant at the top of your script. Always .to(device) after creation. This rule alone eliminates 80% of device-mismatch bugs.
Done¶
- Create tensors with various constructors.
- Reshape, index, slice.
- Use element-wise arithmetic, matrix multiplication.
- Use broadcasting confidently.
- Move tensors between CPU and GPU.
Next: Linear algebra you actually need →
03 - Linear Algebra You Actually Need¶
What this session is¶
About 30 minutes. The math for neural networks at the intuitive level. No proofs. By the end you'll know what a dot product, matrix multiply, and gradient are - and what they mean in ML code.
Dot product¶
The dot product of two vectors a and b of the same length is a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ - a single number. It measures how aligned the vectors are: large when they point the same way; zero when perpendicular; negative when opposite.
import torch
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(torch.dot(a, b)) # 32.0
# (1*4 + 2*5 + 3*6 = 32)
Why it matters: the simplest neuron computes a dot product between its inputs and its weights, adds a bias, applies a nonlinearity. Every neural network is built up from this.
Matrix multiplication¶
Treat a matrix as a stack of row vectors (or column vectors). Matrix multiplication A @ B:
- The entry at row i, column j of A @ B is the dot product of row i of A and column j of B.
Shape rule: (m, k) @ (k, n) = (m, n). The inner dimensions match; the outer dimensions become the result's shape.
A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
B = torch.tensor([[5.0, 6.0],
                  [7.0, 8.0]])
print(A @ B)
# tensor([[19., 22.],
# [43., 50.]])
# 1*5 + 2*7 = 19, 1*6 + 2*8 = 22, etc.
Why it matters: an entire neural network layer is output = input @ weights + bias. Matmul is what GPUs are designed to accelerate; everything else is supporting infrastructure.
Transpose¶
Swap rows and columns:
A = torch.tensor([[1, 2, 3], [4, 5, 6]]) # shape (2, 3)
print(A.T) # shape (3, 2)
# tensor([[1, 4],
# [2, 5],
# [3, 6]])
Often used to make shapes line up for matrix multiplication.
A neuron¶
A single artificial neuron computes:

output = activation(dot(input, weights) + bias)

- input and weights are vectors of the same length.
- bias is a single number.
- activation is a nonlinear function (relu, sigmoid, tanh, etc.).
A layer of n neurons is just n of these stacked - equivalent to one big matmul:
batch_size = 4
input_dim = 10
output_dim = 5
x = torch.randn(batch_size, input_dim) # (4, 10)
W = torch.randn(input_dim, output_dim) # (10, 5)
b = torch.randn(output_dim) # (5,)
out = x @ W + b # (4, 5)
x @ W is (4, 10) @ (10, 5) → (4, 5). The bias b broadcasts across the batch.
Welcome - that's what a dense layer (also called a "linear layer" or "fully-connected layer") does. Everything else is variations.
Nonlinearity¶
Without a nonlinearity between layers, stacking matmuls collapses to one matmul (matrix multiplication is linear). A non-linear function applied element-wise restores the network's power:
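You can watch the collapse happen - without an activation, two stacked matmuls equal one (a small sketch with random weights):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 10)
W1 = torch.randn(10, 8)
W2 = torch.randn(8, 5)

linear_stack = (x @ W1) @ W2   # two "layers", no activation...
collapsed = x @ (W1 @ W2)      # ...equal a single matmul
print(torch.allclose(linear_stack, collapsed, atol=1e-4, rtol=1e-4))  # True

nonlinear = F.relu(x @ W1) @ W2  # ReLU in between: no longer collapsible
```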
Common nonlinearities:
- ReLU - max(0, x). Cheap, effective, the default for hidden layers.
- GELU - smoother ReLU. Used heavily in transformers.
- Sigmoid - 1 / (1 + exp(-x)). Outputs in (0, 1). Used for binary classification outputs.
- Softmax - normalizes a vector to sum to 1. Used for multi-class classification outputs.
You'll mostly use ReLU or GELU in hidden layers; softmax in output.
Gradient (intuitively)¶
A gradient is "the slope of a function at a point, in N dimensions." For a single-variable function f(x), the gradient is the derivative f'(x). For a multi-variable function L(w₁, w₂, ..., wₙ), it's a vector - one partial derivative per variable.
Why it matters: training a network is "minimize the loss function." The gradient of the loss with respect to the weights tells you "if I nudge each weight in the direction opposite the gradient, the loss decreases." That's gradient descent.
You don't compute gradients by hand. PyTorch's autograd does it for you - every operation you do on tensors with requires_grad=True is tracked, and .backward() walks the graph computing gradients automatically.
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward() # computes dy/dx
print(x.grad) # 2*x + 3 = 7
Page 05 uses this in a training loop. For now, just know: gradients let you adjust weights to reduce loss.
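To preview why this matters, here is gradient descent on a one-parameter toy "loss" (a hypothetical example, not the page-05 loop):

```python
import torch

w = torch.tensor([0.0], requires_grad=True)
for _ in range(50):
    loss = (w - 3.0) ** 2     # minimized at w = 3
    loss.backward()           # compute d(loss)/dw
    with torch.no_grad():
        w -= 0.1 * w.grad     # step opposite the gradient
        w.grad.zero_()        # reset for the next iteration
print(w.item())               # close to 3.0
```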
Vectors in geometry vs in ML¶
In math classes, vectors had geometric meaning - points, directions, magnitudes. In ML, a vector is just a list of features. A user's embedding might be 1536 numbers - no geometric interpretation, but the "directions" still capture meaningful similarities (cosine of the angle between two user embeddings = how similar they are in the model's learned space).
The math is the same - the interpretation is "feature-space similarity," not "physical space."
Cosine similarity¶
The dot product of two normalized vectors (each with length 1):
import torch.nn.functional as F
a = torch.randn(100)
b = torch.randn(100)
sim = F.cosine_similarity(a, b, dim=0)
print(sim) # between -1 and 1
A standard "how similar are these two embeddings" metric. Used heavily in RAG (page 10).
What you'll never need from a math course¶
- Eigenvalues / eigenvectors (occasionally relevant; not for daily work).
- Singular Value Decomposition (used in LoRA fine-tuning page 09; we'll cover what you need).
- Convex analysis. Calculus of variations. Differential geometry.
Don't get nerve-sniped by Twitter saying you need to "understand linear algebra before doing ML." You need the operations on this page. The rest is for research, not engineering.
Exercise¶
- Dot product: create two random vectors of length 100. Compute their dot product manually (loop with sum) AND with torch.dot. Verify they match.
- Matmul shape check: create A of shape (3, 5) and B of shape (5, 7). What's the shape of A @ B? Verify in code.
- A neuron from scratch: compute output = torch.relu(torch.dot(input, weights) + bias) for small vectors of your choice. What's the value? Why?
- Batch: create X of shape (8, 3) (a batch of 8 inputs, each 3-dim). Create W of shape (3, 5). Compute X @ W and inspect the shape. What does each row represent?
- Gradient: define f(x) = x³ - 4x² + 7x - 1. At x = 2.0, compute the gradient using PyTorch. (Hint: compute y = f(x) with requires_grad=True on x, call y.backward(), then read x.grad. The math answer is 3x² - 8x + 7 = 3 at x = 2.)
What you might wonder¶
"I see lots of torch.bmm in code. What's that?"
Batched matrix multiplication - when you have a batch dimension. bmm is (B, m, k) @ (B, k, n) → (B, m, n). Common in transformers' attention.
"What's torch.einsum?"
Einstein summation notation - a powerful, terse way to express tensor operations. torch.einsum("ij,jk->ik", A, B) is matmul. Worth learning once you've seen the same matmul pattern enough times.
"How does a network 'know' which way to adjust weights?" The gradient gives the direction of steepest increase. Going opposite the gradient (gradient descent) decreases the loss locally. That's all. The magic is that this simple rule works in millions of dimensions.
Done¶
- Dot product, matrix multiplication, transpose.
- A single neuron and a dense layer.
- Common nonlinearities.
- What a gradient is and why it matters.
- Recognizing what's NOT essential math for ML engineering.
Next: Your first neural network →
04 - Your First Neural Network¶
What this session is¶
About 45 minutes. Build a small multi-layer perceptron (MLP) in PyTorch - the simplest interesting neural network. You'll see how nn.Module, layers, forward pass, and autograd compose.
The plan¶
We'll build a network that classifies handwritten digits (MNIST - the "hello world" of ML). 28×28 grayscale images → one of 10 digit classes.
We won't train it yet (page 05 covers training). This page is about defining the model.
nn.Module: PyTorch's central abstraction¶
Every model in PyTorch is a class extending torch.nn.Module. Define your layers in __init__; define how data flows through them in forward.
import torch
import torch.nn as nn
import torch.nn.functional as F
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)  # input layer: 784 → 128
        self.fc2 = nn.Linear(128, 64)   # hidden: 128 → 64
        self.fc3 = nn.Linear(64, 10)    # output: 64 → 10

    def forward(self, x):
        # x is shape (batch_size, 784)
        x = F.relu(self.fc1(x))  # → (batch, 128)
        x = F.relu(self.fc2(x))  # → (batch, 64)
        x = self.fc3(x)          # → (batch, 10) - raw logits
        return x
model = MLP()
print(model)
Run this. You'll see:
MLP(
(fc1): Linear(in_features=784, out_features=128, bias=True)
(fc2): Linear(in_features=128, out_features=64, bias=True)
(fc3): Linear(in_features=64, out_features=10, bias=True)
)
PyTorch auto-prints the architecture.
What each line does¶
nn.Linear(in_features, out_features) - a fully-connected layer. Internally: a weight matrix (stored with shape (out_features, in_features)) and a bias vector. Calling it with input x computes x @ W.T + b - a linear map from in_features to out_features.
F.relu(x) - element-wise ReLU activation. max(0, x). Adds non-linearity (page 03).
The last layer produces logits - raw scores, one per class. We don't apply softmax here; the loss function (page 05) does it more numerically stably.
Run a forward pass¶
x = torch.randn(4, 784) # batch of 4 random "images"
out = model(x)
print(out.shape) # torch.Size([4, 10])
print(out[0]) # 10 logits for the first sample
model(x) calls forward(x) under the hood. The output is (batch_size, num_classes) - 10 logits per input.
To turn logits into probabilities:
To get the predicted class (argmax over the classes):
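Both steps, as a minimal sketch (random logits stand in for the model output):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)        # stand-in for model(x) output
probs = F.softmax(logits, dim=1)   # each row now sums to 1
preds = probs.argmax(dim=1)        # predicted class index per sample
print(probs.sum(dim=1))            # four values, each ~1.0
print(preds.shape)                 # torch.Size([4])
```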
The model is randomly initialized - predictions are garbage. Training (page 05) fixes that.
Counting parameters¶
.parameters() yields all the learnable tensors (weights + biases). For this MLP:
- fc1: 784 × 128 weights + 128 biases = 100,480.
- fc2: 128 × 64 + 64 = 8,256.
- fc3: 64 × 10 + 10 = 650.
- Total: 109,386.
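You can verify that arithmetic with .numel():

```python
import torch.nn as nn

# Same layer sizes as the MLP above
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
total = sum(p.numel() for p in model.parameters())
print(total)  # 109386
```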
For comparison: GPT-2 (small) is 124M params. GPT-3 is 175B. Modern open-weight LLMs run up to 70B and beyond (e.g., Llama 3 70B). Parameter count is one rough proxy for model capability (and one direct proxy for memory cost).
A more compact form: nn.Sequential¶
For simple feed-forward stacks:
model = nn.Sequential(
nn.Linear(784, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 10),
)
Layers in order. Use Sequential for "just call these in sequence" cases. Use the class form when forward needs anything more than that (skip connections, conditional flow).
Activations as modules vs functions¶
Two ways to apply ReLU:
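Side by side, the two forms produce identical results:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(5)
as_module = nn.ReLU()(x)   # module form: an object you call
as_func = F.relu(x)        # functional form: a plain function
print(torch.equal(as_module, as_func))  # True
```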
Both work. Functions are cleaner inside forward. Modules are required inside Sequential. Use whichever fits.
Initialization¶
PyTorch initializes weights with sensible defaults (Kaiming uniform for linear layers). For most cases this is fine. To override:
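A sketch of overriding via torch.nn.init - Xavier here is just an example choice, not a recommendation:

```python
import torch.nn as nn

layer = nn.Linear(784, 128)
nn.init.xavier_uniform_(layer.weight)  # replace the Kaiming default, in place
nn.init.zeros_(layer.bias)             # zero the biases
```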
Initialization matters for very deep networks; modern architectures (with normalization layers) are robust enough that the default usually works.
Move to GPU¶
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
x = torch.randn(4, 784).to(device)
out = model(x)
.to(device) moves all parameters. After that, your input must also be on the same device.
Save and load¶
# Save just the parameters
torch.save(model.state_dict(), "mlp.pt")
# Load into the same architecture
model2 = MLP()
model2.load_state_dict(torch.load("mlp.pt"))
model2.eval()
state_dict() returns a dict of {name: tensor} for every parameter. Saving the state dict (not the whole model object) is the recommended pattern - portable across code changes.
.eval() switches the model into evaluation mode (matters for layers like dropout and batch norm that behave differently during training vs inference).
Common architectures (very brief preview)¶
The MLP is the simplest. You'll meet others:
- CNN (Convolutional Neural Network) - for images. Layers detect local patterns (edges, textures) at multiple scales.
- RNN / LSTM - for sequences (older approach). Largely replaced by transformers.
- Transformer - attention-based, the modern default for language and increasingly vision. Page 07 covers it.
For this beginner path: MLPs for the early pages; transformers for the LLM pages.
Exercise¶
Create mlp_demo.py:
import torch
import torch.nn as nn
import torch.nn.functional as F
class MLP(nn.Module):
    def __init__(self, input_dim=784, hidden=128, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("device:", device)
    model = MLP().to(device)
    print(model)
    total = sum(p.numel() for p in model.parameters())
    print(f"params: {total:,}")
    # Forward pass on a fake batch
    x = torch.randn(8, 784, device=device)
    logits = model(x)
    print("logits shape:", logits.shape)
    probs = F.softmax(logits, dim=1)
    print("sample probs:", probs[0])
    print("sample sum (should be ~1):", probs[0].sum().item())

if __name__ == "__main__":
    main()
Run it. Verify everything makes sense.
Stretch: rewrite the model as nn.Sequential. Same output, fewer lines.
Bigger stretch: add a dropout layer (nn.Dropout(p=0.2)) between the hidden layers. Dropout randomly zeros some activations during training; an effective regularizer. Print the model to see the new architecture.
What you might wonder¶
"What's super().__init__() for?"
Calls nn.Module's constructor - sets up internal bookkeeping so .parameters() and .to() work on your model. Always call it first in __init__.
"Why F.relu vs nn.ReLU?"
Same thing. Functional form is shorter inside forward; module form composes with Sequential and is registered as a child module (so it shows in print(model)).
"What's a 'logit'?" A pre-softmax output. The raw score the model assigns to each class. The class with the largest logit is the prediction. Logits aren't probabilities; softmax converts them.
"Should I worry about backpropagation math?" No - PyTorch's autograd handles it. You define the forward pass; gradients are computed automatically. Page 05 shows the loop.
Done¶
- Define a model by extending nn.Module.
- Use nn.Linear, F.relu, nn.Sequential.
- Run a forward pass.
- Count parameters.
- Save and load state_dict.
- Move models to GPU.
05 - Training Loop¶
What this session is¶
About an hour. Train the MLP from page 04 to recognize MNIST digits. By the end you'll have written a full training loop - the same shape as every PyTorch training loop in existence.
The pattern¶
Every training loop is:
For each epoch (pass over the data):
For each batch:
1. Forward pass - compute predictions
2. Compute loss - how wrong are we?
3. Backward pass - compute gradients
4. Optimizer step - adjust weights
That's it. The rest is bookkeeping.
Load MNIST¶
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
transform = transforms.Compose([
transforms.ToTensor(), # PIL image → tensor
transforms.Normalize((0.1307,), (0.3081,)), # standardize: (x - mean) / std
])
train_ds = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root="./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=512)
Three pieces:
- Dataset - knows how to load and transform one example.
- DataLoader - wraps the dataset, batches it, optionally shuffles.
- Transform - preprocessing applied to each example.
torchvision provides MNIST out of the box. First run downloads ~10MB; subsequent runs use the cached copy.
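The Dataset contract is just two methods: __len__ and __getitem__. A minimal hand-rolled dataset over fake tensors (a sketch to make the contract concrete - not how torchvision implements MNIST):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FakeDigits(Dataset):
    """Toy stand-in for MNIST: 100 random 'images' with random labels."""
    def __init__(self, n=100):
        self.images = torch.randn(n, 1, 28, 28)
        self.labels = torch.randint(0, 10, (n,))

    def __len__(self):
        return len(self.labels)                    # how many examples exist

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]  # one (x, y) pair

ds = FakeDigits()
loader = DataLoader(ds, batch_size=32, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([32, 1, 28, 28]) torch.Size([32])
```

DataLoader does the batching and shuffling; your Dataset only ever returns one example at a time.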
The full training script¶
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# Reproducibility
torch.manual_seed(42)
# Device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"device: {device}")
# Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,)),
])
train_ds = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root="./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=512)
# Model
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(28 * 28, 128)
self.fc2 = nn.Linear(128, 64)
self.fc3 = nn.Linear(64, 10)
def forward(self, x):
x = x.view(x.size(0), -1) # flatten 28x28 → 784
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x)
model = MLP().to(device)
# Loss + optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Train
for epoch in range(3):
model.train()
total_loss = 0
correct = 0
n = 0
for x, y in train_loader:
x, y = x.to(device), y.to(device)
# 1. Forward
logits = model(x)
# 2. Loss
loss = criterion(logits, y)
# 3. Backward
optimizer.zero_grad() # clear gradients from last step
loss.backward() # compute gradients
# 4. Optimizer step
optimizer.step()
total_loss += loss.item() * x.size(0)
correct += (logits.argmax(dim=1) == y).sum().item()
n += x.size(0)
print(f"epoch {epoch}: train loss {total_loss/n:.4f}, acc {correct/n:.4f}")
# Test
model.eval()
correct = 0
n = 0
with torch.no_grad():
for x, y in test_loader:
x, y = x.to(device), y.to(device)
logits = model(x)
correct += (logits.argmax(dim=1) == y).sum().item()
n += x.size(0)
print(f"test accuracy: {correct/n:.4f}")
Run. After ~30 seconds (on CPU) or ~5 seconds (on GPU), you should see ~97% test accuracy. Your first trained model.
What each line is doing¶
x.view(x.size(0), -1) - flatten the 28x28 images into 784-length vectors. The -1 infers the dimension. x.size(0) is the batch dimension.
nn.CrossEntropyLoss - standard loss for classification. Internally: softmax + negative log-likelihood. Stable and standard.
optimizer = torch.optim.Adam(...) - the optimizer. Adam is the most-used optimizer for modern ML. Other options: SGD (stochastic gradient descent - classic but needs more tuning), AdamW (Adam with corrected weight decay).
optimizer.zero_grad() - clear gradients from the last batch. PyTorch accumulates gradients by default; you must clear them explicitly each step. Forget this and gradients from previous batches pile up, corrupting every update.
loss.backward() - autograd walks the computation graph backward, computing gradients for every parameter that participated in computing loss. Stores them in parameter.grad.
optimizer.step() - applies the gradient update. For Adam: complex math; for SGD: param = param - lr * param.grad.
model.train() vs model.eval() - switches the model's internal mode. Affects layers like dropout and batch norm that behave differently during training vs inference. Always set the right mode.
with torch.no_grad(): - disables autograd. Inference doesn't need gradients; this skips the bookkeeping and uses less memory.
What the loss tells you¶
Training loss going down = model is fitting the training data. Plateauing means we've hit the model's capacity (or the optimizer is stuck - try different hyperparams).
Test accuracy is what you actually care about - performance on data the model didn't see during training. If train loss keeps dropping but test accuracy stops improving, you're overfitting - memorizing the training data.
Mitigations: more data, regularization (dropout, weight decay), smaller model, early stopping.
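Early stopping is simple enough to sketch: track the best validation loss and stop once it hasn't improved for patience epochs. A toy illustration with made-up per-epoch losses (not part of the training script above):

```python
def stop_epoch(val_losses, patience=2):
    """Return the epoch at which early stopping would halt training."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0       # improvement: reset the counter
        else:
            bad_epochs += 1      # no improvement this epoch
            if bad_epochs >= patience:
                return epoch     # patience exhausted: stop here
    return len(val_losses) - 1   # never triggered

# Validation loss improves, then creeps back up (overfitting begins):
losses = [0.9, 0.5, 0.4, 0.41, 0.45, 0.5]
print(stop_epoch(losses))  # 4
```

In a real loop you'd also save a checkpoint each time the best loss improves, then restore it after stopping.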
Hyperparameters¶
The numbers you set that aren't learned:
- Learning rate (lr=1e-3) - how big a step the optimizer takes. Too high → unstable. Too low → slow. 1e-3 is a great starting point for Adam.
- Batch size (64) - larger = smoother gradients, more memory; smaller = noisier gradients, sometimes generalizes better.
- Epochs - how many passes over the data. More = better fit (until overfitting).
- Architecture - depth, width, normalizations.
These need tuning. For MNIST, defaults work. For real problems, expect to iterate.
Save the model¶
torch.save(model.state_dict(), "mnist_mlp.pt")
Load later:
model = MLP()
model.load_state_dict(torch.load("mnist_mlp.pt"))
model.eval()
What this scales to¶
The training loop pattern above is identical for huge models - the only thing that changes is the model definition. Add a few wrinkles for big-model training (mixed precision, gradient accumulation, distributed, checkpointing) and you have what a real LLM training script looks like.
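One of those wrinkles, gradient accumulation, is worth seeing once: run several small batches per optimizer step, so the effective batch size grows without the memory cost. A sketch on a toy model with random data (the model and numbers are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                  # stand-in for a big model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
accum_steps = 4                           # effective batch = 4 x 8 = 32

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(8, 10)                # a small batch that fits in memory
    y = torch.randint(0, 2, (8,))
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                       # gradients ACCUMULATE in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                  # one update per accum_steps batches
        optimizer.zero_grad()
print("final micro-batch loss:", loss.item())
```

This is the same "PyTorch accumulates gradients" behavior that zero_grad() normally fights - here it's used on purpose.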
Exercise¶
1. Run the script. Get ~97% test accuracy.
2. Tweak hyperparameters:
   - Set lr=1e-2. Does it train? (Probably loss goes NaN - too high.)
   - Set lr=1e-5. (Trains but slowly.)
   - Increase epochs to 10. Does test accuracy improve? Plateau? Degrade (overfit)?
3. Architecture changes:
   - Add a third hidden layer of size 128.
   - Increase hidden sizes to 256, 128.
   - Watch parameter count + accuracy change.
4. Visualize: plot the per-epoch training loss using matplotlib. (Or use TensorBoard - pip install tensorboard, then tensorboard --logdir runs/.)
5. Stretch: instead of an MLP, try a small CNN. A 2-layer CNN beats this MLP at >98% accuracy. Look up nn.Conv2d. Don't worry if it doesn't work first try - convolutions take some adjusting.
What you might wonder¶
"Why Adam over SGD?" Adam adapts the learning rate per-parameter; converges faster on most problems without tuning. SGD with momentum can outperform on specific architectures (vision CNNs) when carefully tuned. For getting started: Adam.
"How big should my batch be?" For GPU work, "as big as fits in memory" is a common heuristic. Common sizes: 32, 64, 128, 256. Larger batches give smoother gradients but you might need more epochs to converge.
"What does loss.item() do?"
Extracts a Python float from a 0-dim tensor. Detached from the graph (no gradient tracking). Use when you want a number for logging.
"Why is my loss not going down?" Common causes: - Learning rate too high (loss NaN) or too low (loss flat). - Wrong loss function for your task. - Bug in the model (wrong shapes - print them). - Data not normalized.
Print loss every step initially. If it's not going down within ~100 steps on MNIST, something's wrong.
Done¶
- The four-step training loop pattern.
- DataLoader for batched iteration.
- nn.CrossEntropyLoss + Adam optimizer.
- optimizer.zero_grad() / loss.backward() / optimizer.step().
- model.train() vs model.eval().
- Save/load weights.
06 - Inference and Saving¶
What this session is¶
About 30 minutes. The other half of training - using a trained model. Loading saved weights, running predictions, the eval-mode + no-grad pattern.
Inference vs training¶
Training: forward pass + loss + backward pass + optimizer step. Slow, memory-hungry, uses gradients.
Inference: forward pass only. Fast, cheap, no gradients.
The difference matters because most of a model's lifetime is inference - users sending requests; you predicting. Optimizing inference is its own discipline (page 12).
The basic pattern¶
import torch
model = MLP() # the same class you trained with
model.load_state_dict(torch.load("mnist_mlp.pt"))
model.eval() # IMPORTANT: switch to eval mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# An input
x = torch.randn(1, 1, 28, 28).to(device) # one fake "image"
with torch.no_grad(): # IMPORTANT: skip gradient tracking
logits = model(x)
probs = torch.softmax(logits, dim=1)
pred = logits.argmax(dim=1)
print(f"predicted: {pred.item()}, confidence: {probs.max().item():.4f}")
Three things you must remember:
1. Load the weights - load_state_dict into a freshly-constructed model with the same architecture.
2. model.eval() - disables dropout, freezes batch norm running statistics.
3. with torch.no_grad(): - disables gradient tracking. Faster, uses less memory.
Forgetting any of these gives subtle bugs.
Predict on a real image¶
from PIL import Image
from torchvision import transforms
# The same transform you used during training
transform = transforms.Compose([
transforms.Grayscale(),
transforms.Resize((28, 28)),
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,)),
])
img = Image.open("my_digit.png")
x = transform(img).unsqueeze(0).to(device) # add batch dim → (1, 1, 28, 28)
with torch.no_grad():
logits = model(x)
pred = logits.argmax(dim=1).item()
print(f"predicted digit: {pred}")
Key point: the inference preprocessing must match training. Same resize, same normalize, same color space. Mismatched preprocessing is the #1 silent-bug source in ML - the model still produces a prediction, just a bad one.
Batching for speed¶
If you have many inputs, predict on them in batches - much faster than one-at-a-time:
images = [transform(Image.open(p)) for p in paths]
batch = torch.stack(images).to(device) # (N, 1, 28, 28)
with torch.no_grad():
logits = model(batch)
preds = logits.argmax(dim=1)
for path, p in zip(paths, preds):
print(path, p.item())
Batches let the GPU keep busy. For latency-sensitive online inference, you might still process single inputs; for throughput-sensitive batch jobs, batch as much as memory allows.
Save more than weights¶
state_dict() is just the parameters. For a fully-recoverable training session, save more:
torch.save({
"epoch": epoch,
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
"loss": loss,
}, "checkpoint.pt")
# Resume:
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
start_epoch = ckpt["epoch"] + 1
For production deployment, save just the model weights. For "I want to resume training tomorrow," save the full checkpoint.
TorchScript and ONNX (briefly)¶
For shipping models, two portability formats:
- TorchScript - torch.jit.script(model) or torch.jit.trace(model, example_input) produces a deployment-friendly version. Can run without Python.
- ONNX - open standard. torch.onnx.export(model, ...) produces a file readable by many runtimes (ONNX Runtime, TensorRT, browsers).
Beyond beginner scope; mentioned because deployment paths sometimes need them.
For most cases, deploying a PyTorch model directly (page 12) is fine.
Inference performance: the gotchas¶
A few things that catch people:
- First inference is slow. PyTorch JIT-compiles kernels on first use. Warm up with a dummy forward pass before timing.
- .cuda() / .to(device) is async. GPU operations are queued. To time them, call torch.cuda.synchronize() before reading the clock.
- with torch.no_grad(): matters even for small inferences. Saves memory; can be 2x faster.
- torch.set_num_threads(1) for CPU inference can speed up small models by avoiding thread overhead.
A complete example¶
"""
Load a trained MNIST MLP and predict on a single image.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(28 * 28, 128)
self.fc2 = nn.Linear(128, 64)
self.fc3 = nn.Linear(64, 10)
def forward(self, x):
x = x.view(x.size(0), -1)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x)
def main(image_path: str):
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP().to(device)
model.load_state_dict(torch.load("mnist_mlp.pt", map_location=device))
model.eval()
transform = transforms.Compose([
transforms.Grayscale(),
transforms.Resize((28, 28)),
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,)),
])
img = Image.open(image_path)
x = transform(img).unsqueeze(0).to(device)
with torch.no_grad():
logits = model(x)
probs = torch.softmax(logits, dim=1)
pred = logits.argmax(dim=1).item()
confidence = probs.max().item()
print(f"predicted: {pred} (confidence: {confidence:.4f})")
if __name__ == "__main__":
import sys
main(sys.argv[1])
Run: python infer.py my_digit.png.
Exercise¶
1. Train a model from page 05. Save its weights as mnist_mlp.pt.
2. Write infer.py above. Load the weights. Get a digit image (download one from Google or draw one in Paint, save as PNG). Run prediction.
3. Measure speed:
import time

for _ in range(3):  # warm up
    with torch.no_grad():
        model(x)
torch.cuda.synchronize() if device == "cuda" else None
t0 = time.time()
for _ in range(1000):
    with torch.no_grad():
        _ = model(x)
torch.cuda.synchronize() if device == "cuda" else None
print(f"{(time.time() - t0) * 1000 / 1000:.3f} ms / inference")
4. Stretch: load multiple images at once into one batch. Time forward on the batch vs looping single-images. The batch is much faster per-image - that's why batching matters.
What you might wonder¶
"Why map_location=device in torch.load?"
Loads tensors directly to the target device. Without it, PyTorch tries to load to the device they were saved on, which fails if you trained on GPU but are inferring on CPU.
"What's torch.compile?"
PyTorch 2's JIT compiler. model = torch.compile(model) can give 1.5x-3x speedup. Sometimes flaky; experiment carefully. Mentioned for awareness.
"Should I use .half() or .bfloat16() for inference?"
For modern GPUs (Volta and newer), yes - half-precision inference is ~2x faster with negligible quality drop for most models. model = model.half() then x = x.half(). Test accuracy afterward; some models tolerate half-precision better than others.
"What about quantization (INT8, INT4)?" Even more aggressive than half-precision. Used heavily for LLM inference (page 12). Beyond beginner scope on this page.
Done¶
- Load weights into a model architecture.
- Use model.eval() + with torch.no_grad():.
- Preprocess inference inputs the same way as training inputs.
- Batch inference for throughput.
- Save/load full training checkpoints.
Next: Transformers and tokenization →
07 - Transformers and Tokenization¶
What this session is¶
About an hour. What an LLM actually does - at the level needed to use, fine-tune, and serve them. The architecture, the tokenization step that confuses everyone, the autoregressive generation loop.
This page is dense. Plan to re-read.
The big picture¶
A language model:
1. Takes a sequence of tokens (integer IDs representing chunks of text).
2. Predicts a probability distribution over the next token.
3. You sample one. Append it. Repeat.
That's it. The clever part - what makes LLMs work - is the architecture that produces the next-token prediction. That architecture is the transformer.
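The predict-sample-append loop is tiny. Here it is with a toy stand-in for the model - a function returning made-up next-token probabilities over a six-word vocabulary - purely to show the loop's shape; in real generation, a transformer computes step 1:

```python
import random

random.seed(0)  # deterministic sampling for this demo

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_model(tokens):
    """Stand-in for a transformer: returns a probability per VOCAB entry.
    A real model computes this distribution from the whole sequence."""
    weights = [1.0] * len(VOCAB)
    if tokens and tokens[-1] == "the":
        weights[VOCAB.index("cat")] = 5.0  # after 'the', 'cat' becomes likely
    total = sum(weights)
    return [w / total for w in weights]

tokens = ["the"]                 # the tokenized prompt
for _ in range(5):               # generate 5 new tokens
    probs = toy_model(tokens)    # 1. predict next-token distribution
    next_tok = random.choices(VOCAB, weights=probs)[0]  # 2. sample one
    tokens.append(next_tok)      # 3. append it; repeat
print(" ".join(tokens))
```

Everything model.generate does later on this page is this loop plus smarter sampling and stopping rules.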
Tokenization¶
Text → token IDs. The model never sees characters or words; it sees integers.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, world!"
ids = tok.encode(text)
print(ids)
# [15496, 11, 995, 0]
print([tok.decode([i]) for i in ids])
# ['Hello', ',', ' world', '!']
Each token is roughly a "subword piece." Common words → single token. Rare words → multiple. The tokenizer learned its vocabulary during the model's training; you can't change it.
Why subwords: vocabulary size matters. Word-level vocabulary needs hundreds of thousands of entries (with new ones constantly appearing). Character-level produces very long sequences. Subword (BPE, WordPiece, SentencePiece) is the compromise: 32k-256k tokens covering nearly any text.
Practical implications:
- Token counts are not word counts. "I am happy" = 3 tokens, "antidisestablishmentarianism" = many.
- Prices are per-token; latency is per-token.
- Different models have different tokenizers; same text → different token counts.
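To see why common words stay whole and rare ones split, here's a toy greedy longest-match tokenizer with a hand-picked (entirely hypothetical) vocabulary. Real BPE tokenizers learn their merges from data, but the matching behavior looks similar:

```python
VOCAB = {"happy", "un", "ness", "happi", "token", "izer", "s"}

def tokenize(word, vocab):
    """Greedy longest-match subword split; single chars as a fallback."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):     # try the longest piece first
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

print(tokenize("happy", VOCAB))        # ['happy'] - common word, one token
print(tokenize("unhappiness", VOCAB))  # ['un', 'happi', 'ness'] - rare word splits
print(tokenize("tokenizers", VOCAB))   # ['token', 'izer', 's']
```

The single-character fallback is why any text can be tokenized, even strings the tokenizer has never seen.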
What a transformer does¶
Inside the model, each token ID becomes an embedding - a vector of ~768 to ~12000 dimensions (depending on the model).
A transformer layer processes a sequence of these vectors and produces an updated sequence of the same shape. The crucial operation is attention - every output position is a weighted sum of all input positions (and itself), where the weights are computed from the inputs themselves.
This lets the model "look at" other parts of the sequence when generating each output. Long-range dependencies (across thousands of tokens) become tractable.
Stack many such layers (~12-100), add positional encodings so the model knows token order, end with a linear projection back to the vocabulary, apply softmax - you have the next-token distribution.
The full math is in the AI Systems senior path, Deep Dive 07. For now, the operational view: a transformer transforms a sequence of token embeddings into a probability distribution over the next token.
Causal (autoregressive) vs masked¶
Two families:
- Causal / autoregressive models (GPT-family, Llama, Mistral, Gemma) - each position attends only to positions before it. Generates left-to-right. Used for language generation.
- Masked models (BERT-family) - every position attends to every other. Used for understanding (classification, NER, embeddings).
If you're working with chatbots, code completion, RAG - causal. Embedding for retrieval - masked. Many modern open-source LLMs are causal (the GPT-style architecture won).
The generation loop¶
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
prompt = "The quick brown fox"
input_ids = tok.encode(prompt, return_tensors="pt")
with torch.no_grad():
output = model.generate(
input_ids,
max_new_tokens=30,
do_sample=True,
temperature=0.8,
top_p=0.9,
)
print(tok.decode(output[0]))
What's happening:
1. Tokenize the prompt into IDs.
2. Call model.generate(...) - a wrapped helper around the basic predict-and-append loop.
3. The model generates 30 more tokens, sampling from each next-token distribution.
4. Decode the resulting IDs back to text.
Sampling parameters:
- temperature - how peaked the distribution is. 0 = pick the most likely (greedy). Higher = more diverse, less predictable. 0.7-1.0 typical.
- top_p (nucleus) - only consider tokens whose cumulative probability is up to p. Avoids low-probability "weird" tokens.
- top_k - only consider the k most-likely. Cruder but works.
- max_new_tokens - when to stop.
- stop - explicit stop strings.
Different sampling strategies → different output styles. Greedy is deterministic but often repetitive. Top-p sampling is the modern default.
What "small" and "big" mean¶
Some calibration:
- GPT-2 small: 124M params. Fits in 500MB. Runs on a laptop.
- Llama 3 8B: 8 billion. 16GB at FP16. Single high-end GPU.
- Llama 3 70B: 70 billion. 140GB at FP16. Multiple GPUs (or quantized down to 4-bit for a single ~40GB GPU).
- GPT-4-class frontier models: ~hundreds of billions to trillions (rumors; not public). Many GPUs.
The pattern: 10x bigger → noticeably smarter on hard tasks. Quality scales with parameters + training data + compute (the "scaling laws").
For learning, GPT-2 / Llama 3 8B (or smaller) suffice.
Context length¶
The maximum sequence length the model can attend over. GPT-2 was 1024. Llama 3 is 8192-128000+. Frontier models claim 200k+.
Two implications:
- Compute and memory cost are quadratic in context length (attention is O(n²)). 100k context is 10000x more attention compute than 1k.
- Practical context isn't the same as advertised context. A model trained on long contexts doesn't necessarily use the middle well - research papers ("Lost in the Middle") show degradation. RAG (page 10) mitigates this.
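The O(n²) claim is just counting: the attention score matrix has one entry per pair of positions.

```python
def attention_entries(n):
    """Entries in the n x n attention score matrix (per head, per layer)."""
    return n * n

short_ctx, long_ctx = 1_000, 100_000
print(f"{attention_entries(short_ctx):,}")  # 1,000,000
print(f"{attention_entries(long_ctx):,}")   # 10,000,000,000
print(attention_entries(long_ctx) // attention_entries(short_ctx))  # 10000
```

Growing the context 100x grows the pairwise-score work 100² = 10000x, which is where the quadratic cost comes from.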
Embeddings (preview)¶
Beyond next-token prediction, models can also produce embeddings - vector representations of pieces of text. These are useful beyond generation: semantic search, classification, clustering.
from sentence_transformers import SentenceTransformer
m = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["a cat sat on a mat", "the dog ran"]
embeddings = m.encode(texts) # shape (2, 384) for this model
Cosine similarity between two embeddings ≈ semantic similarity. Page 10 builds RAG on this.
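Cosine similarity itself is a one-line formula - dot product divided by the product of the vectors' lengths. Shown here on tiny hand-made vectors (real embeddings are just much longer):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat = [0.9, 0.1, 0.3]        # pretend embedding of "a cat sat on a mat"
kitten = [0.8, 0.2, 0.35]    # a semantically close sentence
invoice = [-0.1, 0.9, -0.5]  # an unrelated one

print(cosine(cat, kitten))   # close to 1.0
print(cosine(cat, invoice))  # much lower
```

Because it divides out the lengths, cosine similarity only compares direction - which is why embedding vectors are often normalized to unit length first.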
Exercise¶
1. Run the GPT-2 generation example above. Try different prompts. Vary temperature from 0.1 to 1.5. Note how it changes.
2. Inspect tokenization: encode a few strings with tok.encode and decode each ID individually (as in the tokenization example above). Notice how rare words and emoji become multiple tokens.
3. Greedy decoding: call model.generate with do_sample=False. Run twice with the same prompt. Output is identical (deterministic). Then with do_sample=True, different each time.
4. (Stretch - GPU helpful): try a small open-source LLM. Hugging Face Hub: search gpt2-medium, microsoft/phi-2, meta-llama/Llama-3.2-1B. (Llama gates require accepting a license on HF.) Load and generate. Note the quality difference.
What you might wonder¶
"Why is the same word sometimes one token and sometimes two?" Subword tokenizers split based on frequency in their training data. " happy" (with leading space) and "happy" (without) are distinct tokens. Case matters too. Don't fight it; understand it.
"What's a 'chat model' vs a 'base model'?" Base models are trained on raw text. Chat models are fine-tuned (page 09) with conversation data + safety training. For "ask a question, get an answer" use chat models. For raw text completion or further fine-tuning, base models.
"What's the actual difference between GPT-2 and modern LLMs architecturally?" Mostly: more parameters, more training data, more compute. Architectural tweaks (rotary positional encoding, grouped-query attention, SwiGLU activations, RMSNorm) are real but modest. The scaling matters more than the architecture changes.
"Should I implement attention from scratch?" Once, for understanding. Andrej Karpathy's "Let's build GPT" video walks you through it. Then use library implementations for production.
Done¶
- Understand the token → embedding → transformer → next-token-prob pipeline.
- Use a tokenizer; understand subword units.
- Generate text with sampling parameters.
- Know causal vs masked models.
- Have a calibration of model sizes.
Next: Hugging Face Transformers →
08 - Hugging Face Transformers¶
What this session is¶
About 30 minutes. Hugging Face is the GitHub of AI models. The transformers library makes using thousands of pre-trained models a 3-line operation.
The library¶
(You did this in page 01.) The library provides three main classes you'll use:
- AutoTokenizer - load any model's tokenizer.
- AutoModel / AutoModelForCausalLM / AutoModelForSequenceClassification / etc. - load a model. The AutoModelFor... variants add task-specific heads.
- pipeline - a high-level helper that combines tokenization + model + post-processing into one call.
The simplest possible usage: pipeline¶
from transformers import pipeline
# Text classification
clf = pipeline("sentiment-analysis")
print(clf("I love this!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]
print(clf("This is terrible."))
# [{'label': 'NEGATIVE', 'score': 0.9991}]
# Text generation
gen = pipeline("text-generation", model="gpt2")
print(gen("The future of AI is", max_new_tokens=20))
# Translation
trans = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
print(trans("Hello, how are you?"))
# Question answering
qa = pipeline("question-answering")
print(qa(question="Where do I live?", context="My name is Alice and I live in Lagos."))
# {'answer': 'Lagos', ...}
Each pipeline picks a default model, downloads it (first time), runs it. Useful for prototyping.
Browsing the Hub¶
huggingface.co hosts hundreds of thousands of models. Filter by task, language, license. Common model names you'll see:
- gpt2, gpt2-medium - small classical LLMs. Good for learning.
- microsoft/phi-3-mini-4k-instruct - small + capable + permissive license.
- meta-llama/Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B - Meta's open weights (gated; accept license).
- mistralai/Mistral-7B-v0.3 - open-source Mistral.
- google/gemma-2-2b - small Gemma.
- sentence-transformers/all-MiniLM-L6-v2 - tiny embedding model. Page 10.
- distilbert-base-uncased - small BERT-family for classification, embedding.
Each model page on the Hub has a README with usage, license, evaluation, intended use.
Direct usage (not via pipeline)¶
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "microsoft/phi-3-mini-4k-instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
prompt = "Write a haiku about garbage collection:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tok.decode(output[0], skip_special_tokens=True))
Key arguments:
- torch_dtype=torch.bfloat16 - load weights in bfloat16 instead of float32. Halves memory; minimal quality loss.
- device_map="auto" - automatically distribute layers across available devices (GPU + CPU fallback).
- return_tensors="pt" - tokenizer returns PyTorch tensors.
- skip_special_tokens=True - strip <eos>, <bos>, etc. from output.
Chat templates¶
Modern chat-tuned models expect a specific message format. The tokenizer knows it:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What's the capital of Nigeria?"},
]
inputs = tok.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
with torch.no_grad():
output = model.generate(inputs, max_new_tokens=50)
response_only = tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response_only)
apply_chat_template formats the messages with the model's expected special tokens (<|user|>, <|assistant|>, etc.). add_generation_prompt=True adds the assistant's turn-start so the model knows it's its turn to speak.
For chat-tuned models, always use the chat template. Raw prompt-completion produces worse results.
Embedding models¶
For semantic search (used in page 10):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["A dog is running.", "A cat is sleeping.", "I bought milk."]
embeddings = model.encode(texts, normalize_embeddings=True)  # unit-length vectors
print(embeddings.shape) # (3, 384) - three 384-dim vectors
# Compute similarities
import numpy as np
sim = embeddings @ embeddings.T # dot product of unit vectors = cosine similarity
print(sim) # diagonal is 1.0 (each vector with itself);
# off-diagonal lower for unrelated texts
sentence-transformers wraps Hugging Face models and handles the "pool tokens into a sentence vector" step.
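That pooling step is typically a masked mean over the transformer's per-token vectors. A sketch with fake hidden states and an attention mask (real pipelines take these from the model's last hidden state; shapes here are made up):

```python
import torch

# Fake "last hidden state": batch of 2 sentences, 4 token positions, 8 dims
hidden = torch.randn(2, 4, 8)
# Attention mask: sentence 1 has 4 real tokens, sentence 2 only 2 (rest padding)
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]], dtype=torch.float)

m = mask.unsqueeze(-1)            # (2, 4, 1), broadcasts over the 8 dims
summed = (hidden * m).sum(dim=1)  # add up only the real-token vectors
counts = m.sum(dim=1)             # how many real tokens per sentence
sentence_vecs = summed / counts   # masked mean: one vector per sentence
print(sentence_vecs.shape)        # torch.Size([2, 8])
```

The mask matters: without it, padding tokens would dilute the average for short sentences.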
Caching¶
By default, models download to ~/.cache/huggingface/. Big models (gigabytes) live here. To change the location, set the HF_HOME environment variable (e.g. export HF_HOME=/big/disk/hf-cache) before importing the library.
To pre-download a model without using it (useful in Docker):
from huggingface_hub import snapshot_download
snapshot_download(repo_id="microsoft/phi-3-mini-4k-instruct")
Quantized models¶
LLMs are huge. Loading FP16 needs gigabytes; FP32 needs 2x. Quantization reduces precision further:
- GPTQ / AWQ - 4-bit quantization, requires specific quantized weights.
- bitsandbytes - runtime 8-bit / 4-bit quantization for any model:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", quantization_config=bnb)
8B parameters at 4-bit ≈ 4GB. Fits on consumer GPUs.
Quality drops ~1-5% on benchmarks; for many tasks, indistinguishable. Used heavily in production inference (page 12).
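The memory figures above are back-of-envelope arithmetic: parameter count times bits per parameter (weights only - activations and the KV cache add more on top):

```python
def model_gb(params, bits):
    """Approximate weight memory in GB (ignores activations and KV cache)."""
    return params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"8B params @ {bits}-bit: {model_gb(8e9, bits):.0f} GB")
# 8B params @ 32-bit: 32 GB
# 8B params @ 16-bit: 16 GB
# 8B params @ 8-bit: 8 GB
# 8B params @ 4-bit: 4 GB
```

This is the same arithmetic behind the "how big a model can I run?" rule of thumb below: ~2 bytes per parameter at FP16, ~0.5 bytes at 4-bit.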
Exercise¶
1. Run the simplest pipeline: pipeline("sentiment-analysis") on a few sentences of your own.
2. Generate text with a small model: pipeline("text-generation", model="gpt2") with a few prompts.
3. Direct model usage with the chat-template form above. Use any chat-tuned model that fits your hardware. Try several prompts.
4. Embeddings: with sentence-transformers, encode 5 sentences (some related, some not). Compute the similarity matrix. Notice which pairs score high.
5. (Stretch - GPU helpful) Load Llama-3.2-1B (accept license on HF first; small enough for most setups). Compare its outputs to gpt2's.
What you might wonder¶
"How big a model can I run?" Rule of thumb (FP16): need ~2 bytes per parameter for inference. 1B params = 2GB. 7B = 14GB. 70B = 140GB. With 4-bit quantization, ~0.5 bytes per param. 70B at 4-bit ≈ 40GB.
"Why does the model download so slowly?"
HF servers throttle anonymous traffic. Authenticate (huggingface-cli login) for higher limits, especially for gated models.
"What's device_map="auto" actually doing?"
Hugging Face's accelerate library partitions the model across available devices (GPU layers; CPU offload for excess). For small models, the whole thing goes on GPU. For huge models, layers spill to CPU (much slower but possible).
"Should I use safetensors or pytorch_model.bin?" Safetensors. Faster loading, safer (no arbitrary code execution risk). All modern HF models ship both.
Done¶
- Use pipeline for the quickest possible model usage.
- Use AutoTokenizer + AutoModelForCausalLM for direct control.
- Apply chat templates for chat-tuned models.
- Use sentence-transformers for embeddings.
- Load quantized models for memory efficiency.
09 - Fine-Tuning¶
What this session is¶
About 90 minutes. Adapt a pretrained model to your data. Modern parameter-efficient fine-tuning (LoRA) - feasible on consumer GPUs. By the end you'll have fine-tuned a small LLM on a custom dataset.
This page benefits from a GPU. CPU works but is very slow.
The two modes¶
- Full fine-tuning - update all model weights. Best quality; needs massive memory (7B model in FP16 needs ~28GB just for the optimizer states). Beyond most beginners' budgets.
- Parameter-efficient fine-tuning (PEFT) - update only a tiny subset of new parameters. LoRA is the most-used. Same effective quality for ~1% of the parameters' worth of training. Runs on a single consumer GPU.
We'll do LoRA.
What LoRA actually does¶
Each big linear layer in a transformer (the nn.Linear from page 04, scaled up) is a matrix W. Instead of updating W directly, LoRA learns a low-rank update:
W_new = W + A·B
where A is (d, r) and B is (r, d). The rank r is small (typically 8-64). The original W stays frozen; only A and B train.
Memory savings: instead of d × d parameters per layer (millions), you train d × r + r × d (tens of thousands). For a 7B model with rank-16 LoRA: ~10M trainable parameters instead of 7B.
You don't implement this - peft library handles it.
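Still, a toy sketch makes the mechanics concrete: a frozen linear layer plus a trainable A·B update. This is illustrative only - the class name, init, and scaling details are ours, not peft's:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: frozen W plus a trainable low-rank update A @ B."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze W
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # (d_in, r)
        self.B = nn.Parameter(torch.zeros(r, d_out))        # (r, d_out); zero init
        self.scale = alpha / r                              # so A@B = 0 at start: training
                                                            # begins from base behavior
    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 (512*8 + 8*512), vs 262656 params in the full layer
```

The count printed at the end is the whole point: the low-rank path trains ~3% as many parameters as the layer it adapts, and the ratio improves as layers get bigger.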
Setup¶
- peft - parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning, etc.).
- trl - Transformers Reinforcement Learning. Includes `SFTTrainer`, the easiest fine-tuning loop wrapper.
- bitsandbytes - 4-bit quantization, used by QLoRA.
A complete LoRA fine-tuning¶
We'll fine-tune a small model on a tiny dataset to make it answer in a specific style.
import torch
from datasets import Dataset
from transformers import (
AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
model_name = "microsoft/Phi-3-mini-4k-instruct"
# Quantized to 4-bit
bnb = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb,
device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# LoRA config
lora = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: 6.3M || all params: 3.8B || trainable: 0.16%
# A tiny training dataset (in production: load real data with `datasets`)
examples = [
{"text": "<|user|>\nWhat's 2+2?<|end|>\n<|assistant|>\nIt's 4, mate.<|end|>"},
{"text": "<|user|>\nHello!<|end|>\n<|assistant|>\nG'day!<|end|>"},
{"text": "<|user|>\nWhat's your favorite color?<|end|>\n<|assistant|>\nProbably blue, mate.<|end|>"},
# ... in a real run, hundreds to thousands of examples ...
] * 50
train_ds = Dataset.from_list(examples)
# Training config
cfg = SFTConfig(
output_dir="./lora-out",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
bf16=True,
logging_steps=10,
save_strategy="epoch",
max_seq_length=512,
)
trainer = SFTTrainer(
model=model,
tokenizer=tok,
train_dataset=train_ds,
args=cfg,
dataset_text_field="text",
)
trainer.train()
trainer.save_model("./lora-out/final")
The whole thing - model load, LoRA setup, training loop - fits in ~50 lines. SFTTrainer from trl wraps Hugging Face's Trainer with sensible defaults.
Run time: ~5-15 minutes on a free Colab T4 GPU for the small dataset above.
Use the fine-tuned model¶
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", device_map="auto", torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = PeftModel.from_pretrained(base, "./lora-out/final")
model.eval()
inputs = tok("<|user|>\nHi there!<|end|>\n<|assistant|>\n", return_tensors="pt").to(model.device)
with torch.no_grad():
out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
# Hopefully responds in the trained style ("G'day mate!")
The fine-tuned model = base model + LoRA adapter. The adapter is small (~30MB for our config); the base is shared.
Merge LoRA into the base (for deployment)¶
For inference at scale, you may want a single merged model:
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
tok.save_pretrained("./merged-model")
Result: a standalone model with LoRA's updates baked in. Drops the adapter layer overhead at inference time.
Hyperparameter notes¶
- `r` (LoRA rank) - 8, 16, 32, 64. Higher = more capacity, more memory. Start at 16.
- `lora_alpha` - usually 2×r. Acts as a scaling factor.
- `target_modules` - which linear layers to LoRA-fy. Common: `["q_proj", "v_proj"]` for cheap, `["q_proj", "k_proj", "v_proj", "o_proj"]` for fuller coverage. Model-specific naming.
- `learning_rate` - much higher than full fine-tuning (because you have fewer params). 1e-4 to 5e-4 typical.
- `per_device_train_batch_size` + `gradient_accumulation_steps` - the effective batch is the product. A small batch fits memory; accumulation simulates a bigger batch.
These need experimentation. Start with the defaults above; adjust.
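A quick sanity check on that last knob - the effective batch size is just the product (times GPU count, if you have several):

```python
per_device_train_batch_size = 2   # what fits in GPU memory at once
gradient_accumulation_steps = 4   # gradients summed over this many micro-batches
num_gpus = 1

# The optimizer steps once per effective batch, as if you had this much memory
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 8 - matches the SFTConfig above
```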
QLoRA - even smaller memory¶
The BitsAndBytesConfig(load_in_4bit=True, ...) we used is QLoRA - quantize the base model to 4-bit, train LoRA adapters on top in higher precision. Lets you fine-tune 7B models on a 12GB GPU. The standard approach for hobbyist fine-tuning.
What you can / can't fine-tune¶
LoRA fine-tuning is great for:
- Style adaptation - "respond in our brand's voice."
- Domain-specific Q&A - train on your support docs.
- Output format - JSON conformance, structured outputs.
- Tool / function calling - train the model to emit specific function calls.
LoRA is bad for:
- Teaching the model NEW factual knowledge. That requires more data + full fine-tuning, and the model often half-learns and hallucinates the rest. For facts, use RAG (page 10).
- Reasoning skill upgrades. Generally requires lots of data + more compute than LoRA gives.
Dataset format¶
Most fine-tuning recipes want a list of conversation strings in the model's chat format. Building one:
- Collect 50-1000+ example interactions in the desired style.
- Format each as a single string using the model's chat template.
- Wrap in a `datasets.Dataset`.
Real datasets often live on Hugging Face Hub - datasets.load_dataset("squad") etc. Filter / format as needed.
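In practice you'd call `tok.apply_chat_template(conversation, tokenize=False)` to produce these strings. A hand-rolled sketch of the same idea, hardcoding Phi-3-style markup (the helper name is ours):

```python
def to_training_text(turns: list[dict]) -> str:
    """Join a conversation into one Phi-3-style training string."""
    return "\n".join(f"<|{t['role']}|>\n{t['content']}<|end|>" for t in turns)

example = to_training_text([
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "It's 4, mate."},
])
print(example)
# <|user|>
# What's 2+2?<|end|>
# <|assistant|>
# It's 4, mate.<|end|>
```

Prefer the tokenizer's own template when available - each model family uses different special tokens, and mismatched markup quietly ruins a fine-tune.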
Exercise¶
You need a GPU (or Colab) for this exercise. CPU works but takes hours.
- Run the example above. Train for 1 epoch on the toy dataset. Confirm training loss decreases.
- Use the trained model. Run a few prompts. Notice the style.
- Increase the dataset size. Add 10 more diverse examples. Re-train. Compare outputs.
- Tweak `r`: try `r=8` vs `r=64`. Quality difference? Memory difference?
- (Stretch) Use a real dataset from Hugging Face: `datasets.load_dataset("squad", split="train[:1000]")`. Format the QA pairs into the chat template. Fine-tune. Evaluate by hand.
What you might wonder¶
"Why is my fine-tuned model worse than the base?" Common causes: dataset too small (under ~100 examples), learning rate too high (model overfits and forgets general knowledge), bad data formatting (model is learning your formatting bugs not your style). Start with a known-good recipe and iterate.
"What's 'catastrophic forgetting'?" The fine-tuned model loses knowledge from its base training. Severe with full fine-tuning; minimal with LoRA (the base weights are frozen). One reason LoRA is the default.
"How do I evaluate the fine-tuned model?" Page 11. Critical and the hardest part of ML.
"DPO? RLHF? PPO? GRPO?" Reinforcement-learning-from-feedback techniques used by frontier labs to align chat models. Beyond beginner; mentioned for awareness.
Done¶
- Distinguish full fine-tuning from LoRA / PEFT.
- Set up `peft` + `trl` for a real LoRA training run.
- Train and save a fine-tuned model.
- Load and use the trained adapter.
- Pick reasonable hyperparameters.
Next: Retrieval-Augmented Generation →
10 - Retrieval-Augmented Generation (RAG)¶
What this session is¶
About an hour. RAG is the dominant production pattern for LLMs answering questions over your data. Instead of fine-tuning facts in, you retrieve relevant passages at query time and pass them to the model.
Why RAG, not fine-tuning, for facts¶
Fine-tuning teaches a model patterns, styles, formats. Asking it to memorize facts works poorly: knowledge degrades, the model hallucinates "knowing" things, no clean way to update when facts change.
RAG separates concerns: the LLM is the language interface; the knowledge is in a database. Update facts by updating the database - no retraining.
The architecture¶
User question
↓
Embed the question (vector)
↓
Search a vector DB for similar passages → top-k passages
↓
Build a prompt: question + retrieved passages
↓
LLM generates answer using both
Five components:
1. Documents - your knowledge corpus (docs, PDFs, wiki).
2. Chunker - splits docs into ~200-1000 token passages.
3. Embedder - a model that turns text into vectors.
4. Vector store - stores passages + their embeddings; supports nearest-neighbor search.
5. LLM - generates the final answer.
A complete (minimal) RAG¶
from sentence_transformers import SentenceTransformer
import numpy as np
import torch
from transformers import pipeline
# 1. Documents - a tiny corpus
docs = [
"Lagos is the most populous city in Nigeria.",
"Abuja is the capital of Nigeria.",
"The Niger River flows through Mali, Niger, and Nigeria.",
"Python was created by Guido van Rossum in 1991.",
"Rust was first released in 2010 by Mozilla.",
"Go was designed at Google starting in 2007.",
]
# 2. Embed all documents (one-time index-build)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)
# shape: (6, 384)
# 3. Search function
def retrieve(query, k=2):
q_emb = embedder.encode([query], normalize_embeddings=True)
sims = (doc_embeddings @ q_emb.T).flatten() # cosine sim because normalized
topk = np.argsort(-sims)[:k]
return [docs[i] for i in topk]
# 4. Generate with retrieved context
gen = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct",
torch_dtype=torch.bfloat16, device_map="auto")
def answer(question):
context = "\n".join(f"- {p}" for p in retrieve(question))
prompt = f"""<|user|>
Use the following context to answer the question.
Context:
{context}
Question: {question}<|end|>
<|assistant|>
"""
out = gen(prompt, max_new_tokens=100, do_sample=False)[0]["generated_text"]
return out[len(prompt):] # strip the prompt
print(answer("What is the capital of Nigeria?"))
print(answer("Who created Python?"))
That's a working RAG in ~30 lines. The model answers using the retrieved context, not just its baked-in knowledge.
For real production you'd swap in a proper vector DB (next section); the LLM call stays the same.
Vector databases¶
For 100 documents, a NumPy dot product is fine. For 1M+ documents, you need a vector database with efficient approximate nearest neighbor search.
Self-hosted:
- FAISS (Facebook) - library, in-process. Fast. No persistence layer; you build that.
- Chroma - embedded, easy to start.
- Qdrant - server-mode, production-grade.
- Weaviate - feature-rich, server-mode.
- Milvus - distributed, for very large scale.
Hosted:
- Pinecone - first popular hosted vector DB.
- Cloud-native: AWS OpenSearch with k-NN, Postgres + pgvector, Redis with vector search.
For learning: Chroma or FAISS. For production: depends on scale and existing infra.
A Chroma example:
import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
documents=docs,
embeddings=doc_embeddings.tolist(),
ids=[f"doc-{i}" for i in range(len(docs))],
)
results = collection.query(
query_embeddings=embedder.encode(["What is Lagos?"]).tolist(),
n_results=2,
)
print(results["documents"])
Chunking strategies¶
Long documents must be split. Naive: split every N characters. Better:
- Fixed-size with overlap (e.g., 500 chars, 50-char overlap to preserve context across boundaries).
- Semantic chunks (paragraphs, headings).
- Recursive chunking - try paragraphs first, fall back to sentences, fall back to words.
langchain.text_splitter.RecursiveCharacterTextSplitter is the popular tool. Try a few; the best chunking is task-dependent.
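A fixed-size chunker with overlap is only a few lines - a sketch of the first strategy above:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks with overlap, so text that straddles a
    boundary appears intact in at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step back by the overlap each time
    return chunks

doc = "word " * 300  # a 1500-character stand-in document
chunks = chunk_text(doc, size=500, overlap=50)
print(len(chunks), len(chunks[0]))  # 4 500
```

Each chunk repeats the last 50 characters of the previous one - that redundancy is the price of not cutting a sentence in half at retrieval time.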
Embedding choices¶
Bigger embedding model = better retrieval, slower to embed, larger vectors.
| Model | Dim | Speed | Quality |
|---|---|---|---|
| `all-MiniLM-L6-v2` | 384 | very fast | decent |
| `all-mpnet-base-v2` | 768 | medium | good |
| `BAAI/bge-base-en-v1.5` | 768 | medium | excellent |
| `BAAI/bge-large-en-v1.5` | 1024 | slow | best |
| `text-embedding-3-small` (OpenAI) | 1536 | API | excellent |
| `nomic-embed-text-v1.5` | 768 | medium | excellent, open source |
For learning: all-MiniLM-L6-v2. For production: BGE or Nomic embed are strong open options.
Quality knobs¶
Things that matter, in order of impact:
- Chunking strategy. Bad chunks = bad retrieval. Tune first.
- Number of retrieved chunks (k). 3-10 typical. Too few = miss relevant info. Too many = context bloat, "lost in the middle."
- Re-ranking. Retrieve k=20, then re-rank with a more expensive model down to top-5. Improves quality at modest cost.
- Hybrid search. Combine semantic (vector) with keyword (BM25). Catches cases where exact word match matters.
- Query rewriting. LLM rewrites the user's question into a better search query.
- Embedding model. Better embeddings = better retrieval. Worth experimenting.
For a beginner, just fixed-size chunks + top-3 semantic retrieval is a strong baseline.
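For the hybrid-search knob, a common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion. A sketch - the doc ids are made up and k=60 is the conventional constant:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge ranked lists. Docs ranked highly in
    either list accumulate a larger score and float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d7"]   # vector-search ranking
keyword = ["d1", "d9", "d3"]    # BM25 ranking
print(rrf([semantic, keyword])) # d1 and d3 lead - they appear in both lists
```

Rank fusion sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales: only the ranks matter.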
When RAG fails¶
- User asks a question whose answer requires synthesis across many docs. RAG retrieves top-k passages, each independently; cross-document synthesis fails.
- Question is ambiguous. Retrieval gets the wrong passage; answer is confidently wrong.
- The corpus genuinely doesn't contain the answer. The LLM hallucinates because the user expects an answer.
Mitigations: explicit "I don't know" in the prompt; structured outputs that include source citations; user-facing transparency about what was retrieved.
Frameworks¶
Real RAG apps often use:
- LangChain - most popular framework. Composable chains for retrieval + generation.
- LlamaIndex - alternative, more retrieval-focused.
- Haystack - pipeline-oriented, German-engineered.
These wrap the patterns above with batteries included. For learning, building from scratch (like this page) makes the mechanics clear; for production, frameworks save time.
Exercise¶
- Run the minimal RAG above. Confirm the answers use the retrieved context.
- Expand the corpus: add 20 more facts. Try a question that's ambiguous between two retrieved docs - see how the model handles it.
- Different embedder: swap `all-MiniLM-L6-v2` for `BAAI/bge-base-en-v1.5`. Larger model; do retrieval results improve for tricky questions?
- Chunking exercise: download a Markdown doc (your `README.md` or any project's). Use `RecursiveCharacterTextSplitter` from langchain to chunk it. Index the chunks. Ask questions.
- (Stretch) Try Chroma instead of in-memory NumPy. Same RAG flow with persistent index.
What you might wonder¶
"What if the LLM ignores the retrieved context?" Happens. Make the prompt clearer: "Answer ONLY using the context above. If the context doesn't contain the answer, say 'I don't know.'" Smaller models ignore instructions more; bigger ones follow.
"Should I do RAG or fine-tuning?" Both, often. RAG for facts; fine-tuning for style + format. Don't pit them against each other.
"What's a 'retriever' vs an 'embedder'?" An embedder produces vectors. A retriever uses the embedder + a vector DB + post-processing to return passages. Same pipeline, different name for different layers.
"How do I evaluate a RAG?" Next page. Hardest part.
Done¶
- Build a RAG pipeline end-to-end with embeddings + vector search + LLM.
- Distinguish from fine-tuning (facts vs style).
- Recognize vector DB options.
- Apply basic quality knobs (chunking, k, re-ranking, hybrid search).
- Know LangChain / LlamaIndex / Haystack exist.
11 - Evaluation¶
What this session is¶
About an hour. The hardest part of building with AI - and the one most beginner tutorials skip. By the end you'll know why "looks good" is not evaluation, and how to do it for real.
Why this page matters¶
Most "AI products" you'll see are evaluated by their authors clicking around and saying "yep, looks good." That's why launched products go viral for embarrassing failures the moment a user does something unexpected.
Good evaluation is what separates a demo from a system. Most engineers - even experienced ML practitioners - get this wrong. Take this page seriously.
The fundamental rule¶
You cannot iterate on what you cannot measure.
Without an objective evaluation, every change is a coin flip. Did this prompt change improve things or make them worse? You can't tell. Without a number, you'll convince yourself it's better - because you wrote it.
A measurable eval lets you see real improvement, A/B test prompts, catch regressions when you change models, ship confidently.
Types of evaluation¶
Different problems need different evals.
Classification - easy¶
If your output is a class (positive/negative, A/B/C, 0-9):
- Accuracy - fraction correct.
- Precision - of predicted positives, what fraction are actually positive.
- Recall - of actual positives, what fraction did you find.
- F1 - harmonic mean of precision and recall.
- Confusion matrix - full breakdown of predicted vs actual.
scikit-learn:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
Done. Easy.
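To see the definitions in action, here's the arithmetic by hand on a toy binary problem:

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives: 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 1

precision = tp / (tp + fp)  # of predicted positives, fraction correct
recall = tp / (tp + fn)     # of actual positives, fraction found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 2))  # 0.75 0.75 0.75
```

`classification_report` computes exactly these numbers (per class, plus averages) - but computing them once by hand makes the report readable.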
Free-form text generation - hard¶
If your output is a paragraph of text (LLM chatbot, summarizer):
- Exact match - useless unless you're matching against a fixed answer.
- BLEU, ROUGE, METEOR - n-gram overlap with a reference. Useful for translation; poor for chat (paraphrase = bad score).
- Embedding similarity - cosine similarity between generated and reference embedding. Better than n-gram.
- LLM-as-judge - use a strong model to grade outputs. Most-used in practice, with caveats below.
- Human eval - gold standard, expensive.
For chat / RAG / summarization, LLM-as-judge is the practical default.
Retrieval - medium¶
If your problem is "did I retrieve the right passages":
- Recall@K - of all relevant passages, how many appear in your top-K results.
- MRR (Mean Reciprocal Rank) - average of `1 / rank-of-first-relevant`.
- nDCG - normalized discounted cumulative gain; rewards ranking relevant passages near the top.
You need a labeled dataset: each query has a known correct passage. Build this manually for ~100-1000 queries.
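Both metrics are short enough to compute yourself - a sketch over a toy labeled set (doc ids are made up):

```python
def recall_at_k(results: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose relevant passage appears in the top-k results."""
    hits = sum(rel in res[:k] for res, rel in zip(results, relevant))
    return hits / len(relevant)

def mrr(results: list[list[str]], relevant: list[str]) -> float:
    """Mean of 1/rank of the first relevant result (0 if never retrieved)."""
    total = 0.0
    for res, rel in zip(results, relevant):
        if rel in res:
            total += 1 / (res.index(rel) + 1)
    return total / len(relevant)

# Per query: the retriever's ranked doc ids, and the known-correct passage
results = [["d1", "d2", "d3"], ["d9", "d4", "d7"], ["d5", "d6", "d8"]]
relevant = ["d1", "d4", "d8"]

print(recall_at_k(results, relevant, k=2))  # 2/3: the third query misses at k=2
print(mrr(results, relevant))               # (1 + 1/2 + 1/3) / 3 ≈ 0.611
```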
LLM-as-judge¶
You have outputs from your system. You want to know "are these good?"
from openai import OpenAI  # or any LLM client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
judge_prompt = """You are grading an AI assistant's answer.
Question: {question}
Expected answer: {gold}
AI answer: {generated}
Grade the AI answer on:
- Correctness (0-5): does it match the expected answer in substance?
- Completeness (0-5): does it cover the key points?
- Conciseness (0-5): is it free of fluff?
Respond ONLY with JSON: {{"correctness": N, "completeness": N, "conciseness": N, "rationale": "..."}}
"""
def grade(question, gold, generated):
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": judge_prompt.format(
question=question, gold=gold, generated=generated
)}],
)
import json
return json.loads(response.choices[0].message.content)
Run your system against a held-out dataset of (question, gold-answer) pairs. Have the judge grade each. Aggregate the scores.
Caveats:
- Use a more capable model for judging than for generating. Don't have GPT-3.5 grade GPT-3.5; have GPT-4 or Claude do it.
- Judge bias. LLM judges have biases (preferring longer answers, preferring their own family's models). Counter with care.
- Calibration. Run human-judged grades on a subset; check the LLM judge agrees with humans.
- Pairwise > absolute. "Which of A and B is better" judgments are more stable than absolute 1-5 scores.
Even with caveats, LLM-as-judge is far better than "looks good to me."
Build an evaluation dataset¶
For LLM apps, this is the work you'll spend the most time on. Patterns:
- Production traces. Sample real user queries from your service logs. Manually label expected answers. ~100-1000 examples.
- Adversarial cases. Specifically construct queries that should fail or should succeed. Boundary cases, ambiguous queries, out-of-scope queries.
- Public benchmarks. MMLU (multitask), TruthfulQA, HumanEval (coding), GSM8K (math). Useful for "how does my model compare," less useful for "is my prompt better."
A good eval dataset is representative + adversarial + maintained. Production examples + manually-curated edge cases. Refresh as your product evolves.
A real workflow¶
Pattern that works:
- Build a small golden dataset. ~50-200 examples to start.
- Run your current system. Score with LLM-as-judge. Get a baseline number.
- Make a change (new prompt, new model, new retrieval).
- Re-run. Compare. If the number went up materially, ship; if it went down, revert; if it's noise, you didn't change enough.
- Expand the eval set as you discover failure modes in production.
This loop is the whole job. Every successful AI product team runs some version of it. Every failed one didn't.
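Step 4's comparison boils down to averaging judge scores across the eval set and looking at the delta - a sketch with made-up grades:

```python
import statistics

def mean_score(grades: list[dict]) -> float:
    """Average the judge's correctness scores (0-5) across the eval set."""
    return statistics.mean(g["correctness"] for g in grades)

# Hypothetical judge output for the same eval set, before and after a change
baseline = [{"correctness": 3}, {"correctness": 4}, {"correctness": 3}]
candidate = [{"correctness": 4}, {"correctness": 4}, {"correctness": 5}]

delta = mean_score(candidate) - mean_score(baseline)
print(round(delta, 2))  # 1.0 - a material improvement, worth shipping
```

With a 50-example set, small deltas are noise; track the same metric run-to-run and only act on shifts larger than the run-to-run wobble.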
Cost and latency are evaluation criteria¶
A model that's 5% better but 10× slower might be worse for your product. Track both quality and ops costs:
- Per-request cost (tokens × model price).
- p50, p95 latency.
- Throughput (requests per second).
A useful "is the next model worth it" question: "for every 1% quality improvement, how much do cost/latency change?"
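Per-request cost is simple arithmetic - a sketch with illustrative prices (check your provider's current rates):

```python
# Hypothetical prices, in dollars per 1M tokens
input_price = 0.15
output_price = 0.60

tokens_in, tokens_out = 1200, 300  # a typical RAG request: big prompt, short answer
cost = tokens_in / 1e6 * input_price + tokens_out / 1e6 * output_price
print(f"${cost:.6f}")  # $0.000360 per request
```

Multiply by expected request volume before comparing models: fractions of a cent per request become real money at millions of requests.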
Bias, fairness, safety¶
Big and important; out of scope for a beginner page. The minimum:
- Test on diverse inputs. Different demographics, languages, dialects, edge cases.
- Test refusal behavior. Does it refuse harmful requests? Does it over-refuse benign ones?
- Test on adversarial prompts (prompt injection).
Production teams have dedicated red-teamers. For your first project, manual sampling is fine.
Specific tools¶
- `evaluate` (Hugging Face) - eval-metric library. Bundles many standard metrics.
- `langsmith` (LangChain Labs) - tracing + evaluation platform.
- `promptfoo` - open-source eval CLI for LLM prompts.
- `ragas` - RAG-specific evaluation metrics (faithfulness, context relevancy).
- `lm-eval-harness` (EleutherAI) - runs many academic benchmarks.
For learning: roll your own (the snippet above). For scale: pick one tool.
Exercise¶
- Build a tiny eval dataset. Use the RAG from page 10. Write 10 (question, expected-answer) pairs covering your corpus.
- Run the RAG. Score each output manually with a 1-5 score on correctness. Average the scores; that's your baseline.
- Make a change. Change the prompt, or k=2 → k=4, or use a bigger embedder. Re-score. Did it improve?
- Add LLM-as-judge. Have the model itself score the outputs against expected answers. Compare to your manual scores. How well do they agree?
- (Stretch) Use the `ragas` library on your RAG. Run its faithfulness + answer_relevancy metrics.
What you might wonder¶
"How big does my eval set need to be?" 50 is the minimum for noisy signal. 500+ is comfortable. 5000+ for academic-paper-strength results. For getting started, start at 50 and expand.
"Can I trust LLM-as-judge?" Mostly. Pair with human spot-checks (10% of examples reviewed by you). When LLM-as-judge says scores went up but you can't see the improvement, something's miscalibrated.
"What about RLHF / DPO / online evaluation?" Real production AI products have ongoing eval pipelines collecting user feedback, A/B testing changes, fine-tuning on preference data. Beyond beginner; mentioned for awareness.
"How does this compare to evaluating a 'normal' classifier?" Classifiers have ground truth + simple metrics. LLM outputs have ambiguity at every step - there's no single correct answer to "summarize this article." Evaluation gets correspondingly fuzzy. The discipline is the same; the metrics are softer.
Done¶
- Recognize different eval types (classification, generation, retrieval).
- Build an evaluation dataset.
- Use LLM-as-judge correctly (with calibration awareness).
- Run the build → measure → change → measure loop.
- Track cost + latency alongside quality.
12 - Serving Models¶
What this session is¶
About 45 minutes. How to expose your model as a service users can call. Local options (Ollama, llama.cpp), high-performance serving (vLLM), and rolling your own HTTP API.
The simplest possible serve: Ollama¶
Ollama is the easiest way to run an LLM locally.
# Install (macOS):
brew install ollama
ollama serve &
# Pull and run a model:
ollama pull llama3.2:3b
ollama run llama3.2:3b
You get an interactive chat. To use programmatically:
import httpx
r = httpx.post("http://localhost:11434/api/chat", json={
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": False,
})
print(r.json()["message"]["content"])
Ollama handles the model loading, quantization, GPU detection. For local dev, it's the easiest start.
llama.cpp - for tighter control¶
llama.cpp is a C++ inference engine that runs GGUF-quantized models on CPU + GPU. Lower-level than Ollama; faster; more configurable.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Download a quantized model
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Run
./llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello!" -n 100
Or serve over HTTP (llama.cpp ships an OpenAI-compatible server):
./llama-server -m Llama-3.2-3B-Instruct-Q4_K_M.gguf --port 8080
Many projects use llama.cpp under the hood (Ollama, LM Studio, etc.).
vLLM - high-throughput serving¶
For production-grade serving of large models with high concurrency: vLLM.
Run a server (it listens on port 8000 by default):
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct
OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
r = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
)
print(r.choices[0].message.content)
vLLM's killer features:
- PagedAttention - KV-cache management like virtual memory. Way better GPU utilization than naive serving.
- Continuous batching - interleaves requests at the token level. Many concurrent users; high throughput.
- OpenAI-compatible API - drop-in replacement for OpenAI client libraries.
Used by many production LLM deployments. Needs a GPU; doesn't run on CPU usefully.
A simple HTTP wrapper around your own model¶
If you want full control:
# server.py
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
app = FastAPI()
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct",
torch_dtype=torch.bfloat16,
device_map="auto",
)
model.eval()
class GenerateRequest(BaseModel):
prompt: str
max_new_tokens: int = 100
temperature: float = 0.7
@app.post("/generate")
def generate(req: GenerateRequest):
inputs = tok(req.prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
out = model.generate(
**inputs,
max_new_tokens=req.max_new_tokens,
temperature=req.temperature,
do_sample=True,
)
text = tok.decode(out[0], skip_special_tokens=True)
return {"text": text[len(req.prompt):]}
Run: uvicorn server:app --host 0.0.0.0 --port 8080. Test:
curl -X POST http://localhost:8080/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Once upon a time", "max_new_tokens": 30}'
For a small model and a few users, this works fine. For high concurrency, use vLLM.
Streaming responses¶
Users want to see tokens as they generate, not wait for the whole response. Implementation:
from fastapi.responses import StreamingResponse
from threading import Thread
from transformers import TextIteratorStreamer
@app.post("/stream")
def stream(req: GenerateRequest):
inputs = tok(req.prompt, return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate, kwargs={
**inputs, "max_new_tokens": req.max_new_tokens,
"streamer": streamer, "do_sample": True, "temperature": req.temperature,
}).start()
def gen():
for token in streamer:
yield token
return StreamingResponse(gen(), media_type="text/plain")
For production: use vLLM (streaming is built-in) rather than rolling your own threading.
Containerize for deployment¶
A Dockerfile for the FastAPI server:
FROM python:3.12-slim
WORKDIR /app
# Installs the default PyTorch wheel; pick the build matching your CUDA version
RUN pip install --no-cache-dir torch fastapi uvicorn pydantic transformers
COPY server.py .
EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]
Build, push, run on Kubernetes (Containers + Kubernetes paths).
Deployment concerns¶
- GPU scheduling. Kubernetes can schedule pods to GPU nodes (`nvidia.com/gpu: 1` in resources). NVIDIA's GPU Operator manages drivers.
- Cold start. Loading a 7B model takes 10-30 seconds. Don't scale to zero unless cold start is acceptable.
- Model caching. Embed the model weights in the container image (huge), or mount as a PV (faster restarts).
- Autoscaling. GPU pods are expensive. Scale based on request queue depth or GPU utilization, not CPU.
- Observability. Latency per request, tokens/sec, GPU memory, queue depth.
Cost / latency calibration¶
Rough numbers for a single A100 GPU serving Llama-3.1-8B (FP16):
- ~30-80 tokens/sec generation rate.
- ~16 GB GPU memory for the weights alone, plus KV-cache per concurrent request.
- A few concurrent users naively; vLLM bumps this to dozens.

Quantized (4-bit) on a single consumer 24GB GPU:
- ~40-100 tokens/sec.
- A single user is comfortable; concurrency is lower.
For frontier models (70B+), you need multi-GPU or sharded serving. Beyond beginner.
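Those generation rates translate directly into user-visible latency - a back-of-envelope sketch:

```python
gen_rate = 50          # tokens/sec, mid-range of the figures above
response_tokens = 250  # a typical chat-length answer

latency_s = response_tokens / gen_rate
print(latency_s)  # 5.0 seconds until the full answer finishes generating
```

This is why streaming matters: five seconds of blank screen feels broken, while five seconds of tokens flowing feels fine.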
OpenAI compatibility is a contract¶
Many tools (LangChain, llama-index, CLI tools) speak the OpenAI HTTP API. vLLM, llama.cpp's server, Ollama (with its /v1/... endpoints) all implement it. Building against the OpenAI API makes you portable across self-hosted and hosted backends.
from openai import OpenAI
# Same code works for openai.com, vLLM, Ollama, llama.cpp:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
Exercise¶
- Install Ollama. Pull a small model. Chat.
- Call it from Python via its HTTP API.
- Build a tiny FastAPI server that wraps a small HF model (page 04's MLP works, or a Phi-3-mini for fun). Curl it.
- (Stretch - GPU helpful) Install vLLM. Serve `microsoft/Phi-3-mini-4k-instruct`. Use the OpenAI client to call it.
- (Stretch) Containerize your FastAPI server. Build the image. Run via `docker run`.
What you might wonder¶
"Should I serve via my own framework or vLLM?" For real production: vLLM (or TGI, Triton). The hand-rolled FastAPI version works fine for a hobby project but doesn't handle concurrency well.
"How do I keep the model warm?" Don't scale to zero. Have at least one replica always running. Health checks must respond fast (≤1s) without invoking the model.
"GPU memory keeps growing - what?"
KV-cache (page 07) accumulates as context grows. Cap the context length (`--max-model-len` in vLLM) or trim old turns out of the prompt.
"Open source vs API providers?" Both have a place. OpenAI/Anthropic/Google APIs are easy and powerful; you pay per token. Self-hosting is cheaper at scale but adds ops complexity. Most production teams use both - APIs for high-quality requests, self-hosted for high-volume cheaper requests.
Done¶
- Run an LLM locally with Ollama or llama.cpp.
- Serve high-throughput with vLLM.
- Build a custom FastAPI wrapper.
- Stream responses.
- Containerize for deployment.
13 - Picking a Project¶
What this session is¶
About 30 minutes plus browsing. AI OSS that accepts first contributions, with specific candidates.
What kinds of AI projects fit beginners¶
Your toolkit so far: PyTorch, Hugging Face, RAG, evaluation, serving. Good targets:
- Inference engines & serving (vLLM, llama.cpp, Ollama) - high-quality issue tickets across many difficulty levels.
- Tokenization / data tooling (Hugging Face `tokenizers`, `datasets`).
- Embedding / RAG libraries (sentence-transformers, llama-index, langchain).
- Evaluation tools (lm-eval-harness, ragas, promptfoo).
- Adjacent ML tools (numpy, scipy, scikit-learn).
- Documentation - every major project has doc-improvement work.
For more research-y work (PyTorch core, training algorithms, model architectures), you'll need deeper expertise. Build on this path first.
10-minute evaluation¶
Same standard as other beginner paths:
| Signal | Target |
|---|---|
| Stars | 100-50000 |
| Last commit | Within a month |
| Open PRs | Some, not 300+ |
| Recent PR merge time | Under 14 days |
| `good first issue` count | ≥5 |
| CONTRIBUTING.md | yes, readable |
| Tests pass on fresh clone | yes |
Candidates¶
Tier 1 - friendly, smaller scope¶
- `huggingface/transformers` - yes, the big one, BUT they have excellent issue triage. Many docs/examples PRs. Look at `good first issue`.
- `langchain-ai/langchain` - chained LLM workflows. Large but very welcoming; tons of easy-mode integrations to add.
- `run-llama/llama_index` - RAG-focused alternative to LangChain.
- `promptfoo/promptfoo` - eval tool. Small enough to be approachable; very active.
- `huggingface/tokenizers` - tokenizer library. Rust core + Python bindings.
Tier 2 - well-organized¶
- `vllm-project/vllm` - production inference serving. Issues exist at all levels.
- `huggingface/peft` - LoRA + friends. Smaller surface; active.
- `huggingface/datasets` - data loading. Adding a new dataset adapter is a common first contribution.
- `sentence-transformers/sentence-transformers` - embeddings library.
- `unslothai/unsloth` - fast fine-tuning. Welcoming.
Tier 3 - bigger, more visible¶
After 1-2 PRs.
- `pytorch/pytorch` - yes, eventually. Excellent labels; SIG structure; contributors are well shepherded.
- `huggingface/transformers` - larger contributions (new model architectures, etc.).
- `triton-lang/triton` - GPU programming DSL. Needs Triton + CUDA understanding.
Tier 4 - don't start here¶
- Foundation model labs (OpenAI, Anthropic, DeepMind) - closed source.
- PyTorch internals (autograd, distributed) - deep specialty.
Finding issues¶
Project's Issues tab. Filter by good first issue / documentation / help wanted.
Many AI projects label specific kinds of work:
- enhancement: docs
- models: add
- integration: add
- bug: confirmed
Read 5-10 issues; find one with clear repro and contained fix. Comment to claim; wait for maintainer.
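If you use the GitHub CLI, the same filtering can be scripted. A sketch, assuming `gh` is installed and authenticated; the repo here is just an example:

```shell
# List open good-first-issue tickets for a project (example: huggingface/peft)
gh issue list --repo huggingface/peft \
  --label "good first issue" --state open --limit 20
```

Swap in the `documentation` or `help wanted` labels to widen the net.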
What counts¶
For AI OSS work:
- Fix a typo in a model's documentation.
- Add a missing example in a tutorial notebook.
- Fix a quantization bug for a specific GPU.
- Add a new embedding model adapter.
- Improve an evaluation metric's implementation.
- Add a missing integration to a chain framework.
- Add support for a new fine-tuning recipe.
- Translate documentation.
All real. All count.
Specific recommendation: huggingface/transformers docs¶
For an easy first PR: open transformers/docs/source/en/ in a clone, find a doc page that's missing an example or has a confusing sentence. Submit a fix. The HF team is responsive and welcoming; you'll hear back within days. PR merges fast.
Exercise¶
- Browse three projects from Tier 1-2.
- Run the 10-minute eval on each.
- Pick the most responsive.
- Read CONTRIBUTING.md.
- Clone, install, and run their tests.
- Browse open issues. Pick two candidates. Don't claim yet.
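The clone-install-test step usually looks like this (placeholder repo name; the extras name and test layout vary per project, so check its CONTRIBUTING.md):

```shell
git clone git@github.com:<org>/<repo>.git
cd <repo>
pip install -e ".[dev]"   # editable install with dev extras (name varies)
pytest tests/ -x          # stop at the first failure
```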
What you might wonder¶
"I want to contribute to PyTorch core. Can I?"
Yes, eventually. Start with their docs work or with the smaller modules first. The bar is real, but lower than for the Linux kernel.
"I'm scared of touching ML papers' reference implementations." Don't be. Start with documentation. Reference implementations of papers (DeepMind's work, etc.) often have terse README, scattered hyperparameters, missing examples. Doc PRs are welcomed.
"What about Anthropic / OpenAI / Google work?"
Their research models are mostly closed-source. Their developer tooling (OpenAI's cookbook and openai-python, Anthropic's anthropic-sdk-python) is public and accepts PRs.
Done¶
- Recognize AI-OSS contribution shapes.
- Run a 10-minute evaluation.
- Have specific projects in mind.
Next: Anatomy of an AI OSS project →
14 - Anatomy of an AI OSS Project¶
What this session is¶
Read a real AI OSS repo top to bottom so the next one feels familiar.
Case study: huggingface/peft¶
PEFT (Parameter-Efficient Fine-Tuning) implements LoRA, QLoRA, IA3, prefix-tuning, etc. Small enough to read in a sitting; well-maintained.
Typical top level:
README.md
CONTRIBUTING.md
LICENSE
setup.py / pyproject.toml
src/peft/ # library code
tests/
examples/
docs/
.github/workflows/
What to read, in order¶
1. README.md (5 min)¶
What the project is. Quickstart example. Supported methods.
2. CONTRIBUTING.md (5 min)¶
How to set up the dev environment. Code style. Tests. PR rules.
3. setup.py / pyproject.toml (2 min)¶
Dependencies. Optional extras. Python version.
4. src/peft/ (15 min)¶
The package itself:
src/peft/
├── __init__.py # public API
├── peft_model.py # main PeftModel class
├── config.py # config classes
├── tuners/
│ ├── lora.py # LoRA implementation
│ ├── ia3.py
│ ├── prefix_tuning.py
│ └── ...
└── utils/
Read `__init__.py` first - it shows the public API surface. Then read `lora.py` - LoRA is the most-used technique and the one you already know from fine-tuning.
5. tests/ (10 min)¶
Pick test_lora.py (or similar). See how the team validates that LoRA still works across model architectures.
6. examples/ (10 min)¶
Working notebooks. Reproducible end-to-end runs.
7. .github/workflows/ (5 min)¶
tests.yml - runs pytest matrix. build_docs.yml - builds docs. release.yml - pushes to PyPI.
CI is the spec: whatever it runs, your PR must pass.
What to look for¶
- Where does data flow? For a training-time library: model → tuner wrapper → optimizer → save. For RAG: query → embed → search → context → LLM → response.
- Where's the public API? Usually `__init__.py` or a `models.py`/`api.py`.
- Where are model architectures? Usually `models/` or per-architecture files.
- Where are tests? `tests/`. Match each test to a code file.
- What's "magic"? Decorators that register models (`@register_model`), config classes that auto-load. Read the registration logic once.
Common AI-project patterns¶
- Registry pattern. Models, tuners, integrations registered by string. New addition = add to registry + implement interface.
- Hub integration. Models loaded with `from_pretrained("model-id")`. Look for `_load_pretrained_model` or similar.
- Configuration as a dataclass. `@dataclass class FooConfig`. Serializes to JSON for reproducibility.
- Mixed-precision and device handling. `with torch.cuda.amp.autocast():` blocks; `model.to(device)`.
- Pipeline abstraction. High-level wrapper over tokenizer + model + generation logic.
Once you see these in one project, you see them everywhere.
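The registry and dataclass-config patterns together fit in a few lines. This is a toy sketch with made-up names (`TUNER_REGISTRY`, `register_tuner`, `LoraTuner`), not peft's actual code:

```python
from dataclasses import dataclass, asdict
import json

# Registry pattern: implementations are looked up by string name.
TUNER_REGISTRY = {}

def register_tuner(name):
    def decorator(cls):
        TUNER_REGISTRY[name] = cls
        return cls
    return decorator

# Configuration as a dataclass: serializes to JSON for reproducibility.
@dataclass
class LoraConfig:
    r: int = 8
    lora_alpha: int = 16
    target_modules: tuple = ("q_proj", "v_proj")

    def to_json(self) -> str:
        return json.dumps(asdict(self))

@register_tuner("lora")
class LoraTuner:
    def __init__(self, config: LoraConfig):
        self.config = config

# Adding a new technique = implement the interface + register it.
tuner_cls = TUNER_REGISTRY["lora"]
tuner = tuner_cls(LoraConfig(r=4))
print(tuner.config.to_json())
```

A new tuner class plus one `@register_tuner("...")` line is exactly the "add to registry + implement interface" shape described above.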
Reading the test suite¶
Tests document expected behavior. For peft:
- `test_lora.py::test_lora_save_load` - round-trip preservation.
- `test_lora.py::test_lora_target_modules` - which layers get adapters.
- `test_lora.py::test_lora_merge` - merging LoRA back into base weights.
Each test names the contract. To break the test is to break the contract.
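The round-trip contract is easy to state in code. A toy version (hypothetical `ToyAdapter`; not peft's real test code):

```python
# Toy illustration of a save/load round-trip contract,
# in the spirit of a test like test_lora_save_load.
class ToyAdapter:
    def __init__(self, scale: float, target_modules: list[str]):
        self.scale = scale
        self.target_modules = target_modules

    def state_dict(self) -> dict:
        return {"scale": self.scale, "target_modules": list(self.target_modules)}

    @classmethod
    def from_state_dict(cls, state: dict) -> "ToyAdapter":
        return cls(state["scale"], state["target_modules"])

def test_round_trip():
    original = ToyAdapter(0.5, ["q_proj", "v_proj"])
    restored = ToyAdapter.from_state_dict(original.state_dict())
    # The contract: saving then loading preserves every field.
    assert restored.state_dict() == original.state_dict()

test_round_trip()
```

A change that makes this assertion fail has broken the contract, whatever else it improves.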
Counter-example: pytorch/pytorch¶
Several million lines. C++/CUDA/Python. Build system alone is a project. Don't read top-to-bottom. Instead, find a specific module (torch/optim/, torch/utils/data/) and read just that.
Counter-example: langchain-ai/langchain¶
Monorepo with ~100 packages. Hundreds of integrations. Don't read top-to-bottom. Pick one integration package (e.g., libs/community/langchain_community/llms/anthropic.py) and read just that.
Exercise¶
- Clone `huggingface/peft`.
- Spend 45 minutes reading per the order above.
- After: explain to yourself, out loud:
- What does this project do?
- What's the public API?
- Where would a new technique (e.g., new LoRA variant) be added?
- How is it tested?
- Pick one open `good first issue`. Locate the code it concerns.
What you might wonder¶
"I read it. I don't fully understand it." That's fine. Goal is geography, not mastery. You should know roughly where things live. Mastery comes from changes.
"The code uses techniques I haven't learned (mixin classes, metaclasses, etc.)." Note them. Don't get stuck. Modify a small piece first.
"It uses CUDA / accelerate / DeepSpeed. I can't run on my laptop."
You can still read and contribute. Many PRs are CPU-testable. Look for @require_torch_gpu decorators on tests - those are GPU-only; the rest you can run.
Done¶
- Read a real AI OSS repo with a plan.
- Know the typical layout.
- Have a target issue.
Next: Your first contribution →
15 - Your First Contribution¶
What this session is¶
The whole thing. Walk through an AI OSS contribution end-to-end.
The workflow¶
- Fork on GitHub.
- Clone your fork.
- Add upstream as remote.
- Branch off main.
- Set up the dev environment (install with extras; run tests).
- Change the file(s).
- Run lint + tests locally.
- Push to your fork; open PR.
Step 1: Fork & clone¶
git clone git@github.com:<you>/peft.git
cd peft
git remote add upstream git@github.com:huggingface/peft.git
git fetch upstream
Step 2: Branch¶
Always a fresh branch off main.
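For example (the branch name is just a convention; pick something descriptive):

```shell
git fetch upstream
git checkout -b docs/fix-lora-example upstream/main
```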
Step 3: Set up dev environment¶
For most HF projects:
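A typical editable install with dev extras (the extras name varies; check the project's CONTRIBUTING.md):

```shell
pip install -e ".[dev]"   # or ".[test]", ".[quality]" - see CONTRIBUTING.md
```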
For projects requiring GPU, run only CPU tests first:
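In HF projects, GPU-only tests are usually marked (e.g., with `@require_torch_gpu`) and skip automatically on a CPU-only machine, so a plain run works:

```shell
pytest tests/ -x   # GPU-marked tests skip automatically without a GPU
```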
If anything fails on a fresh clone, fix that first or ask in the issue.
Step 4: Make the change¶
Small. Focused. Tested.
- Docs typo / clarification - edit the `.md` file in `docs/source/`.
- Add an example - add a new file under `examples/`.
- Fix a bug - change the code; add or update a test that proves the fix.
For a first PR, prefer the first two. Bug fixes are great once you know the project.
Step 5: Re-run CI's commands locally¶
Look in .github/workflows/tests.yml. Typical:
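In PEFT's case a Makefile wraps the CI commands; these are representative, not exact - check the workflow file:

```shell
make quality   # lint / formatting checks
make test      # pytest suite
```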
All green? Push. Red? Fix locally first.
Step 6: Commit and push¶
DCO if required (git commit -s).
Step 7: Open the PR¶
On upstream repo, "Compare & pull request."
- Title. Short, descriptive. Conventional-commit style if the project uses it.
- Description. What changed, why, how tested. `Closes #123` references the issue.
- Checklist. Address every item in the PR template.
Submit. CI runs. Fix anything red by pushing more commits.
Worked example: typo in PEFT LoRA docs¶
Suppose you noticed `docs/source/conceptual_guides/lora.md` has an outdated `target_modules=["query_key_value"]` example that no longer applies to current Llama configs.
git clone git@github.com:<you>/peft.git
cd peft
git remote add upstream git@github.com:huggingface/peft.git
git fetch upstream
git checkout -b docs/lora-target-modules-llama
# Edit docs/source/conceptual_guides/lora.md
# Add a note: "For Llama-style models, use ['q_proj','v_proj']."
make quality
make docs
git add docs/source/conceptual_guides/lora.md
git commit -m "docs: clarify LoRA target_modules for Llama-style models"
git push origin docs/lora-target-modules-llama
Open PR. Wait for review.
What review looks like¶
- "LGTM, merging." Done.
- "Could you change these?" Address. Push commits to same branch.
- "Not quite - we already have a section for this." Update or close.
- Silence for a week → polite check-in comment.
HF teams are responsive (usually within days).
After the merge¶
- Update your fork's `main`.
- Delete the branch.
- Take a screenshot.
- Sit with it.
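Syncing your fork's `main` and deleting the merged branch (the first two items above) is a short sequence:

```shell
git checkout main
git fetch upstream
git merge upstream/main   # or: git rebase upstream/main
git push origin main
git branch -d docs/lora-target-modules-llama   # delete the merged branch
```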
After your first PR¶
- Pick another issue. Familiarity compounds - second is much easier.
- After 3-5 PRs in one project, become a regular. Review others' PRs.
- Pick a model architecture you care about. Contribute an integration.
- Move toward research code: paper implementations, training-script improvements.
What you might wonder¶
"PR sits for weeks?" HF responds fast. Other AI projects (research orgs, slower-paced labs) can take longer. Polite check-in after 7-10 days.
"What about PyTorch core?"
Larger surface, more rigorous review. CLA required, RFCs for non-trivial changes. Start with the docs/ tree there.
"What about OpenAI / Anthropic SDKs?"
Yes, they accept PRs to their clients (openai-python, anthropic-sdk-python). Closed-source models, open-source clients.
"Maintainer rude?" Disengage. Try another project. AI OSS has many welcoming homes.
Done with this path¶
You've:
- Installed PyTorch and the AI Python stack.
- Trained a small neural net on MNIST.
- Used Hugging Face for text generation.
- Fine-tuned a model with LoRA.
- Built a small RAG pipeline.
- Evaluated outputs honestly.
- Served a model locally.
- Read a real AI OSS project.
- Submitted a PR.
What you should do next: build a small AI tool you actually want to exist. The technology rewards practice. Pick one problem, build the simplest possible solution, iterate.
Recommended next paths on this site:
- AI Systems Engineering (senior reference) - 24-week deep dive: kernels, distributed training, inference serving, evaluation infrastructure.
- Python from Scratch - if your Python feels shaky.
- Linux from Scratch - the substrate AI runs on.
- Kubernetes from Scratch - where AI serving infra lives.
Congratulations. You are no longer a beginner.