AI Systems From Scratch (Beginner)¶
Beginner path: heard-of-ChatGPT → training a small net, fine-tuning with LoRA, building RAG, serving locally, contributing to AI OSS.
AI Systems From Scratch - Beginner to OSS Contributor¶
From "I've heard of AI / LLMs" to "I can train a small model, fine-tune a transformer, build a small RAG app, evaluate it honestly, and submit a fix to an AI-adjacent OSS project."
Who this is for¶
- You've finished Python From Scratch (or you're comfortable enough in Python to write small programs).
- You've never trained a model, OR you've copy-pasted some PyTorch / Hugging Face code without really understanding what it does.
Soft prerequisite¶
Python comfort is mandatory - AI tooling lives in Python. If you can't write a function, walk a list, and read a stack trace, do Python From Scratch first.
You do not need a PhD in math. We use linear algebra at the level of "dot products and matrices"; we explain everything else as it appears.
What you'll need¶
- A computer. A GPU helps a lot but isn't required for the first 8 pages. Options for hands-on GPU work: Google Colab (free tier), Kaggle Notebooks (free), Lambda Labs (cheap, pay-per-minute), AWS / GCP (paid).
- Python ≥3.10 (you set this up in the Python beginner path).
- A text editor.
- About 5 hours/week. Path is sized for 4-6 months.
Why AI systems¶
- Biggest growth area in software. The job market and OSS activity around LLMs / ML infra is the most active it's ever been.
- OSS is the heart of the field. PyTorch, Hugging Face, vLLM, llama.cpp, Ollama, LangChain - all open-source and welcoming.
- The barrier is lower than it looks. Modern tooling lets you fine-tune real models with ~50 lines of code; serve them with another ~30.
How this path works¶
Same template as the other beginner paths: one concept per page, code first then walkthrough, exercise, Q&A, done recap.
We use PyTorch as the framework throughout - most popular, best ecosystem. Hugging Face Transformers for pre-trained models. vLLM / Ollama / llama.cpp for inference.
The pages¶
| # | Title | What you'll know after |
|---|---|---|
| 00 | Introduction | What we're doing and why |
| 01 | Setup | Python + PyTorch + CUDA (or CPU) working |
| 02 | Tensors | PyTorch's central data type |
| 03 | Linear algebra you actually need | Dot products, matmul, gradients (intuitive) |
| 04 | Your first neural network | A small MLP from scratch |
| 05 | Training loop | Loss, optimizer, gradient descent |
| 06 | Inference and saving | Loading a pretrained model, running it |
| 07 | Transformers and tokenization | What an LLM actually does |
| 08 | Hugging Face Transformers | Pre-trained models in 3 lines |
| 09 | Fine-tuning | Adapt a model to your data (LoRA-friendly) |
| 10 | Retrieval-Augmented Generation | Embeddings + vector DB + LLM |
| 11 | Evaluation | The hardest part of ML done seriously |
| 12 | Serving models | vLLM, Ollama, simple HTTP wrappers |
| 13 | Picking a project | AI-OSS candidates |
| 14 | Anatomy of an AI OSS project | Case study |
| 15 | Your first contribution | Workflow + PR |
Start with Introduction.
00 - Introduction¶
What this session is¶
A 10-minute read. No code. Sets expectations.
What you're going to be able to do, eventually¶
By the end:
- Manipulate tensors confidently with PyTorch.
- Build, train, and use a small neural network from scratch.
- Load a pre-trained transformer from Hugging Face and use it.
- Fine-tune that transformer on your own data (parameter-efficient with LoRA).
- Build a small Retrieval-Augmented Generation (RAG) app.
- Evaluate model quality the right way (most people get this wrong).
- Serve a model behind an HTTP API.
- Clone an AI OSS project, find a small fix, submit a PR.
The deal¶
- It's slow on purpose. One concept per page.
- Python fluency assumed. Read a stack trace, write a function, walk a list.
- No math PhD required. Linear algebra at the "dot product and matmul" level. We explain everything else inline.
- GPU is helpful but not mandatory. Pages 01-08 work on CPU. Page 09+ benefits from GPU; Google Colab's free tier suffices.
- You will be confused. Often. AI has more vocabulary than any other technical area on this site. Don't panic.
A note on hype vs honesty¶
The AI field has more hype than any other in software. To stay sane:
- Models are token predictors. They are not "intelligent" in the way the marketing implies. They are very good at pattern completion over enormous corpora. That's an extraordinary thing - and that's all it is.
- Most "AI products" are wrappers around APIs. The actual engineering: tokenization, retrieval, prompt design, evaluation. The "model" itself is often someone else's pre-trained checkpoint.
- Evaluation is the hard part. "Looks good" is not evaluation. We'll do this properly in page 11.
This path treats AI as a practical engineering domain - what works, how it's built, how to ship it. We don't speculate about AGI.
What you need¶
- A computer (any OS).
- Python ≥3.10 (set up in Python From Scratch path).
- A text editor.
- ~5 hours/week. Path is sized for 4-6 months.
- A GPU for pages 09+ (or use Google Colab / Kaggle for free).
What you do NOT need¶
- A PhD or MS.
- A formal math background beyond high school algebra + intuitive linear algebra (we cover what you need).
- A cloud account or paid API. Open-source models run locally; we use them.
- C++ / CUDA. Those are senior-path material (AI Systems senior reference).
How long this realistically takes¶
4-6 months at 5 hours/week to "submit a PR."
The slowest pages are 07 (transformers) and 09 (fine-tuning). Plan for one or two re-reads at each.
What success looks like¶
You'll be able to:
- Look at a model.py in any HF model and roughly understand what it does.
- Build a small project end-to-end: load data, train, evaluate, serve.
- Read a research paper's abstract + introduction + experiments section and predict what their code does.
- Submit a fix to a real AI OSS project.
You will not be able to:
- Train a frontier LLM. (Multi-million-dollar GPU farms; not in 6 months.)
- Tell people you're "an ML engineer." (Years of work past this.)
- Pass a FAANG ML interview. (Different focus - leetcode plus theory.)
What you'll have: the foundation to keep going. The AI Expert Roadmap is the natural follow-up - 12 months of structured study from here.
One last thing before we start¶
If a page feels too dense - stop, re-read. Still dense? Skip, come back.
The AI field uses jargon shamelessly. When a word appears you haven't seen, this path defines it inline. If a word slips through without definition, that's a bug - note it.
Ready? Next: Setup →
01 - Setup¶
What this session is¶
About 45 minutes. Install PyTorch (with GPU support if you have one), confirm it works, run a tiny tensor program.
Step 1: Pick your Python environment¶
You should have Python ≥3.10 from the Python beginner path. Create a fresh virtual environment for this path:
mkdir -p ~/code/ai-learning
cd ~/code/ai-learning
python3 -m venv .venv
source .venv/bin/activate # macOS / Linux
.venv\Scripts\activate # Windows
You'll work inside this .venv for the whole path. Always activate it before working.
Step 2: Install PyTorch¶
Go to pytorch.org/get-started/locally. The site has a config-builder that gives you the exact pip install command for your OS / Python / CUDA.
Typical commands:
CPU only (any platform):
Linux + NVIDIA GPU (CUDA 12):
macOS (Apple Silicon - uses MPS, Apple's Metal-based backend):
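For reference, the commands for those three cases typically look like the following. Treat them as representative, not exact - in particular the cu12x index URL tracks your CUDA version, so prefer the command the pytorch.org selector gives you:

```shell
# CPU only (any platform)
pip install torch

# Linux + NVIDIA GPU (CUDA 12.x) - index URL from the pytorch.org selector
pip install torch --index-url https://download.pytorch.org/whl/cu121

# macOS (Apple Silicon) - the default wheel includes the MPS backend
pip install torch
```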
PyTorch on macOS automatically uses MPS when available. It works well for small models, though it's not as fast as CUDA. Verify the install:
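A minimal verification snippet:

```python
import torch

# Quick install check: version string and CUDA availability
print(torch.__version__)          # e.g. "2.4.0"
print(torch.cuda.is_available())  # True only if a CUDA build found a GPU
```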
Should print a version (e.g., 2.4.0) and True or False depending on whether CUDA is available.
If you have an NVIDIA GPU and cuda.is_available() is False: the install picked the CPU-only wheel. Reinstall with the CUDA URL.
If you have Apple Silicon:
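The check, using PyTorch's MPS backend flag:

```python
import torch

# Apple Silicon: check that the MPS (Metal) backend is usable
print(torch.backends.mps.is_available())
```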
Should be True.
Step 3: Install supporting libraries¶
The rest of the path uses these:
pip install jupyter ipykernel numpy pandas matplotlib
pip install transformers datasets accelerate
pip install sentence-transformers faiss-cpu # for RAG (page 10)
pip install scikit-learn # for utilities
pip install httpx # for serving (page 12)
That's a lot at once. Each tool has its own purpose:
- transformers - Hugging Face's library; loading and using pre-trained models.
- datasets - Hugging Face's data-loading library.
- accelerate - multi-GPU + mixed-precision helper.
- sentence-transformers + faiss - embeddings + vector search for RAG.
- scikit-learn - classical ML utilities (data splits, metrics).
If any installation fails, read the error. Often it's a missing system dependency (e.g., libstdc++); search the error message for guidance.
Step 4: First PyTorch program¶
Create tensor_hello.py:
import torch
# Create a tensor - PyTorch's central data type
x = torch.tensor([1.0, 2.0, 3.0])
print("tensor:", x)
print("shape:", x.shape)
print("dtype:", x.dtype)
# Some math
y = torch.tensor([4.0, 5.0, 6.0])
print("x + y:", x + y)
print("x · y:", torch.dot(x, y))
# A 2D tensor (matrix)
mat = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print("matrix:")
print(mat)
print("matrix shape:", mat.shape)
# Move to GPU if available
if torch.cuda.is_available():
    x_gpu = x.cuda()
    print("x on GPU:", x_gpu.device)
elif torch.backends.mps.is_available():
    x_mps = x.to("mps")
    print("x on MPS:", x_mps.device)
else:
    print("running on CPU")
Run:
python tensor_hello.py
You should see output like:
tensor: tensor([1., 2., 3.])
shape: torch.Size([3])
dtype: torch.float32
x + y: tensor([5., 7., 9.])
x · y: tensor(32.)
matrix:
tensor([[1., 2.],
[3., 4.]])
matrix shape: torch.Size([2, 2])
running on CPU
(32 because 1·4 + 2·5 + 3·6 = 32. We'll cover dot products properly in page 03.)
Step 5: Jupyter notebooks (optional but valuable)¶
ML work happens in notebooks more than scripts. They mix code, output, plots, and prose in one document.
Run jupyter notebook. It opens a browser. Click "New" → "Python 3". You get a cell-based editor. Each cell runs independently; output appears below it. Great for exploration.
Many tutorials are notebook files (.ipynb). You'll meet them. VS Code also has a built-in notebook UI.
Step 6: Google Colab as a backup¶
For pages 09+ you'll want a GPU. If you don't have one:
colab.research.google.com gives you a free Jupyter notebook with a GPU (~T4-class) for a few hours per session.
For more compute: Kaggle Notebooks (free), Lambda Labs (paid by the minute), or any cloud provider.
We'll note when a page benefits from GPU.
A note on the AI ecosystem's pace¶
AI libraries change fast. PyTorch APIs are stable across minor versions but the broader ecosystem (Hugging Face, vLLM, training optimizers) iterates monthly. When in doubt, read the library's current docs, not blog posts from 2023.
Exercise¶
- Verify the PyTorch installation (re-run the check from Step 2).
- Run tensor_hello.py above. Confirm the output.
- Modify it:
  - Create a tensor z = torch.arange(10). Print it. (Range from 0 to 9.)
  - Create a random 3×3 matrix with torch.randn(3, 3). Print it.
  - Compute the matrix's sum (m.sum()) and mean (m.mean()).
- (Optional) Set up Jupyter: create a new notebook and run the tensor code in cells.
What you might wonder¶
"Why PyTorch and not TensorFlow / JAX?" PyTorch is the dominant ML framework as of 2026 - research uses it, most OSS uses it, the job market wants it. JAX has its place (Google ecosystem, transformer research); TensorFlow is mostly legacy. Stick with PyTorch.
"What is CUDA?"
NVIDIA's parallel computing platform - the way GPUs run general-purpose code. PyTorch built with CUDA support uses your GPU automatically when you .cuda() tensors. AMD GPUs use ROCm; Apple Silicon uses MPS.
"My GPU isn't supported / I don't have one." Use CPU for pages 01-08. They run fine. For pages 09+, use Google Colab's free tier.
"How much disk space will this take?" ~3-5 GB for PyTorch and core deps. Pre-trained models you download in later pages can add 1-10 GB each. Plan for 30-50 GB free if you'll experiment broadly.
Done¶
You have:
- Python venv with PyTorch installed (CPU or GPU).
- Supporting libraries (transformers, datasets, sentence-transformers, faiss).
- Verified PyTorch can create and operate on tensors.
- (Optional) Jupyter set up; Colab as backup.
02 - Tensors¶
What this session is¶
About 45 minutes. Tensors are PyTorch's central data type - multi-dimensional arrays, with support for GPU acceleration and automatic differentiation. Almost every line of PyTorch code touches tensors.
What a tensor is¶
A tensor is a generalized array:
- A 0-dimensional tensor is a single number (scalar): 7.
- A 1-D tensor is a vector: [1, 2, 3].
- A 2-D tensor is a matrix: [[1, 2], [3, 4]].
- A 3-D tensor is a cube of numbers. (Often used for color images: [height, width, channels].)
- Higher dimensions: a batch of images, a batch of token sequences, etc.
Every tensor has a shape (its size per dimension) and a dtype (the type of each element).
Creating tensors¶
import torch
# From a Python list
a = torch.tensor([1, 2, 3])
print(a.shape, a.dtype) # torch.Size([3]) torch.int64
# As floats
b = torch.tensor([1.0, 2.0, 3.0])
print(b.shape, b.dtype) # torch.Size([3]) torch.float32
# Zeros, ones, random
z = torch.zeros(2, 3) # 2x3 of zeros
o = torch.ones(2, 3)
r = torch.randn(2, 3) # random normal (mean=0, std=1)
u = torch.rand(2, 3) # random uniform [0, 1)
i = torch.arange(0, 10) # 0, 1, 2, ..., 9
# An identity matrix
I = torch.eye(4)
Default float dtype is float32. Default int dtype is int64. You can specify:
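For example, passing dtype= at construction time overrides the default:

```python
import torch

f64 = torch.tensor([1, 2, 3], dtype=torch.float64)  # ints forced to doubles
print(f64.dtype)   # torch.float64

h = torch.zeros(2, 3, dtype=torch.float16)          # half precision
print(h.dtype)     # torch.float16
```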
Shape and reshape¶
a = torch.arange(12)
print(a.shape) # torch.Size([12])
b = a.reshape(3, 4) # 3x4 matrix
print(b.shape) # torch.Size([3, 4])
c = a.reshape(2, 2, 3) # 2x2x3 tensor
print(c.shape) # torch.Size([2, 2, 3])
reshape(...) doesn't copy data when it can avoid it - it just changes the "view" on the underlying buffer.
The -1 placeholder means "infer this dimension":
a = torch.arange(12)
b = a.reshape(-1, 4) # 3 rows of 4 (12/4)
c = a.reshape(2, -1) # 2 rows of 6 (12/2)
Useful in functions where you know all but one dimension.
Indexing¶
Like NumPy, like Python lists, but extended:
m = torch.arange(12).reshape(3, 4)
# m is:
# tensor([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
m[0] # first row: tensor([0, 1, 2, 3])
m[0, 0] # first element: tensor(0)
m[:, 0] # first column: tensor([0, 4, 8])
m[1:, 2:] # rows 1+, cols 2+: tensor([[6, 7], [10, 11]])
m[0:2, 0:2] # 2x2 top-left
Slicing returns a view (shares storage). Modifying the slice modifies the original. Use .clone() if you need an independent copy.
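A quick demonstration of the sharing, and the .clone() escape hatch:

```python
import torch

m = torch.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
row = m[0]            # a view: shares storage with m
row[0] = 99
print(m[0, 0])        # tensor(99) - the original changed too

safe = m[1].clone()   # independent copy
safe[0] = -1
print(m[1, 0])        # tensor(3) - original untouched
```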
Arithmetic¶
Element-wise:
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b) # [5, 7, 9]
print(a * b) # [4, 10, 18] element-wise multiply
print(a ** 2) # [1, 4, 9]
print(torch.exp(a)) # [e^1, e^2, e^3]
print(torch.sin(a))
Reductions (collapse a dimension):
m = torch.randn(3, 4)
m.sum() # scalar
m.sum(dim=0) # column sums (4 values)
m.sum(dim=1) # row sums (3 values)
m.mean()
m.max()
m.argmax() # index of maximum
dim= is the dimension to reduce over. dim=0 collapses the rows; dim=1 collapses the columns. Confusing the first time; you'll internalize it.
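A concrete example makes the direction easier to see:

```python
import torch

m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
print(m.sum(dim=0))   # tensor([5., 7., 9.]) - rows collapsed: one sum per column
print(m.sum(dim=1))   # tensor([ 6., 15.])   - columns collapsed: one sum per row
```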
Matrix multiplication¶
The most-used operation in ML. It is not the same as element-wise *.
A = torch.randn(2, 3) # 2x3
B = torch.randn(3, 4) # 3x4
C = A @ B # 2x4 - matrix multiply
# or: torch.matmul(A, B)
The @ operator is matrix multiply. Two requirements:
- Inner dimensions match: (2, 3) @ (3, 4) works because both have 3 in the middle.
- Result is the outer dimensions: (2, 3) @ (3, 4) → (2, 4).
If they don't match, you get an error. Get the dimensions right first; everything else follows.
Broadcasting¶
When you operate on tensors of different shapes, PyTorch tries to make them match by broadcasting the smaller one along the matching dimensions:
a = torch.tensor([[1, 2, 3], [4, 5, 6]]) # shape (2, 3)
b = torch.tensor([10, 20, 30]) # shape (3,)
print(a + b)
# tensor([[11, 22, 33],
# [14, 25, 36]])
b was broadcast across the rows of a. Equivalent to adding [10, 20, 30] to each row.
The rules are precise but the intuition is "align from the right; missing dimensions are filled in by repeating":
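Two common cases of that right-alignment rule:

```python
import torch

a = torch.ones(2, 3)
b = torch.tensor([10., 20., 30.])   # shape (3,) aligns with a's last dimension
print((a + b).shape)                # torch.Size([2, 3])

c = torch.tensor([[100.], [200.]])  # shape (2, 1): the size-1 dim is repeated
print((a + c).shape)                # torch.Size([2, 3])
```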
When in doubt, print shapes. Most "shape mismatch" errors come from this; once you see the shapes, the fix is usually obvious.
Move tensors to GPU¶
device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
print("using:", device)
a = torch.randn(1000, 1000).to(device)
b = torch.randn(1000, 1000).to(device)
c = a @ b # runs on GPU if device is cuda/mps
Tensors and operations have to be on the same device. Mixing CPU and GPU tensors raises an error.
Common idiom: define device once at the top of the script; .to(device) every tensor you create.
NumPy interop¶
PyTorch tensors and NumPy arrays interoperate:
import numpy as np
n = np.array([1, 2, 3])
t = torch.from_numpy(n) # tensor sharing memory with the array
back = t.numpy() # numpy array sharing memory with the tensor
If the tensor is on CPU, this is free (no copy). On GPU, you have to .cpu() first.
NumPy is the older sibling - PyTorch borrows most of its API conventions from NumPy. If you've used NumPy, PyTorch tensors will feel familiar.
Exercise¶
In a new script tensor_practice.py:
- Create a 5×3 tensor of random normal values. Print its shape and mean.
- Create the same tensor and add 1.0 to every element. (Hint: just tensor + 1.)
- Create a 3×3 identity matrix; create another 3×3 matrix with torch.arange(9).reshape(3, 3).float(). Multiply them with @. What's the result?
- Create a = torch.arange(20).reshape(4, 5). Get the third row. Get the second column. Get the bottom-right 2×2 submatrix.
- Broadcasting: create a = torch.zeros(3, 4) and b = torch.tensor([1, 2, 3, 4]). Compute a + b. What shape? What values?
- GPU (if available): create two 1000×1000 random matrices. Time how long a @ b takes on CPU vs your device. Use time.time() around the multiplications.
What you might wonder¶
"Why are tensors not just NumPy arrays?" PyTorch tensors add: GPU support, automatic differentiation (page 04), automatic device placement, and a richer API for ML-specific operations. They're NumPy++.
"What's float32 vs float16 vs bfloat16?"
Number formats with different precision/memory trade-offs. float32 (FP32) is the default - 4 bytes per number, lots of precision. float16 and bfloat16 are half-precision (2 bytes); used heavily in training large models for memory savings. Modern GPUs (Volta+) have tensor cores that specifically accelerate these.
"Why both reshape and view?"
view requires the data to be contiguous in memory. reshape may copy if needed. Prefer reshape; reach for view only when you've measured it matters.
"My tensors are on different devices and I'm confused."
Set a device = ... constant at the top of your script. Always .to(device) after creation. This rule alone eliminates 80% of device-mismatch bugs.
Done¶
- Create tensors with various constructors.
- Reshape, index, slice.
- Use element-wise arithmetic, matrix multiplication.
- Use broadcasting confidently.
- Move tensors between CPU and GPU.
Next: Linear algebra you actually need →
03 - Linear Algebra You Actually Need¶
What this session is¶
About 30 minutes. The math for neural networks at the intuitive level. No proofs. By the end you'll know what a dot product, matrix multiply, and gradient are - and what they mean in ML code.
Dot product¶
The dot product of two vectors a and b of the same length is a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ - a single number. It measures how aligned the vectors are: large when they point the same way; zero when perpendicular; negative when opposite.
import torch
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(torch.dot(a, b)) # 32.0
# (1*4 + 2*5 + 3*6 = 32)
Why it matters: the simplest neuron computes a dot product between its inputs and its weights, adds a bias, applies a nonlinearity. Every neural network is built up from this.
Matrix multiplication¶
Treat a matrix as a stack of row vectors (or column vectors). Matrix multiplication A @ B:
- The entry at row i, column j of A @ B is the dot product of row i of A and column j of B.
Shape rule: (m, k) @ (k, n) = (m, n). The inner dimensions match; the outer dimensions become the result's shape.
A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
B = torch.tensor([[5.0, 6.0],
                  [7.0, 8.0]])
print(A @ B)
# tensor([[19., 22.],
# [43., 50.]])
# 1*5 + 2*7 = 19, 1*6 + 2*8 = 22, etc.
Why it matters: an entire neural network layer is output = input @ weights + bias. Matmul is what GPUs are designed to accelerate; everything else is supporting infrastructure.
Transpose¶
Swap rows and columns:
A = torch.tensor([[1, 2, 3], [4, 5, 6]]) # shape (2, 3)
print(A.T) # shape (3, 2)
# tensor([[1, 4],
# [2, 5],
# [3, 6]])
Often used to make shapes line up for matrix multiplication.
A neuron¶
A single artificial neuron computes:

output = activation(dot(input, weights) + bias)

- input and weights are vectors of the same length.
- bias is a single number.
- activation is a nonlinear function (relu, sigmoid, tanh, etc.).
A layer of n neurons is just n of these stacked - equivalent to one big matmul:
batch_size = 4
input_dim = 10
output_dim = 5
x = torch.randn(batch_size, input_dim) # (4, 10)
W = torch.randn(input_dim, output_dim) # (10, 5)
b = torch.randn(output_dim) # (5,)
out = x @ W + b # (4, 5)
x @ W is (4, 10) @ (10, 5) → (4, 5). The bias b broadcasts across the batch.
Welcome - that's what a dense layer (also called a "linear layer" or "fully-connected layer") does. Everything else is variations.
Nonlinearity¶
Without a nonlinearity between layers, stacking matmuls collapses to one matmul (matrix multiplication is linear). A non-linear function applied element-wise restores the network's power:
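You can watch the collapse happen - without an activation, two stacked matmuls equal one (a small sketch with random weights):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 10)
W1 = torch.randn(10, 8)
W2 = torch.randn(8, 5)

linear_stack = (x @ W1) @ W2   # two "layers", no activation...
collapsed = x @ (W1 @ W2)      # ...equal a single matmul
print(torch.allclose(linear_stack, collapsed, atol=1e-4, rtol=1e-4))  # True

nonlinear = F.relu(x @ W1) @ W2  # ReLU in between: no longer collapsible
```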
Common nonlinearities:
- ReLU - max(0, x). Cheap, effective, the default for hidden layers.
- GELU - smoother ReLU. Used heavily in transformers.
- Sigmoid - 1 / (1 + exp(-x)). Outputs in (0, 1). Used for binary classification outputs.
- Softmax - normalizes a vector to sum to 1. Used for multi-class classification outputs.
You'll mostly use ReLU or GELU in hidden layers; softmax in output.
Gradient (intuitively)¶
A gradient is "the slope of a function at a point, in N dimensions." For a single-variable function f(x), the gradient is the derivative f'(x). For a multi-variable function L(w₁, w₂, ..., wₙ), it's a vector - one partial derivative per variable.
Why it matters: training a network is "minimize the loss function." The gradient of the loss with respect to the weights tells you "if I nudge each weight in the direction opposite the gradient, the loss decreases." That's gradient descent.
You don't compute gradients by hand. PyTorch's autograd does it for you - every operation you do on tensors with requires_grad=True is tracked, and .backward() walks the graph computing gradients automatically.
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward() # computes dy/dx
print(x.grad) # 2*x + 3 = 7
Page 05 uses this in a training loop. For now, just know: gradients let you adjust weights to reduce loss.
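To preview why this matters, here is gradient descent on a one-parameter toy "loss" (a hypothetical example, not the page-05 loop):

```python
import torch

w = torch.tensor([0.0], requires_grad=True)
for _ in range(50):
    loss = (w - 3.0) ** 2     # minimized at w = 3
    loss.backward()           # compute d(loss)/dw
    with torch.no_grad():
        w -= 0.1 * w.grad     # step opposite the gradient
        w.grad.zero_()        # reset for the next iteration
print(w.item())               # close to 3.0
```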
Vectors in geometry vs in ML¶
In math classes, vectors had geometric meaning - points, directions, magnitudes. In ML, a vector is just a list of features. A user's embedding might be 1536 numbers - no geometric interpretation, but the "directions" still capture meaningful similarities (cosine of the angle between two user embeddings = how similar they are in the model's learned space).
The math is the same - the interpretation is "feature-space similarity," not "physical space."
Cosine similarity¶
The dot product of two normalized vectors (each with length 1):
import torch.nn.functional as F
a = torch.randn(100)
b = torch.randn(100)
sim = F.cosine_similarity(a, b, dim=0)
print(sim) # between -1 and 1
A standard "how similar are these two embeddings" metric. Used heavily in RAG (page 10).
What you'll never need from a math course¶
- Eigenvalues / eigenvectors (occasionally relevant; not for daily work).
- Singular Value Decomposition (used in LoRA fine-tuning page 09; we'll cover what you need).
- Convex analysis. Calculus of variations. Differential geometry.
Don't get nerve-sniped by Twitter saying you need to "understand linear algebra before doing ML." You need the operations on this page. The rest is for research, not engineering.
Exercise¶
- Dot product: create two random vectors of length 100. Compute their dot product manually (loop with sum) AND with torch.dot. Verify they match.
- Matmul shape check: create A of shape (3, 5) and B of shape (5, 7). What's the shape of A @ B? Verify in code.
- A neuron from scratch: compute output = torch.relu(torch.dot(input, weights) + bias) for small vectors of your choice. What's the value? Why?
- Batch: create X of shape (8, 3) (a batch of 8 inputs, each 3-dim). Create W of shape (3, 5). Compute X @ W and inspect the shape. What does each row represent?
- Gradient: define f(x) = x³ - 4x² + 7x - 1. At x = 2.0, compute the gradient using PyTorch. (Hint: compute y = f(x) with requires_grad=True on x, call y.backward(), then read x.grad. The math answer is 3x² - 8x + 7 = 3 at x = 2.)
What you might wonder¶
"I see lots of torch.bmm in code. What's that?"
Batched matrix multiplication - when you have a batch dimension. bmm is (B, m, k) @ (B, k, n) → (B, m, n). Common in transformers' attention.
"What's torch.einsum?"
Einstein summation notation - a powerful, terse way to express tensor operations. torch.einsum("ij,jk->ik", A, B) is matmul. Worth learning once you've seen the same matmul pattern enough times.
"How does a network 'know' which way to adjust weights?" The gradient gives the direction of steepest increase. Going opposite the gradient (gradient descent) decreases the loss locally. That's all. The magic is that this simple rule works in millions of dimensions.
Done¶
- Dot product, matrix multiplication, transpose.
- A single neuron and a dense layer.
- Common nonlinearities.
- What a gradient is and why it matters.
- Recognizing what's NOT essential math for ML engineering.
Next: Your first neural network →
04 - Your First Neural Network¶
What this session is¶
About 45 minutes. Build a small multi-layer perceptron (MLP) in PyTorch - the simplest interesting neural network. You'll see how nn.Module, layers, forward pass, and autograd compose.
The plan¶
We'll build a network that classifies handwritten digits (MNIST - the "hello world" of ML). 28×28 grayscale images → one of 10 digit classes.
We won't train it yet (page 05 covers training). This page is about defining the model.
nn.Module: PyTorch's central abstraction¶
Every model in PyTorch is a class extending torch.nn.Module. Define your layers in __init__; define how data flows through them in forward.
import torch
import torch.nn as nn
import torch.nn.functional as F
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)  # input layer: 784 → 128
        self.fc2 = nn.Linear(128, 64)   # hidden: 128 → 64
        self.fc3 = nn.Linear(64, 10)    # output: 64 → 10

    def forward(self, x):
        # x is shape (batch_size, 784)
        x = F.relu(self.fc1(x))  # → (batch, 128)
        x = F.relu(self.fc2(x))  # → (batch, 64)
        x = self.fc3(x)          # → (batch, 10) - raw logits
        return x
model = MLP()
print(model)
Run this. You'll see:
MLP(
(fc1): Linear(in_features=784, out_features=128, bias=True)
(fc2): Linear(in_features=128, out_features=64, bias=True)
(fc3): Linear(in_features=64, out_features=10, bias=True)
)
PyTorch auto-prints the architecture.
What each line does¶
nn.Linear(in_features, out_features) - a fully-connected layer. Internally: a weight matrix (stored with shape (out_features, in_features)) and a bias vector. Calling it with input x computes x @ W.T + b - a linear map from in_features to out_features.
F.relu(x) - element-wise ReLU activation. max(0, x). Adds non-linearity (page 03).
The last layer produces logits - raw scores, one per class. We don't apply softmax here; the loss function (page 05) does it more numerically stably.
Run a forward pass¶
x = torch.randn(4, 784) # batch of 4 random "images"
out = model(x)
print(out.shape) # torch.Size([4, 10])
print(out[0]) # 10 logits for the first sample
model(x) calls forward(x) under the hood. The output is (batch_size, num_classes) - 10 logits per input.
To turn logits into probabilities:
To get the predicted class (argmax over the classes):
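Both steps, as a minimal sketch (random logits stand in for the model output):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)        # stand-in for model(x) output
probs = F.softmax(logits, dim=1)   # each row now sums to 1
preds = probs.argmax(dim=1)        # predicted class index per sample
print(probs.sum(dim=1))            # four values, each ~1.0
print(preds.shape)                 # torch.Size([4])
```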
The model is randomly initialized - predictions are garbage. Training (page 05) fixes that.
Counting parameters¶
.parameters() yields all the learnable tensors (weights + biases). For this MLP:
- fc1: 784 × 128 weights + 128 biases = 100,480.
- fc2: 128 × 64 + 64 = 8,256.
- fc3: 64 × 10 + 10 = 650.
- Total: 109,386.
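You can verify that arithmetic with .numel():

```python
import torch.nn as nn

# Same layer sizes as the MLP above
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
total = sum(p.numel() for p in model.parameters())
print(total)  # 109386
```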
For comparison: GPT-2 (small) is 124M params. GPT-3 is 175B. Modern open-weight LLMs run up to 70B and beyond (e.g., Llama 3 70B). Parameter count is one rough proxy for model capability (and one direct proxy for memory cost).
A more compact form: nn.Sequential¶
For simple feed-forward stacks:
model = nn.Sequential(
nn.Linear(784, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 10),
)
Layers in order. Use Sequential for "just call these in sequence" cases. Use the class form when forward needs anything more than that (skip connections, conditional flow).
Activations as modules vs functions¶
Two ways to apply ReLU:
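Side by side, the two forms produce identical results:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(5)
as_module = nn.ReLU()(x)   # module form: an object you call
as_func = F.relu(x)        # functional form: a plain function
print(torch.equal(as_module, as_func))  # True
```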
Both work. Functions are cleaner inside forward. Modules are required inside Sequential. Use whichever fits.
Initialization¶
PyTorch initializes weights with sensible defaults (Kaiming uniform for linear layers). For most cases this is fine. To override:
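A sketch of overriding via torch.nn.init - Xavier here is just an example choice, not a recommendation:

```python
import torch.nn as nn

layer = nn.Linear(784, 128)
nn.init.xavier_uniform_(layer.weight)  # replace the Kaiming default, in place
nn.init.zeros_(layer.bias)             # zero the biases
```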
Initialization matters for very deep networks; modern architectures (with normalization layers) are robust enough that the default usually works.
Move to GPU¶
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
x = torch.randn(4, 784).to(device)
out = model(x)
.to(device) moves all parameters. After that, your input must also be on the same device.
Save and load¶
# Save just the parameters
torch.save(model.state_dict(), "mlp.pt")
# Load into the same architecture
model2 = MLP()
model2.load_state_dict(torch.load("mlp.pt"))
model2.eval()
state_dict() returns a dict of {name: tensor} for every parameter. Saving the state dict (not the whole model object) is the recommended pattern - portable across code changes.
.eval() switches the model into evaluation mode (matters for layers like dropout and batch norm that behave differently during training vs inference).
Common architectures (very brief preview)¶
The MLP is the simplest. You'll meet others:
- CNN (Convolutional Neural Network) - for images. Layers detect local patterns (edges, textures) at multiple scales.
- RNN / LSTM - for sequences (older approach). Largely replaced by transformers.
- Transformer - attention-based, the modern default for language and increasingly vision. Page 07 covers it.
For this beginner path: MLPs for the early pages; transformers for the LLM pages.
Exercise¶
Create mlp_demo.py:
import torch
import torch.nn as nn
import torch.nn.functional as F
class MLP(nn.Module):
    def __init__(self, input_dim=784, hidden=128, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("device:", device)
    model = MLP().to(device)
    print(model)
    total = sum(p.numel() for p in model.parameters())
    print(f"params: {total:,}")
    # Forward pass on a fake batch
    x = torch.randn(8, 784, device=device)
    logits = model(x)
    print("logits shape:", logits.shape)
    probs = F.softmax(logits, dim=1)
    print("sample probs:", probs[0])
    print("sample sum (should be ~1):", probs[0].sum().item())

if __name__ == "__main__":
    main()
Run it. Verify everything makes sense.
Stretch: rewrite the model as nn.Sequential. Same output, fewer lines.
Bigger stretch: add a dropout layer (nn.Dropout(p=0.2)) between the hidden layers. Dropout randomly zeros some activations during training; an effective regularizer. Print the model to see the new architecture.
What you might wonder¶
"What's super().__init__() for?"
Calls nn.Module's constructor - sets up internal bookkeeping so .parameters() and .to() work on your model. Always call it first in __init__.
"Why F.relu vs nn.ReLU?"
Same thing. Functional form is shorter inside forward; module form composes with Sequential and is registered as a child module (so it shows in print(model)).
"What's a 'logit'?" A pre-softmax output. The raw score the model assigns to each class. The class with the largest logit is the prediction. Logits aren't probabilities; softmax converts them.
"Should I worry about backpropagation math?" No - PyTorch's autograd handles it. You define the forward pass; gradients are computed automatically. Page 05 shows the loop.
Done¶
- Define a model by extending nn.Module.
- Use nn.Linear, F.relu, nn.Sequential.
- Run a forward pass.
- Count parameters.
- Save and load state_dict.
- Move models to GPU.
05 - Training Loop¶
What this session is¶
About an hour. Train the MLP from page 04 to recognize MNIST digits. By the end you'll have written a full training loop - the same shape as every PyTorch training loop in existence.
The pattern¶
Every training loop is:
For each epoch (pass over the data):
For each batch:
1. Forward pass - compute predictions
2. Compute loss - how wrong are we?
3. Backward pass - compute gradients
4. Optimizer step - adjust weights
That's it. The rest is bookkeeping.
Load MNIST¶
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
transform = transforms.Compose([
transforms.ToTensor(), # PIL image → tensor
transforms.Normalize((0.1307,), (0.3081,)), # standardize: (x - mean) / std
])
train_ds = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root="./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=512)
Three pieces:
- Dataset - knows how to load and transform one example.
- DataLoader - wraps the dataset, batches it, optionally shuffles.
- Transform - preprocessing applied to each example.
torchvision provides MNIST out of the box. First run downloads ~10MB; subsequent runs use the cached copy.
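The Dataset contract is just two methods: __len__ and __getitem__. A minimal hand-rolled dataset over fake tensors (a sketch to make the contract concrete - not how torchvision implements MNIST):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FakeDigits(Dataset):
    """Toy stand-in for MNIST: 100 random 'images' with random labels."""
    def __init__(self, n=100):
        self.images = torch.randn(n, 1, 28, 28)
        self.labels = torch.randint(0, 10, (n,))

    def __len__(self):
        return len(self.labels)                    # how many examples exist

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]  # one (x, y) pair

ds = FakeDigits()
loader = DataLoader(ds, batch_size=32, shuffle=True)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([32, 1, 28, 28]) torch.Size([32])
```

DataLoader does the batching and shuffling; your Dataset only ever returns one example at a time.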
The full training script¶
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# Reproducibility
torch.manual_seed(42)
# Device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"device: {device}")
# Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,)),
])
train_ds = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root="./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=512)
# Model
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(28 * 28, 128)
self.fc2 = nn.Linear(128, 64)
self.fc3 = nn.Linear(64, 10)
def forward(self, x):
x = x.view(x.size(0), -1) # flatten 28x28 → 784
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x)
model = MLP().to(device)
# Loss + optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Train
for epoch in range(3):
model.train()
total_loss = 0
correct = 0
n = 0
for x, y in train_loader:
x, y = x.to(device), y.to(device)
# 1. Forward
logits = model(x)
# 2. Loss
loss = criterion(logits, y)
# 3. Backward
optimizer.zero_grad() # clear gradients from last step
loss.backward() # compute gradients
# 4. Optimizer step
optimizer.step()
total_loss += loss.item() * x.size(0)
correct += (logits.argmax(dim=1) == y).sum().item()
n += x.size(0)
print(f"epoch {epoch}: train loss {total_loss/n:.4f}, acc {correct/n:.4f}")
# Test
model.eval()
correct = 0
n = 0
with torch.no_grad():
for x, y in test_loader:
x, y = x.to(device), y.to(device)
logits = model(x)
correct += (logits.argmax(dim=1) == y).sum().item()
n += x.size(0)
print(f"test accuracy: {correct/n:.4f}")
Run. After ~30 seconds (on CPU) or ~5 seconds (on GPU), you should see ~97% test accuracy. Your first trained model.
What each line is doing¶
x.view(x.size(0), -1) - flatten the 28x28 images into 784-length vectors. The -1 infers the dimension. x.size(0) is the batch dimension.
nn.CrossEntropyLoss - standard loss for classification. Internally: softmax + negative log-likelihood. Stable and standard.
optimizer = torch.optim.Adam(...) - the optimizer. Adam is the most-used optimizer for modern ML. Other options: SGD (stochastic gradient descent - classic but needs more tuning), AdamW (Adam with corrected weight decay).
optimizer.zero_grad() - clear gradients from the last batch. PyTorch accumulates gradients by default; you must clear them explicitly each step. Forget this and gradients from previous batches pile up, corrupting every update.
loss.backward() - autograd walks the computation graph backward, computing gradients for every parameter that participated in computing loss. Stores them in parameter.grad.
optimizer.step() - applies the gradient update. For Adam: complex math; for SGD: param = param - lr * param.grad.
model.train() vs model.eval() - switches the model's internal mode. Affects layers like dropout and batch norm that behave differently during training vs inference. Always set the right mode.
with torch.no_grad(): - disables autograd. Inference doesn't need gradients; this skips the bookkeeping and uses less memory.
What the loss tells you¶
Training loss going down = model is fitting the training data. Plateauing means we've hit the model's capacity (or the optimizer is stuck - try different hyperparams).
Test accuracy is what you actually care about - performance on data the model didn't see during training. If train loss keeps dropping but test accuracy stops improving, you're overfitting - memorizing the training data.
Mitigations: more data, regularization (dropout, weight decay), smaller model, early stopping.
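Early stopping is simple enough to sketch: track the best validation loss and stop once it hasn't improved for patience epochs. A toy illustration with made-up per-epoch losses (not part of the training script above):

```python
def stop_epoch(val_losses, patience=2):
    """Return the epoch at which early stopping would halt training."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0       # improvement: reset the counter
        else:
            bad_epochs += 1      # no improvement this epoch
            if bad_epochs >= patience:
                return epoch     # patience exhausted: stop here
    return len(val_losses) - 1   # never triggered

# Validation loss improves, then creeps back up (overfitting begins):
losses = [0.9, 0.5, 0.4, 0.41, 0.45, 0.5]
print(stop_epoch(losses))  # 4
```

In a real loop you'd also save a checkpoint each time the best loss improves, then restore it after stopping.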
Hyperparameters¶
The numbers you set that aren't learned:
- Learning rate (lr=1e-3) - how big a step the optimizer takes. Too high → unstable. Too low → slow. 1e-3 is a great starting point for Adam.
- Batch size (64) - larger = smoother gradients, more memory; smaller = noisier gradients, sometimes generalizes better.
- Epochs - how many passes over the data. More = better fit (until overfitting).
- Architecture - depth, width, normalizations.
These need tuning. For MNIST, defaults work. For real problems, expect to iterate.
Save the model¶
torch.save(model.state_dict(), "mnist_mlp.pt")
Load later:
model = MLP()
model.load_state_dict(torch.load("mnist_mlp.pt"))
model.eval()
What this scales to¶
The training loop pattern above is identical for huge models - the only thing that changes is the model definition. Add a few wrinkles for big-model training (mixed precision, gradient accumulation, distributed, checkpointing) and you have what a real LLM training script looks like.
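One of those wrinkles, gradient accumulation, is worth seeing once: run several small batches per optimizer step, so the effective batch size grows without the memory cost. A sketch on a toy model with random data (the model and numbers are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                  # stand-in for a big model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
accum_steps = 4                           # effective batch = 4 x 8 = 32

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(8, 10)                # a small batch that fits in memory
    y = torch.randint(0, 2, (8,))
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                       # gradients ACCUMULATE in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                  # one update per accum_steps batches
        optimizer.zero_grad()
print("final micro-batch loss:", loss.item())
```

This is the same "PyTorch accumulates gradients" behavior that zero_grad() normally fights - here it's used on purpose.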
Exercise¶
1. Run the script. Get ~97% test accuracy.
2. Tweak hyperparameters:
   - Set lr=1e-2. Does it train? (Probably loss goes NaN - too high.)
   - Set lr=1e-5. (Trains but slowly.)
   - Increase epochs to 10. Does test accuracy improve? Plateau? Degrade (overfit)?
3. Architecture changes:
   - Add a third hidden layer of size 128.
   - Increase hidden sizes to 256, 128.
   - Watch parameter count + accuracy change.
4. Visualize: plot the per-epoch training loss using matplotlib. (Or use TensorBoard - pip install tensorboard, then tensorboard --logdir runs/.)
5. Stretch: instead of an MLP, try a small CNN. A 2-layer CNN beats this MLP at >98% accuracy. Look up nn.Conv2d. Don't worry if it doesn't work first try - convolutions take some adjusting.
What you might wonder¶
"Why Adam over SGD?" Adam adapts the learning rate per-parameter; converges faster on most problems without tuning. SGD with momentum can outperform on specific architectures (vision CNNs) when carefully tuned. For getting started: Adam.
"How big should my batch be?" For GPU work, "as big as fits in memory" is a common heuristic. Common sizes: 32, 64, 128, 256. Larger batches give smoother gradients but you might need more epochs to converge.
"What does loss.item() do?"
Extracts a Python float from a 0-dim tensor. Detached from the graph (no gradient tracking). Use when you want a number for logging.
"Why is my loss not going down?" Common causes: - Learning rate too high (loss NaN) or too low (loss flat). - Wrong loss function for your task. - Bug in the model (wrong shapes - print them). - Data not normalized.
Print loss every step initially. If it's not going down within ~100 steps on MNIST, something's wrong.
Done¶
- The four-step training loop pattern.
- DataLoader for batched iteration.
- nn.CrossEntropyLoss + Adam optimizer.
- optimizer.zero_grad() / loss.backward() / optimizer.step().
- model.train() vs model.eval().
- Save/load weights.
06 - Inference and Saving¶
What this session is¶
About 30 minutes. The other half of training - using a trained model. Loading saved weights, running predictions, the eval-mode + no-grad pattern.
Inference vs training¶
Training: forward pass + loss + backward pass + optimizer step. Slow, memory-hungry, uses gradients.
Inference: forward pass only. Fast, cheap, no gradients.
The difference matters because most of a model's lifetime is inference - users sending requests; you predicting. Optimizing inference is its own discipline (page 12).
The basic pattern¶
import torch
model = MLP() # the same class you trained with
model.load_state_dict(torch.load("mnist_mlp.pt"))
model.eval() # IMPORTANT: switch to eval mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# An input
x = torch.randn(1, 1, 28, 28).to(device) # one fake "image"
with torch.no_grad(): # IMPORTANT: skip gradient tracking
logits = model(x)
probs = torch.softmax(logits, dim=1)
pred = logits.argmax(dim=1)
print(f"predicted: {pred.item()}, confidence: {probs.max().item():.4f}")
Three things you must remember:
1. Load the weights - load_state_dict into a freshly-constructed model with the same architecture.
2. model.eval() - disables dropout, freezes batch norm running statistics.
3. with torch.no_grad(): - disables gradient tracking. Faster, uses less memory.
Forgetting any of these gives subtle bugs.
Predict on a real image¶
from PIL import Image
from torchvision import transforms
# The same transform you used during training
transform = transforms.Compose([
transforms.Grayscale(),
transforms.Resize((28, 28)),
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,)),
])
img = Image.open("my_digit.png")
x = transform(img).unsqueeze(0).to(device) # add batch dim → (1, 1, 28, 28)
with torch.no_grad():
logits = model(x)
pred = logits.argmax(dim=1).item()
print(f"predicted digit: {pred}")
Key point: the inference preprocessing must match training. Same resize, same normalize, same color space. Mismatched preprocessing is the #1 silent-bug source in ML - the model still produces a prediction, just a bad one.
Batching for speed¶
If you have many inputs, predict on them in batches - much faster than one-at-a-time:
images = [transform(Image.open(p)) for p in paths]
batch = torch.stack(images).to(device) # (N, 1, 28, 28)
with torch.no_grad():
logits = model(batch)
preds = logits.argmax(dim=1)
for path, p in zip(paths, preds):
print(path, p.item())
Batches let the GPU keep busy. For latency-sensitive online inference, you might still process single inputs; for throughput-sensitive batch jobs, batch as much as memory allows.
Save more than weights¶
state_dict() is just the parameters. For a fully-recoverable training session, save more:
torch.save({
"epoch": epoch,
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
"loss": loss,
}, "checkpoint.pt")
# Resume:
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
start_epoch = ckpt["epoch"] + 1
For production deployment, save just the model weights. For "I want to resume training tomorrow," save the full checkpoint.
TorchScript and ONNX (briefly)¶
For shipping models, two portability formats:
- TorchScript - torch.jit.script(model) or torch.jit.trace(model, example_input) produces a deployment-friendly version. Can run without Python.
- ONNX - open standard. torch.onnx.export(model, ...) produces a file readable by many runtimes (ONNX Runtime, TensorRT, browsers).
Beyond beginner scope; mentioned because deployment paths sometimes need them.
For most cases, deploying a PyTorch model directly (page 12) is fine.
Inference performance: the gotchas¶
A few things that catch people:
- First inference is slow. PyTorch JIT-compiles kernels on first use. Warm up with a dummy forward pass before timing.
- .cuda() / .to(device) is async. GPU operations are queued. To time them, call torch.cuda.synchronize() before reading the clock.
- with torch.no_grad(): matters even for small inferences. Saves memory; can be 2x faster.
- torch.set_num_threads(1) for CPU inference can speed up small models by avoiding thread overhead.
A complete example¶
"""
Load a trained MNIST MLP and predict on a single image.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(28 * 28, 128)
self.fc2 = nn.Linear(128, 64)
self.fc3 = nn.Linear(64, 10)
def forward(self, x):
x = x.view(x.size(0), -1)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x)
def main(image_path: str):
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MLP().to(device)
model.load_state_dict(torch.load("mnist_mlp.pt", map_location=device))
model.eval()
transform = transforms.Compose([
transforms.Grayscale(),
transforms.Resize((28, 28)),
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,)),
])
img = Image.open(image_path)
x = transform(img).unsqueeze(0).to(device)
with torch.no_grad():
logits = model(x)
probs = torch.softmax(logits, dim=1)
pred = logits.argmax(dim=1).item()
confidence = probs.max().item()
print(f"predicted: {pred} (confidence: {confidence:.4f})")
if __name__ == "__main__":
import sys
main(sys.argv[1])
Run: python infer.py my_digit.png.
Exercise¶
1. Train a model from page 05. Save its weights as mnist_mlp.pt.
2. Write infer.py above. Load the weights. Get a digit image (download one from Google or draw one in Paint, save as PNG). Run prediction.
3. Measure speed:
import time

for _ in range(3):  # warm up
    with torch.no_grad():
        model(x)
torch.cuda.synchronize() if device == "cuda" else None
t0 = time.time()
for _ in range(1000):
    with torch.no_grad():
        _ = model(x)
torch.cuda.synchronize() if device == "cuda" else None
print(f"{(time.time() - t0) * 1000 / 1000:.3f} ms / inference")
4. Stretch: load multiple images at once into one batch. Time forward on the batch vs looping single-images. The batch is much faster per-image - that's why batching matters.
What you might wonder¶
"Why map_location=device in torch.load?"
Loads tensors directly to the target device. Without it, PyTorch tries to load to the device they were saved on, which fails if you trained on GPU but are inferring on CPU.
"What's torch.compile?"
PyTorch 2's JIT compiler. model = torch.compile(model) can give 1.5x-3x speedup. Sometimes flaky; experiment carefully. Mentioned for awareness.
"Should I use .half() or .bfloat16() for inference?"
For modern GPUs (Volta and newer), yes - half-precision inference is ~2x faster with negligible quality drop for most models. model = model.half() then x = x.half(). Test accuracy afterward; some models tolerate half-precision better than others.
"What about quantization (INT8, INT4)?" Even more aggressive than half-precision. Used heavily for LLM inference (page 12). Beyond beginner scope on this page.
Done¶
- Load weights into a model architecture.
- Use model.eval() + with torch.no_grad():.
- Preprocess inference inputs the same way as training inputs.
- Batch inference for throughput.
- Save/load full training checkpoints.
Next: Transformers and tokenization →
07 - Transformers and Tokenization¶
What this session is¶
About an hour. What an LLM actually does - at the level needed to use, fine-tune, and serve them. The architecture, the tokenization step that confuses everyone, the autoregressive generation loop.
This page is dense. Plan to re-read.
The big picture¶
A language model:
1. Takes a sequence of tokens (integer IDs representing chunks of text).
2. Predicts a probability distribution over the next token.
3. You sample one. Append it. Repeat.
That's it. The clever part - what makes LLMs work - is the architecture that produces the next-token prediction. That architecture is the transformer.
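The predict-sample-append loop is tiny. Here it is with a toy stand-in for the model - a function returning made-up next-token probabilities over a six-word vocabulary - purely to show the loop's shape; in real generation, a transformer computes step 1:

```python
import random

random.seed(0)  # deterministic sampling for this demo

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_model(tokens):
    """Stand-in for a transformer: returns a probability per VOCAB entry.
    A real model computes this distribution from the whole sequence."""
    weights = [1.0] * len(VOCAB)
    if tokens and tokens[-1] == "the":
        weights[VOCAB.index("cat")] = 5.0  # after 'the', 'cat' becomes likely
    total = sum(weights)
    return [w / total for w in weights]

tokens = ["the"]                 # the tokenized prompt
for _ in range(5):               # generate 5 new tokens
    probs = toy_model(tokens)    # 1. predict next-token distribution
    next_tok = random.choices(VOCAB, weights=probs)[0]  # 2. sample one
    tokens.append(next_tok)      # 3. append it; repeat
print(" ".join(tokens))
```

Everything model.generate does later on this page is this loop plus smarter sampling and stopping rules.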
Tokenization¶
Text → token IDs. The model never sees characters or words; it sees integers.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, world!"
ids = tok.encode(text)
print(ids)
# [15496, 11, 995, 0]
print([tok.decode([i]) for i in ids])
# ['Hello', ',', ' world', '!']
Each token is roughly a "subword piece." Common words → single token. Rare words → multiple. The tokenizer learned its vocabulary during the model's training; you can't change it.
Why subwords: vocabulary size matters. Word-level vocabulary needs hundreds of thousands of entries (with new ones constantly appearing). Character-level produces very long sequences. Subword (BPE, WordPiece, SentencePiece) is the compromise: 32k-256k tokens covering nearly any text.
Practical implications:
- Token counts are not word counts. "I am happy" = 3 tokens, "antidisestablishmentarianism" = many.
- Prices are per-token; latency is per-token.
- Different models have different tokenizers; same text → different token counts.
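To see why common words stay whole and rare ones split, here's a toy greedy longest-match tokenizer with a hand-picked (entirely hypothetical) vocabulary. Real BPE tokenizers learn their merges from data, but the matching behavior looks similar:

```python
VOCAB = {"happy", "un", "ness", "happi", "token", "izer", "s"}

def tokenize(word, vocab):
    """Greedy longest-match subword split; single chars as a fallback."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):     # try the longest piece first
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

print(tokenize("happy", VOCAB))        # ['happy'] - common word, one token
print(tokenize("unhappiness", VOCAB))  # ['un', 'happi', 'ness'] - rare word splits
print(tokenize("tokenizers", VOCAB))   # ['token', 'izer', 's']
```

The single-character fallback is why any text can be tokenized, even strings the tokenizer has never seen.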
What a transformer does¶
Inside the model, each token ID becomes an embedding - a vector of ~768 to ~12000 dimensions (depending on the model).
A transformer layer processes a sequence of these vectors and produces an updated sequence of the same shape. The crucial operation is attention - every output position is a weighted sum of all input positions (and itself), where the weights are computed from the inputs themselves.
This lets the model "look at" other parts of the sequence when generating each output. Long-range dependencies (across thousands of tokens) become tractable.
Stack many such layers (~12-100), add positional encodings so the model knows token order, end with a linear projection back to the vocabulary, apply softmax - you have the next-token distribution.
The full math is in the AI Systems senior path, Deep Dive 07. For now, the operational view: a transformer transforms a sequence of token embeddings into a probability distribution over the next token.
Causal (autoregressive) vs masked¶
Two families:
- Causal / autoregressive models (GPT-family, Llama, Mistral, Gemma) - each position attends only to positions before it. Generates left-to-right. Used for language generation.
- Masked models (BERT-family) - every position attends to every other. Used for understanding (classification, NER, embeddings).
If you're working with chatbots, code completion, RAG - causal. Embedding for retrieval - masked. Many modern open-source LLMs are causal (the GPT-style architecture won).
The generation loop¶
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
prompt = "The quick brown fox"
input_ids = tok.encode(prompt, return_tensors="pt")
with torch.no_grad():
output = model.generate(
input_ids,
max_new_tokens=30,
do_sample=True,
temperature=0.8,
top_p=0.9,
)
print(tok.decode(output[0]))
What's happening:
1. Tokenize the prompt into IDs.
2. Call model.generate(...) - a wrapped helper around the basic predict-and-append loop.
3. The model generates 30 more tokens, sampling from each next-token distribution.
4. Decode the resulting IDs back to text.
Sampling parameters:
- temperature - how peaked the distribution is. 0 = pick the most likely (greedy). Higher = more diverse, less predictable. 0.7-1.0 typical.
- top_p (nucleus) - only consider tokens whose cumulative probability is up to p. Avoids low-probability "weird" tokens.
- top_k - only consider the k most-likely. Cruder but works.
- max_new_tokens - when to stop.
- stop - explicit stop strings.
Different sampling strategies → different output styles. Greedy is deterministic but often repetitive. Top-p sampling is the modern default.
What "small" and "big" mean¶
Some calibration:
- GPT-2 small: 124M params. Fits in 500MB. Runs on a laptop.
- Llama 3 8B: 8 billion. 16GB at FP16. Single high-end GPU.
- Llama 3 70B: 70 billion. 140GB at FP16. Multiple GPUs (or quantized down to 4-bit for a single ~40GB GPU).
- GPT-4-class frontier models: ~hundreds of billions to trillions (rumors; not public). Many GPUs.
The pattern: 10x bigger → noticeably smarter on hard tasks. Quality scales with parameters + training data + compute (the "scaling laws").
For learning, GPT-2 / Llama 3 8B (or smaller) suffice.
Context length¶
The maximum sequence length the model can attend over. GPT-2 was 1024. Llama 3 is 8192-128000+. Frontier models claim 200k+.
Two implications:
- Compute and memory cost are quadratic in context length (attention is O(n²)). 100k context is 10000x more attention compute than 1k.
- Practical context isn't the same as advertised context. A model trained on long contexts doesn't necessarily use the middle well - research papers ("Lost in the Middle") show degradation. RAG (page 10) mitigates this.
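The O(n²) claim is just counting: the attention score matrix has one entry per pair of positions.

```python
def attention_entries(n):
    """Entries in the n x n attention score matrix (per head, per layer)."""
    return n * n

short_ctx, long_ctx = 1_000, 100_000
print(f"{attention_entries(short_ctx):,}")  # 1,000,000
print(f"{attention_entries(long_ctx):,}")   # 10,000,000,000
print(attention_entries(long_ctx) // attention_entries(short_ctx))  # 10000
```

Growing the context 100x grows the pairwise-score work 100² = 10000x, which is where the quadratic cost comes from.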
Embeddings (preview)¶
Beyond next-token prediction, models can also produce embeddings - vector representations of pieces of text. These are useful beyond generation: semantic search, classification, clustering.
from sentence_transformers import SentenceTransformer
m = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["a cat sat on a mat", "the dog ran"]
embeddings = m.encode(texts) # shape (2, 384) for this model
Cosine similarity between two embeddings ≈ semantic similarity. Page 10 builds RAG on this.
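Cosine similarity itself is a one-line formula - dot product divided by the product of the vectors' lengths. Shown here on tiny hand-made vectors (real embeddings are just much longer):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat = [0.9, 0.1, 0.3]        # pretend embedding of "a cat sat on a mat"
kitten = [0.8, 0.2, 0.35]    # a semantically close sentence
invoice = [-0.1, 0.9, -0.5]  # an unrelated one

print(cosine(cat, kitten))   # close to 1.0
print(cosine(cat, invoice))  # much lower
```

Because it divides out the lengths, cosine similarity only compares direction - which is why embedding vectors are often normalized to unit length first.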
Exercise¶
1. Run the GPT-2 generation example above. Try different prompts. Vary temperature from 0.1 to 1.5. Note how it changes.
2. Inspect tokenization: encode a few strings with tok.encode and decode each ID individually (as in the tokenization example above). Notice how rare words and emoji become multiple tokens.
3. Greedy decoding: call model.generate with do_sample=False. Run twice with the same prompt. Output is identical (deterministic). Then with do_sample=True, different each time.
4. (Stretch - GPU helpful): try a small open-source LLM. Hugging Face Hub: search gpt2-medium, microsoft/phi-2, meta-llama/Llama-3.2-1B. (Llama gates require accepting a license on HF.) Load and generate. Note the quality difference.
What you might wonder¶
"Why is the same word sometimes one token and sometimes two?" Subword tokenizers split based on frequency in their training data. " happy" (with leading space) and "happy" (without) are distinct tokens. Case matters too. Don't fight it; understand it.
"What's a 'chat model' vs a 'base model'?" Base models are trained on raw text. Chat models are fine-tuned (page 09) with conversation data + safety training. For "ask a question, get an answer" use chat models. For raw text completion or further fine-tuning, base models.
"What's the actual difference between GPT-2 and modern LLMs architecturally?" Mostly: more parameters, more training data, more compute. Architectural tweaks (rotary positional encoding, grouped-query attention, SwiGLU activations, RMSNorm) are real but modest. The scaling matters more than the architecture changes.
"Should I implement attention from scratch?" Once, for understanding. Andrej Karpathy's "Let's build GPT" video walks you through it. Then use library implementations for production.
Done¶
- Understand the token → embedding → transformer → next-token-prob pipeline.
- Use a tokenizer; understand subword units.
- Generate text with sampling parameters.
- Know causal vs masked models.
- Have a calibration of model sizes.
Next: Hugging Face Transformers →
08 - Hugging Face Transformers¶
What this session is¶
About 30 minutes. Hugging Face is the GitHub of AI models. The transformers library makes using thousands of pre-trained models a 3-line operation.
The library¶
(You did this in page 01.) The library provides three main classes you'll use:
- AutoTokenizer - load any model's tokenizer.
- AutoModel / AutoModelForCausalLM / AutoModelForSequenceClassification / etc. - load a model. The AutoModelFor... variants add task-specific heads.
- pipeline - a high-level helper that combines tokenization + model + post-processing into one call.
The simplest possible usage: pipeline¶
from transformers import pipeline
# Text classification
clf = pipeline("sentiment-analysis")
print(clf("I love this!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]
print(clf("This is terrible."))
# [{'label': 'NEGATIVE', 'score': 0.9991}]
# Text generation
gen = pipeline("text-generation", model="gpt2")
print(gen("The future of AI is", max_new_tokens=20))
# Translation
trans = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
print(trans("Hello, how are you?"))
# Question answering
qa = pipeline("question-answering")
print(qa(question="Where do I live?", context="My name is Alice and I live in Lagos."))
# {'answer': 'Lagos', ...}
Each pipeline picks a default model, downloads it (first time), runs it. Useful for prototyping.
Browsing the Hub¶
huggingface.co hosts hundreds of thousands of models. Filter by task, language, license. Common model names you'll see:
- gpt2, gpt2-medium - small classical LLMs. Good for learning.
- microsoft/phi-3-mini-4k-instruct - small + capable + permissive license.
- meta-llama/Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B - Meta's open weights (gated; accept license).
- mistralai/Mistral-7B-v0.3 - open-source Mistral.
- google/gemma-2-2b - small Gemma.
- sentence-transformers/all-MiniLM-L6-v2 - tiny embedding model. Page 10.
- distilbert-base-uncased - small BERT-family for classification, embedding.
Each model page on the Hub has a README with usage, license, evaluation, intended use.
Direct usage (not via pipeline)¶
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "microsoft/phi-3-mini-4k-instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
prompt = "Write a haiku about garbage collection:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tok.decode(output[0], skip_special_tokens=True))
Key arguments:
- torch_dtype=torch.bfloat16 - load weights in bfloat16 instead of float32. Halves memory; minimal quality loss.
- device_map="auto" - automatically distribute layers across available devices (GPU + CPU fallback).
- return_tensors="pt" - tokenizer returns PyTorch tensors.
- skip_special_tokens=True - strip <eos>, <bos>, etc. from output.
Chat templates¶
Modern chat-tuned models expect a specific message format. The tokenizer knows it:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What's the capital of Nigeria?"},
]
inputs = tok.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
with torch.no_grad():
output = model.generate(inputs, max_new_tokens=50)
response_only = tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response_only)
apply_chat_template formats the messages with the model's expected special tokens (<|user|>, <|assistant|>, etc.). add_generation_prompt=True adds the assistant's turn-start so the model knows it's its turn to speak.
For chat-tuned models, always use the chat template. Raw prompt-completion produces worse results.
Embedding models¶
For semantic search (used in page 10):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["A dog is running.", "A cat is sleeping.", "I bought milk."]
embeddings = model.encode(texts, normalize_embeddings=True)  # unit-length vectors
print(embeddings.shape) # (3, 384) - three 384-dim vectors
# Compute similarities
import numpy as np
sim = embeddings @ embeddings.T # dot product of unit vectors = cosine similarity
print(sim) # diagonal is 1.0 (each vector with itself);
# off-diagonal lower for unrelated texts
sentence-transformers wraps Hugging Face models and handles the "pool tokens into a sentence vector" step.
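That pooling step is typically a masked mean over the transformer's per-token vectors. A sketch with fake hidden states and an attention mask (real pipelines take these from the model's last hidden state; shapes here are made up):

```python
import torch

# Fake "last hidden state": batch of 2 sentences, 4 token positions, 8 dims
hidden = torch.randn(2, 4, 8)
# Attention mask: sentence 1 has 4 real tokens, sentence 2 only 2 (rest padding)
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]], dtype=torch.float)

m = mask.unsqueeze(-1)            # (2, 4, 1), broadcasts over the 8 dims
summed = (hidden * m).sum(dim=1)  # add up only the real-token vectors
counts = m.sum(dim=1)             # how many real tokens per sentence
sentence_vecs = summed / counts   # masked mean: one vector per sentence
print(sentence_vecs.shape)        # torch.Size([2, 8])
```

The mask matters: without it, padding tokens would dilute the average for short sentences.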
Caching¶
By default, models download to ~/.cache/huggingface/. Big models (gigabytes) live here. To change the location, set the HF_HOME environment variable (e.g. export HF_HOME=/big/disk/hf-cache) before importing the library.
To pre-download a model without using it (useful in Docker):
from huggingface_hub import snapshot_download
snapshot_download(repo_id="microsoft/phi-3-mini-4k-instruct")
Quantized models¶
LLMs are huge. Loading FP16 needs gigabytes; FP32 needs 2x. Quantization reduces precision further:
- GPTQ / AWQ - 4-bit quantization, requires specific quantized weights.
- bitsandbytes - runtime 8-bit / 4-bit quantization for any model:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", quantization_config=bnb)
8B parameters at 4-bit ≈ 4GB. Fits on consumer GPUs.
Quality drops ~1-5% on benchmarks; for many tasks, indistinguishable. Used heavily in production inference (page 12).
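The memory figures above are back-of-envelope arithmetic: parameter count times bits per parameter (weights only - activations and the KV cache add more on top):

```python
def model_gb(params, bits):
    """Approximate weight memory in GB (ignores activations and KV cache)."""
    return params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"8B params @ {bits}-bit: {model_gb(8e9, bits):.0f} GB")
# 8B params @ 32-bit: 32 GB
# 8B params @ 16-bit: 16 GB
# 8B params @ 8-bit: 8 GB
# 8B params @ 4-bit: 4 GB
```

This is the same arithmetic behind the "how big a model can I run?" rule of thumb below: ~2 bytes per parameter at FP16, ~0.5 bytes at 4-bit.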
Exercise¶
1. Run the simplest pipeline: pipeline("sentiment-analysis") on a few sentences of your own.
2. Generate text with a small model: pipeline("text-generation", model="gpt2") with a few prompts.
3. Direct model usage with the chat-template form above. Use any chat-tuned model that fits your hardware. Try several prompts.
4. Embeddings: with sentence-transformers, encode 5 sentences (some related, some not). Compute the similarity matrix. Notice which pairs score high.
5. (Stretch - GPU helpful) Load Llama-3.2-1B (accept license on HF first; small enough for most setups). Compare its outputs to gpt2's.
What you might wonder¶
"How big a model can I run?" Rule of thumb (FP16): need ~2 bytes per parameter for inference. 1B params = 2GB. 7B = 14GB. 70B = 140GB. With 4-bit quantization, ~0.5 bytes per param. 70B at 4-bit ≈ 40GB.
"Why does the model download so slowly?"
HF servers throttle anonymous traffic. Authenticate (huggingface-cli login) for higher limits, especially for gated models.
"What's device_map="auto" actually doing?"
Hugging Face's accelerate library partitions the model across available devices (GPU layers; CPU offload for excess). For small models, the whole thing goes on GPU. For huge models, layers spill to CPU (much slower but possible).
"Should I use safetensors or pytorch_model.bin?" Safetensors. Faster loading, safer (no arbitrary code execution risk). All modern HF models ship both.
Done¶
- Use pipeline for the quickest possible model usage.
- Use AutoTokenizer + AutoModelForCausalLM for direct control.
- Apply chat templates for chat-tuned models.
- Use sentence-transformers for embeddings.
- Load quantized models for memory efficiency.
09 - Fine-Tuning¶
What this session is¶
About 90 minutes. Adapt a pretrained model to your data. Modern parameter-efficient fine-tuning (LoRA) - feasible on consumer GPUs. By the end you'll have fine-tuned a small LLM on a custom dataset.
This page benefits from a GPU. CPU works but is very slow.
The two modes¶
- Full fine-tuning - update all model weights. Best quality; needs massive memory (7B model in FP16 needs ~28GB just for the optimizer states). Beyond most beginners' budgets.
- Parameter-efficient fine-tuning (PEFT) - update only a tiny subset of new parameters. LoRA is the most-used. Same effective quality for ~1% of the parameters' worth of training. Runs on a single consumer GPU.
We'll do LoRA.
What LoRA actually does¶
Each big linear layer in a transformer (the nn.Linear from page 04, scaled up) is a matrix W. Instead of updating W directly, LoRA learns a low-rank update:
W_new = W + A·B
where A is (d, r) and B is (r, d). The rank r is small (typically 8-64). The original W stays frozen; only A and B train.
Memory savings: instead of d × d parameters per layer (millions), you train d × r + r × d (tens of thousands). For a 7B model with rank-16 LoRA: ~10M trainable parameters instead of 7B.
You don't implement this - peft library handles it.
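Still, a toy sketch makes the mechanics concrete: a frozen linear layer plus a trainable A·B update. This is illustrative only - the class name, init, and scaling details are ours, not peft's:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: frozen W plus a trainable low-rank update A @ B."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze W
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # (d_in, r)
        self.B = nn.Parameter(torch.zeros(r, d_out))        # (r, d_out); zero init
        self.scale = alpha / r                              # so A@B = 0 at start: training
                                                            # begins from base behavior
    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 (512*8 + 8*512), vs 262656 params in the full layer
```

The count printed at the end is the whole point: the low-rank path trains ~3% as many parameters as the layer it adapts, and the ratio improves as layers get bigger.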
Setup¶
- peft - parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning, etc.).
- trl - Transformers Reinforcement Learning. Includes `SFTTrainer`, the easiest fine-tuning loop wrapper.
- bitsandbytes - 4-bit quantization, used by QLoRA.
A complete LoRA fine-tuning¶
We'll fine-tune a small model on a tiny dataset to make it answer in a specific style.
import torch
from datasets import Dataset
from transformers import (
AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
model_name = "microsoft/Phi-3-mini-4k-instruct"
# Quantized to 4-bit
bnb = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb,
device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# LoRA config
lora = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: 6.3M || all params: 3.8B || trainable: 0.16%
# A tiny training dataset (in production: load real data with `datasets`)
examples = [
{"text": "<|user|>\nWhat's 2+2?<|end|>\n<|assistant|>\nIt's 4, mate.<|end|>"},
{"text": "<|user|>\nHello!<|end|>\n<|assistant|>\nG'day!<|end|>"},
{"text": "<|user|>\nWhat's your favorite color?<|end|>\n<|assistant|>\nProbably blue, mate.<|end|>"},
# ... in a real run, hundreds to thousands of examples ...
] * 50
train_ds = Dataset.from_list(examples)
# Training config
cfg = SFTConfig(
output_dir="./lora-out",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
bf16=True,
logging_steps=10,
save_strategy="epoch",
max_seq_length=512,
)
trainer = SFTTrainer(
model=model,
tokenizer=tok,
train_dataset=train_ds,
args=cfg,
dataset_text_field="text",
)
trainer.train()
trainer.save_model("./lora-out/final")
The whole thing - model load, LoRA setup, training loop - fits in ~50 lines. SFTTrainer from trl wraps Hugging Face's Trainer with sensible defaults.
Run time: ~5-15 minutes on a free Colab T4 GPU for the small dataset above.
Use the fine-tuned model¶
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", device_map="auto", torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = PeftModel.from_pretrained(base, "./lora-out/final")
model.eval()
inputs = tok("<|user|>\nHi there!<|end|>\n<|assistant|>\n", return_tensors="pt").to(model.device)
with torch.no_grad():
out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
# Hopefully responds in the trained style ("G'day mate!")
The fine-tuned model = base model + LoRA adapter. The adapter is small (~30MB for our config); the base is shared.
Merge LoRA into the base (for deployment)¶
For inference at scale, you may want a single merged model:
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
tok.save_pretrained("./merged-model")
Result: a standalone model with LoRA's updates baked in. Drops the adapter layer overhead at inference time.
Hyperparameter notes¶
- `r` (LoRA rank) - 8, 16, 32, 64. Higher = more capacity, more memory. Start at 16.
- `lora_alpha` - usually 2×r. Acts as a scaling factor.
- `target_modules` - which linear layers to LoRA-fy. Common: `["q_proj", "v_proj"]` for cheap, `["q_proj", "k_proj", "v_proj", "o_proj"]` for fuller coverage. Model-specific naming.
- `learning_rate` - much higher than full fine-tuning (because you have fewer params). 1e-4 to 5e-4 typical.
- `per_device_train_batch_size` + `gradient_accumulation_steps` - the effective batch is the product. A small batch fits memory; accumulation simulates a bigger batch.
These need experimentation. Start with the defaults above; adjust.
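A quick sanity check on that last knob - the effective batch size is just the product (times GPU count, if you have several):

```python
per_device_train_batch_size = 2   # what fits in GPU memory at once
gradient_accumulation_steps = 4   # gradients summed over this many micro-batches
num_gpus = 1

# The optimizer steps once per effective batch, as if you had this much memory
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 8 - matches the SFTConfig above
```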
QLoRA - even smaller memory¶
The BitsAndBytesConfig(load_in_4bit=True, ...) we used is QLoRA - quantize the base model to 4-bit, train LoRA adapters on top in higher precision. Lets you fine-tune 7B models on a 12GB GPU. The standard approach for hobbyist fine-tuning.
What you can / can't fine-tune¶
LoRA fine-tuning is great for:
- Style adaptation - "respond in our brand's voice."
- Domain-specific Q&A - train on your support docs.
- Output format - JSON conformance, structured outputs.
- Tool / function calling - train the model to emit specific function calls.
LoRA is bad for:
- Teaching the model NEW factual knowledge. That requires more data + full fine-tuning, and the model often half-learns and hallucinates the rest. For facts, use RAG (page 10).
- Reasoning skill upgrades. Generally requires lots of data + more compute than LoRA gives.
Dataset format¶
Most fine-tuning recipes want a list of conversation strings in the model's chat format. Building one:
- Collect 50-1000+ example interactions in the desired style.
- Format each as a single string using the model's chat template.
- Wrap in a `datasets.Dataset`.
Real datasets often live on Hugging Face Hub - datasets.load_dataset("squad") etc. Filter / format as needed.
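In practice you'd call `tok.apply_chat_template(conversation, tokenize=False)` to produce these strings. A hand-rolled sketch of the same idea, hardcoding Phi-3-style markup (the helper name is ours):

```python
def to_training_text(turns: list[dict]) -> str:
    """Join a conversation into one Phi-3-style training string."""
    return "\n".join(f"<|{t['role']}|>\n{t['content']}<|end|>" for t in turns)

example = to_training_text([
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "It's 4, mate."},
])
print(example)
# <|user|>
# What's 2+2?<|end|>
# <|assistant|>
# It's 4, mate.<|end|>
```

Prefer the tokenizer's own template when available - each model family uses different special tokens, and mismatched markup quietly ruins a fine-tune.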
Exercise¶
You need a GPU (or Colab) for this exercise. CPU works but takes hours.
- Run the example above. Train for 1 epoch on the toy dataset. Confirm training loss decreases.
- Use the trained model. Run a few prompts. Notice the style.
- Increase the dataset size. Add 10 more diverse examples. Re-train. Compare outputs.
- Tweak `r`: try `r=8` vs `r=64`. Quality difference? Memory difference?
- (Stretch) Use a real dataset from Hugging Face: `datasets.load_dataset("squad", split="train[:1000]")`. Format the QA pairs into the chat template. Fine-tune. Evaluate by hand.
What you might wonder¶
"Why is my fine-tuned model worse than the base?" Common causes: dataset too small (under ~100 examples), learning rate too high (model overfits and forgets general knowledge), bad data formatting (model is learning your formatting bugs not your style). Start with a known-good recipe and iterate.
"What's 'catastrophic forgetting'?" The fine-tuned model loses knowledge from its base training. Severe with full fine-tuning; minimal with LoRA (the base weights are frozen). One reason LoRA is the default.
"How do I evaluate the fine-tuned model?" Page 11. Critical and the hardest part of ML.
"DPO? RLHF? PPO? GRPO?" Reinforcement-learning-from-feedback techniques used by frontier labs to align chat models. Beyond beginner; mentioned for awareness.
Done¶
- Distinguish full fine-tuning from LoRA / PEFT.
- Set up `peft` + `trl` for a real LoRA training run.
- Train and save a fine-tuned model.
- Load and use the trained adapter.
- Pick reasonable hyperparameters.
Next: Retrieval-Augmented Generation →
10 - Retrieval-Augmented Generation (RAG)¶
What this session is¶
About an hour. RAG is the dominant production pattern for LLMs answering questions over your data. Instead of fine-tuning facts in, you retrieve relevant passages at query time and pass them to the model.
Why RAG, not fine-tuning, for facts¶
Fine-tuning teaches a model patterns, styles, formats. Asking it to memorize facts works poorly: knowledge degrades, the model hallucinates "knowing" things, no clean way to update when facts change.
RAG separates concerns: the LLM is the language interface; the knowledge is in a database. Update facts by updating the database - no retraining.
The architecture¶
User question
↓
Embed the question (vector)
↓
Search a vector DB for similar passages → top-k passages
↓
Build a prompt: question + retrieved passages
↓
LLM generates answer using both
Five components:
1. Documents - your knowledge corpus (docs, PDFs, wiki).
2. Chunker - splits docs into ~200-1000 token passages.
3. Embedder - a model that turns text into vectors.
4. Vector store - stores passages + their embeddings; supports nearest-neighbor search.
5. LLM - generates the final answer.
A complete (minimal) RAG¶
from sentence_transformers import SentenceTransformer
import numpy as np
import torch
from transformers import pipeline
# 1. Documents - a tiny corpus
docs = [
"Lagos is the most populous city in Nigeria.",
"Abuja is the capital of Nigeria.",
"The Niger River flows through Mali, Niger, and Nigeria.",
"Python was created by Guido van Rossum in 1991.",
"Rust was first released in 2010 by Mozilla.",
"Go was designed at Google starting in 2007.",
]
# 2. Embed all documents (one-time index-build)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)
# shape: (6, 384)
# 3. Search function
def retrieve(query, k=2):
q_emb = embedder.encode([query], normalize_embeddings=True)
sims = (doc_embeddings @ q_emb.T).flatten() # cosine sim because normalized
topk = np.argsort(-sims)[:k]
return [docs[i] for i in topk]
# 4. Generate with retrieved context
gen = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct",
torch_dtype=torch.bfloat16, device_map="auto")
def answer(question):
context = "\n".join(f"- {p}" for p in retrieve(question))
prompt = f"""<|user|>
Use the following context to answer the question.
Context:
{context}
Question: {question}<|end|>
<|assistant|>
"""
out = gen(prompt, max_new_tokens=100, do_sample=False)[0]["generated_text"]
return out[len(prompt):] # strip the prompt
print(answer("What is the capital of Nigeria?"))
print(answer("Who created Python?"))
That's a working RAG in ~30 lines. The model answers using the retrieved context, not just its baked-in knowledge.
For real production you'd swap in a proper vector DB (next section); the LLM call stays the same.
Vector databases¶
For 100 documents, a NumPy dot product is fine. For 1M+ documents, you need a vector database with efficient approximate nearest neighbor search.
Self-hosted:
- FAISS (Facebook) - library, in-process. Fast. No persistence layer; you build that.
- Chroma - embedded, easy to start.
- Qdrant - server-mode, production-grade.
- Weaviate - feature-rich, server-mode.
- Milvus - distributed, for very large scale.
Hosted:
- Pinecone - first popular hosted vector DB.
- Cloud-native: AWS OpenSearch with k-NN, Postgres + pgvector, Redis with vector search.
For learning: Chroma or FAISS. For production: depends on scale and existing infra.
A Chroma example:
import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
documents=docs,
embeddings=doc_embeddings.tolist(),
ids=[f"doc-{i}" for i in range(len(docs))],
)
results = collection.query(
query_embeddings=embedder.encode(["What is Lagos?"]).tolist(),
n_results=2,
)
print(results["documents"])
Chunking strategies¶
Long documents must be split. Naive: split every N characters. Better:
- Fixed-size with overlap (e.g., 500 chars, 50-char overlap to preserve context across boundaries).
- Semantic chunks (paragraphs, headings).
- Recursive chunking - try paragraphs first, fall back to sentences, fall back to words.
langchain.text_splitter.RecursiveCharacterTextSplitter is the popular tool. Try a few; the best chunking is task-dependent.
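A fixed-size chunker with overlap is only a few lines - a sketch of the first strategy above:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks with overlap, so text that straddles a
    boundary appears intact in at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step back by the overlap each time
    return chunks

doc = "word " * 300  # a 1500-character stand-in document
chunks = chunk_text(doc, size=500, overlap=50)
print(len(chunks), len(chunks[0]))  # 4 500
```

Each chunk repeats the last 50 characters of the previous one - that redundancy is the price of not cutting a sentence in half at retrieval time.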
Embedding choices¶
Bigger embedding model = better retrieval, slower to embed, larger vectors.
| Model | Dim | Speed | Quality |
|---|---|---|---|
| `all-MiniLM-L6-v2` | 384 | very fast | decent |
| `all-mpnet-base-v2` | 768 | medium | good |
| `BAAI/bge-base-en-v1.5` | 768 | medium | excellent |
| `BAAI/bge-large-en-v1.5` | 1024 | slow | best |
| `text-embedding-3-small` (OpenAI) | 1536 | API | excellent |
| `nomic-embed-text-v1.5` | 768 | medium | excellent, open source |
For learning: all-MiniLM-L6-v2. For production: BGE or Nomic embed are strong open options.
Quality knobs¶
Things that matter, in order of impact:
- Chunking strategy. Bad chunks = bad retrieval. Tune first.
- Number of retrieved chunks (k). 3-10 typical. Too few = miss relevant info. Too many = context bloat, "lost in the middle."
- Re-ranking. Retrieve k=20, then re-rank with a more expensive model down to top-5. Improves quality at modest cost.
- Hybrid search. Combine semantic (vector) with keyword (BM25). Catches cases where exact word match matters.
- Query rewriting. LLM rewrites the user's question into a better search query.
- Embedding model. Better embeddings = better retrieval. Worth experimenting.
For a beginner, just fixed-size chunks + top-3 semantic retrieval is a strong baseline.
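For the hybrid-search knob, a common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion. A sketch - the doc ids are made up and k=60 is the conventional constant:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge ranked lists. Docs ranked highly in
    either list accumulate a larger score and float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d7"]   # vector-search ranking
keyword = ["d1", "d9", "d3"]    # BM25 ranking
print(rrf([semantic, keyword])) # d1 and d3 lead - they appear in both lists
```

Rank fusion sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales: only the ranks matter.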
When RAG fails¶
- User asks a question whose answer requires synthesis across many docs. RAG retrieves top-k passages, each independently; cross-document synthesis fails.
- Question is ambiguous. Retrieval gets the wrong passage; answer is confidently wrong.
- The corpus genuinely doesn't contain the answer. The LLM hallucinates because the user expects an answer.
Mitigations: explicit "I don't know" in the prompt; structured outputs that include source citations; user-facing transparency about what was retrieved.
Frameworks¶
Real RAG apps often use:
- LangChain - most popular framework. Composable chains for retrieval + generation.
- LlamaIndex - alternative, more retrieval-focused.
- Haystack - pipeline-oriented, German-engineered.
These wrap the patterns above with batteries included. For learning, building from scratch (like this page) makes the mechanics clear; for production, frameworks save time.
Exercise¶
- Run the minimal RAG above. Confirm the answers use the retrieved context.
- Expand the corpus: add 20 more facts. Try a question that's ambiguous between two retrieved docs - see how the model handles it.
- Different embedder: swap `all-MiniLM-L6-v2` for `BAAI/bge-base-en-v1.5`. Larger model; do retrieval results improve for tricky questions?
- Chunking exercise: download a Markdown doc (your `README.md` or any project's). Use `RecursiveCharacterTextSplitter` from langchain to chunk it. Index the chunks. Ask questions.
- (Stretch) Try Chroma instead of in-memory NumPy. Same RAG flow with persistent index.
What you might wonder¶
"What if the LLM ignores the retrieved context?" Happens. Make the prompt clearer: "Answer ONLY using the context above. If the context doesn't contain the answer, say 'I don't know.'" Smaller models ignore instructions more; bigger ones follow.
"Should I do RAG or fine-tuning?" Both, often. RAG for facts; fine-tuning for style + format. Don't pit them against each other.
"What's a 'retriever' vs an 'embedder'?" An embedder produces vectors. A retriever uses the embedder + a vector DB + post-processing to return passages. Same pipeline, different name for different layers.
"How do I evaluate a RAG?" Next page. Hardest part.
Done¶
- Build a RAG pipeline end-to-end with embeddings + vector search + LLM.
- Distinguish from fine-tuning (facts vs style).
- Recognize vector DB options.
- Apply basic quality knobs (chunking, k, re-ranking, hybrid search).
- Know LangChain / LlamaIndex / Haystack exist.
11 - Evaluation¶
What this session is¶
About an hour. The hardest part of building with AI - and the one most beginner tutorials skip. By the end you'll know why "looks good" is not evaluation, and how to do it for real.
Why this page matters¶
Most "AI products" you'll see are evaluated by their authors clicking around and saying "yep, looks good." That's why launched products go viral for embarrassing failures the moment a user does something unexpected.
Good evaluation is what separates a demo from a system. Most engineers - even experienced ML practitioners - get this wrong. Take this page seriously.
The fundamental rule¶
You cannot iterate on what you cannot measure.
Without an objective evaluation, every change is a coin flip. Did this prompt change improve things or make them worse? You can't tell. Without a number, you'll convince yourself it's better - because you wrote it.
A measurable eval lets you see real improvement, A/B test prompts, catch regressions when you change models, ship confidently.
Types of evaluation¶
Different problems need different evals.
Classification - easy¶
If your output is a class (positive/negative, A/B/C, 0-9):
- Accuracy - fraction correct.
- Precision - of predicted positives, what fraction are actually positive.
- Recall - of actual positives, what fraction did you find.
- F1 - harmonic mean of precision and recall.
- Confusion matrix - full breakdown of predicted vs actual.
scikit-learn:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
Done. Easy.
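To see the definitions in action, here's the arithmetic by hand on a toy binary problem:

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives: 3
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 1

precision = tp / (tp + fp)  # of predicted positives, fraction correct
recall = tp / (tp + fn)     # of actual positives, fraction found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 2))  # 0.75 0.75 0.75
```

`classification_report` computes exactly these numbers (per class, plus averages) - but computing them once by hand makes the report readable.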
Free-form text generation - hard¶
If your output is a paragraph of text (LLM chatbot, summarizer):
- Exact match - useless unless you're matching against a fixed answer.
- BLEU, ROUGE, METEOR - n-gram overlap with a reference. Useful for translation; poor for chat (paraphrase = bad score).
- Embedding similarity - cosine similarity between generated and reference embedding. Better than n-gram.
- LLM-as-judge - use a strong model to grade outputs. Most-used in practice, with caveats below.
- Human eval - gold standard, expensive.
For chat / RAG / summarization, LLM-as-judge is the practical default.
Retrieval - medium¶
If your problem is "did I retrieve the right passages":
- Recall@K - of all relevant passages, how many appear in your top-K results.
- MRR (Mean Reciprocal Rank) - average of `1 / rank-of-first-relevant`.
- nDCG - normalized discounted cumulative gain; rewards ranking relevant passages near the top.
You need a labeled dataset: each query has a known correct passage. Build this manually for ~100-1000 queries.
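Both metrics are short enough to compute yourself - a sketch over a toy labeled set (doc ids are made up):

```python
def recall_at_k(results: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose relevant passage appears in the top-k results."""
    hits = sum(rel in res[:k] for res, rel in zip(results, relevant))
    return hits / len(relevant)

def mrr(results: list[list[str]], relevant: list[str]) -> float:
    """Mean of 1/rank of the first relevant result (0 if never retrieved)."""
    total = 0.0
    for res, rel in zip(results, relevant):
        if rel in res:
            total += 1 / (res.index(rel) + 1)
    return total / len(relevant)

# Per query: the retriever's ranked doc ids, and the known-correct passage
results = [["d1", "d2", "d3"], ["d9", "d4", "d7"], ["d5", "d6", "d8"]]
relevant = ["d1", "d4", "d8"]

print(recall_at_k(results, relevant, k=2))  # 2/3: the third query misses at k=2
print(mrr(results, relevant))               # (1 + 1/2 + 1/3) / 3 ≈ 0.611
```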
LLM-as-judge¶
You have outputs from your system. You want to know "are these good?"
from openai import OpenAI  # or any LLM client

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
judge_prompt = """You are grading an AI assistant's answer.
Question: {question}
Expected answer: {gold}
AI answer: {generated}
Grade the AI answer on:
- Correctness (0-5): does it match the expected answer in substance?
- Completeness (0-5): does it cover the key points?
- Conciseness (0-5): is it free of fluff?
Respond ONLY with JSON: {{"correctness": N, "completeness": N, "conciseness": N, "rationale": "..."}}
"""
def grade(question, gold, generated):
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": judge_prompt.format(
question=question, gold=gold, generated=generated
)}],
)
import json
return json.loads(response.choices[0].message.content)
Run your system against a held-out dataset of (question, gold-answer) pairs. Have the judge grade each. Aggregate the scores.
Caveats:
- Use a more capable model for judging than for generating. Don't have GPT-3.5 grade GPT-3.5; have GPT-4 or Claude do it.
- Judge bias. LLM judges have biases (preferring longer answers, preferring their own family's models). Counter with care.
- Calibration. Run human-judged grades on a subset; check the LLM judge agrees with humans.
- Pairwise > absolute. "Which of A and B is better" judgments are more stable than absolute 1-5 scores.
Even with caveats, LLM-as-judge is far better than "looks good to me."
Build an evaluation dataset¶
For LLM apps, this is the work you'll spend the most time on. Patterns:
- Production traces. Sample real user queries from your service logs. Manually label expected answers. ~100-1000 examples.
- Adversarial cases. Specifically construct queries that should fail or should succeed. Boundary cases, ambiguous queries, out-of-scope queries.
- Public benchmarks. MMLU (multitask), TruthfulQA, HumanEval (coding), GSM8K (math). Useful for "how does my model compare," less useful for "is my prompt better."
A good eval dataset is representative + adversarial + maintained. Production examples + manually-curated edge cases. Refresh as your product evolves.
A real workflow¶
Pattern that works:
- Build a small golden dataset. ~50-200 examples to start.
- Run your current system. Score with LLM-as-judge. Get a baseline number.
- Make a change (new prompt, new model, new retrieval).
- Re-run. Compare. If the number went up materially, ship; if it went down, revert; if it's noise, you didn't change enough.
- Expand the eval set as you discover failure modes in production.
This loop is the whole job. Every successful AI product team runs some version of it. Every failed one didn't.
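Step 4's comparison boils down to averaging judge scores across the eval set and looking at the delta - a sketch with made-up grades:

```python
import statistics

def mean_score(grades: list[dict]) -> float:
    """Average the judge's correctness scores (0-5) across the eval set."""
    return statistics.mean(g["correctness"] for g in grades)

# Hypothetical judge output for the same eval set, before and after a change
baseline = [{"correctness": 3}, {"correctness": 4}, {"correctness": 3}]
candidate = [{"correctness": 4}, {"correctness": 4}, {"correctness": 5}]

delta = mean_score(candidate) - mean_score(baseline)
print(round(delta, 2))  # 1.0 - a material improvement, worth shipping
```

With a 50-example set, small deltas are noise; track the same metric run-to-run and only act on shifts larger than the run-to-run wobble.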
Cost and latency are evaluation criteria¶
A model that's 5% better but 10× slower might be worse for your product. Track both quality and ops costs:
- Per-request cost (tokens × model price).
- p50, p95 latency.
- Throughput (requests per second).
A useful "is the next model worth it" question: "for every 1% quality improvement, how much do cost/latency change?"
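Per-request cost is simple arithmetic - a sketch with illustrative prices (check your provider's current rates):

```python
# Hypothetical prices, in dollars per 1M tokens
input_price = 0.15
output_price = 0.60

tokens_in, tokens_out = 1200, 300  # a typical RAG request: big prompt, short answer
cost = tokens_in / 1e6 * input_price + tokens_out / 1e6 * output_price
print(f"${cost:.6f}")  # $0.000360 per request
```

Multiply by expected request volume before comparing models: fractions of a cent per request become real money at millions of requests.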
Bias, fairness, safety¶
Big and important; out of scope for a beginner page. The minimum:
- Test on diverse inputs. Different demographics, languages, dialects, edge cases.
- Test refusal behavior. Does it refuse harmful requests? Does it over-refuse benign ones?
- Test on adversarial prompts (prompt injection).
Production teams have dedicated red-teamers. For your first project, manual sampling is fine.
Specific tools¶
- `evaluate` (Hugging Face) - eval-metric library. Bundles many standard metrics.
- `langsmith` (LangChain Labs) - tracing + evaluation platform.
- `promptfoo` - open-source eval CLI for LLM prompts.
- `ragas` - RAG-specific evaluation metrics (faithfulness, context relevancy).
- `lm-eval-harness` (EleutherAI) - runs many academic benchmarks.
For learning: roll your own (the snippet above). For scale: pick one tool.
Exercise¶
- Build a tiny eval dataset. Use the RAG from page 10. Write 10 (question, expected-answer) pairs covering your corpus.
- Run the RAG. Score each output manually with a 1-5 score on correctness. Average the scores; that's your baseline.
- Make a change. Change the prompt, or k=2 → k=4, or use a bigger embedder. Re-score. Did it improve?
- Add LLM-as-judge. Have the model itself score the outputs against expected answers. Compare to your manual scores. How well do they agree?
- (Stretch) Use the `ragas` library on your RAG. Run its faithfulness + answer_relevancy metrics.
What you might wonder¶
"How big does my eval set need to be?" 50 is the minimum for noisy signal. 500+ is comfortable. 5000+ for academic-paper-strength results. For getting started, start at 50 and expand.
"Can I trust LLM-as-judge?" Mostly. Pair with human spot-checks (10% of examples reviewed by you). When LLM-as-judge says scores went up but you can't see the improvement, something's miscalibrated.
"What about RLHF / DPO / online evaluation?" Real production AI products have ongoing eval pipelines collecting user feedback, A/B testing changes, fine-tuning on preference data. Beyond beginner; mentioned for awareness.
"How does this compare to evaluating a 'normal' classifier?" Classifiers have ground truth + simple metrics. LLM outputs have ambiguity at every step - there's no single correct answer to "summarize this article." Evaluation gets correspondingly fuzzy. The discipline is the same; the metrics are softer.
Done¶
- Recognize different eval types (classification, generation, retrieval).
- Build an evaluation dataset.
- Use LLM-as-judge correctly (with calibration awareness).
- Run the build → measure → change → measure loop.
- Track cost + latency alongside quality.
12 - Serving Models¶
What this session is¶
About 45 minutes. How to expose your model as a service users can call. Local options (Ollama, llama.cpp), high-performance serving (vLLM), and rolling your own HTTP API.
The simplest possible serve: Ollama¶
Ollama is the easiest way to run an LLM locally.
# Install (macOS):
brew install ollama
ollama serve &
# Pull and run a model:
ollama pull llama3.2:3b
ollama run llama3.2:3b
You get an interactive chat. To use programmatically:
import httpx
r = httpx.post("http://localhost:11434/api/chat", json={
"model": "llama3.2:3b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": False,
})
print(r.json()["message"]["content"])
Ollama handles the model loading, quantization, GPU detection. For local dev, it's the easiest start.
llama.cpp - for tighter control¶
llama.cpp is a C++ inference engine that runs GGUF-quantized models on CPU + GPU. Lower-level than Ollama; faster; more configurable.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Download a quantized model
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Run
./llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello!" -n 100
Or serve over HTTP (llama.cpp ships an OpenAI-compatible server):
./llama-server -m Llama-3.2-3B-Instruct-Q4_K_M.gguf --port 8080
Many projects use llama.cpp under the hood (Ollama, LM Studio, etc.).
vLLM - high-throughput serving¶
For production-grade serving of large models with high concurrency: vLLM.
Run a server (it listens on port 8000 by default):
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct
OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
r = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
)
print(r.choices[0].message.content)
vLLM's killer features:
- PagedAttention - KV-cache management like virtual memory. Way better GPU utilization than naive serving.
- Continuous batching - interleaves requests at the token level. Many concurrent users; high throughput.
- OpenAI-compatible API - drop-in replacement for OpenAI client libraries.
Used by many production LLM deployments. Needs a GPU; doesn't run on CPU usefully.
A simple HTTP wrapper around your own model¶
If you want full control:
# server.py
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
app = FastAPI()
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Phi-3-mini-4k-instruct",
torch_dtype=torch.bfloat16,
device_map="auto",
)
model.eval()
class GenerateRequest(BaseModel):
prompt: str
max_new_tokens: int = 100
temperature: float = 0.7
@app.post("/generate")
def generate(req: GenerateRequest):
inputs = tok(req.prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
out = model.generate(
**inputs,
max_new_tokens=req.max_new_tokens,
temperature=req.temperature,
do_sample=True,
)
text = tok.decode(out[0], skip_special_tokens=True)
return {"text": text[len(req.prompt):]}
Run: uvicorn server:app --host 0.0.0.0 --port 8080. Test:
curl -X POST http://localhost:8080/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Once upon a time", "max_new_tokens": 30}'
For a small model and a few users, this works fine. For high concurrency, use vLLM.
Streaming responses¶
Users want to see tokens as they generate, not wait for the whole response. Implementation:
from fastapi.responses import StreamingResponse
from threading import Thread
from transformers import TextIteratorStreamer
@app.post("/stream")
def stream(req: GenerateRequest):
inputs = tok(req.prompt, return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate, kwargs={
**inputs, "max_new_tokens": req.max_new_tokens,
"streamer": streamer, "do_sample": True, "temperature": req.temperature,
}).start()
def gen():
for token in streamer:
yield token
return StreamingResponse(gen(), media_type="text/plain")
For production: use vLLM (streaming is built-in) rather than rolling your own threading.
Containerize for deployment¶
A Dockerfile for the FastAPI server:
FROM python:3.12-slim
WORKDIR /app
# Installs the default PyTorch wheel; pick the build matching your CUDA version
RUN pip install --no-cache-dir torch fastapi uvicorn pydantic transformers
COPY server.py .
EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]
Build, push, run on Kubernetes (Containers + Kubernetes paths).
Deployment concerns¶
- GPU scheduling. Kubernetes can schedule pods to GPU nodes (`nvidia.com/gpu: 1` in resources). NVIDIA's GPU Operator manages drivers.
- Cold start. Loading a 7B model takes 10-30 seconds. Don't scale to zero unless cold start is acceptable.
- Model caching. Embed the model weights in the container image (huge), or mount as a PV (faster restarts).
- Autoscaling. GPU pods are expensive. Scale based on request queue depth or GPU utilization, not CPU.
- Observability. Latency per request, tokens/sec, GPU memory, queue depth.
Cost / latency calibration¶
Rough numbers for a single A100 GPU serving Llama-3.1-8B (FP16):
- ~30-80 tokens/sec generation rate.
- ~16 GB GPU memory for the weights alone, plus KV-cache per concurrent request.
- A few concurrent users naively; vLLM bumps this to dozens.

Quantized (4-bit) on a single consumer 24GB GPU:
- ~40-100 tokens/sec.
- A single user is comfortable; concurrency is lower.
For frontier models (70B+), you need multi-GPU or sharded serving. Beyond beginner.
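Those generation rates translate directly into user-visible latency - a back-of-envelope sketch:

```python
gen_rate = 50          # tokens/sec, mid-range of the figures above
response_tokens = 250  # a typical chat-length answer

latency_s = response_tokens / gen_rate
print(latency_s)  # 5.0 seconds until the full answer finishes generating
```

This is why streaming matters: five seconds of blank screen feels broken, while five seconds of tokens flowing feels fine.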
OpenAI compatibility is a contract¶
Many tools (LangChain, llama-index, CLI tools) speak the OpenAI HTTP API. vLLM, llama.cpp's server, Ollama (with its /v1/... endpoints) all implement it. Building against the OpenAI API makes you portable across self-hosted and hosted backends.
from openai import OpenAI
# Same code works for openai.com, vLLM, Ollama, llama.cpp:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
Exercise¶
- Install Ollama. Pull a small model. Chat.
- Call it from Python via its HTTP API.
- Build a tiny FastAPI server that wraps a small HF model (page 04's MLP works, or a Phi-3-mini for fun). Curl it.
- (Stretch - GPU helpful) Install vLLM. Serve `microsoft/Phi-3-mini-4k-instruct`. Use the OpenAI client to call it.
- (Stretch) Containerize your FastAPI server. Build the image. Run via `docker run`.
What you might wonder¶
"Should I serve via my own framework or vLLM?" For real production: vLLM (or TGI, Triton). The hand-rolled FastAPI version works fine for a hobby project but doesn't handle concurrency well.
"How do I keep the model warm?" Don't scale to zero. Have at least one replica always running. Health checks must respond fast (≤1s) without invoking the model.
"GPU memory keeps growing - what?"
KV-cache (page 07) accumulates as context grows. Cap the context length (`--max-model-len` in vLLM) or trim old turns out of the prompt.
"Open source vs API providers?" Both have a place. OpenAI/Anthropic/Google APIs are easy and powerful; you pay per token. Self-hosting is cheaper at scale but adds ops complexity. Most production teams use both - APIs for high-quality requests, self-hosted for high-volume cheaper requests.
Done¶
- Run an LLM locally with Ollama or llama.cpp.
- Serve high-throughput with vLLM.
- Build a custom FastAPI wrapper.
- Stream responses.
- Containerize for deployment.
13 - Picking a Project¶
What this session is¶
About 30 minutes plus browsing. AI OSS that accepts first contributions, with specific candidates.
What kinds of AI projects fit beginners¶
Your toolkit so far: PyTorch, Hugging Face, RAG, evaluation, serving. Good targets:
- Inference engines & serving (vLLM, llama.cpp, Ollama) - high-quality issue tickets across many difficulty levels.
- Tokenization / data tooling (Hugging Face `tokenizers`, `datasets`).
- Embedding / RAG libraries (sentence-transformers, llama-index, langchain).
- Evaluation tools (lm-eval-harness, ragas, promptfoo).
- Adjacent ML tools (numpy, scipy, scikit-learn).
- Documentation - every major project has doc-improvement work.
For more research-y work (PyTorch core, training algorithms, model architectures), you'll need deeper expertise. Build on this path first.
10-minute evaluation¶
Same standard as other beginner paths:
| Signal | Target |
|---|---|
| Stars | 100-50000 |
| Last commit | Within a month |
| Open PRs | Some, not 300+ |
| Recent PR merge time | Under 14 days |
| `good first issue` count | ≥5 |
| CONTRIBUTING.md | yes, readable |
| Tests pass on fresh clone | yes |
Candidates¶
Tier 1 - friendly, smaller scope¶
- `huggingface/transformers` - yes, the big one, BUT they have excellent issue triage. Many docs/examples PRs. Look at `good first issue`.
- `langchain-ai/langchain` - chained LLM workflows. Large but very welcoming; tons of easy-mode integrations to add.
- `run-llama/llama_index` - RAG-focused alternative to LangChain.
- `promptfoo/promptfoo` - eval tool. Small enough to be approachable; very active.
- `huggingface/tokenizers` - tokenizer library. Rust core + Python bindings.
Tier 2 - well-organized¶
- `vllm-project/vllm` - production inference serving. Issues exist at all levels.
- `huggingface/peft` - LoRA + friends. Smaller surface; active.
- `huggingface/datasets` - data loading. Adding a new dataset adapter is a common first contribution.
- `sentence-transformers/sentence-transformers` - embeddings library.
- `unslothai/unsloth` - fast fine-tuning. Welcoming.
Tier 3 - bigger, more visible¶
After 1-2 PRs.
- `pytorch/pytorch` - yes, eventually. Excellent labels; SIG structure; contributors are well shepherded.
- `huggingface/transformers` - larger contributions (new model architectures, etc.).
- `triton-lang/triton` - GPU programming DSL. Needs Triton + CUDA understanding.
Tier 4 - don't start here¶
- Foundation model labs (OpenAI, Anthropic, DeepMind) - closed source.
- PyTorch internals (autograd, distributed) - deep specialty.
Finding issues¶
Project's Issues tab. Filter by good first issue / documentation / help wanted.
Many AI projects label specific kinds of work:
- enhancement: docs
- models: add
- integration: add
- bug: confirmed
Read 5-10 issues; find one with clear repro and contained fix. Comment to claim; wait for maintainer.
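If you use the GitHub CLI, the same filtering can be scripted. A sketch, assuming `gh` is installed and authenticated; the repo here is just an example:

```shell
# List open good-first-issue tickets for a project (example: huggingface/peft)
gh issue list --repo huggingface/peft \
  --label "good first issue" --state open --limit 20
```

Swap in the `documentation` or `help wanted` labels to widen the net.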
What counts¶
For AI OSS work:
- Fix a typo in a model's documentation.
- Add a missing example in a tutorial notebook.
- Fix a quantization bug for a specific GPU.
- Add a new embedding model adapter.
- Improve an evaluation metric's implementation.
- Add a missing integration to a chain framework.
- Add support for a new fine-tuning recipe.
- Translate documentation.
All real. All count.
Specific recommendation: huggingface/transformers docs¶
For an easy first PR: open transformers/docs/source/en/ in a clone, find a doc page that's missing an example or has a confusing sentence. Submit a fix. The HF team is responsive and welcoming; you'll hear back within days. PR merges fast.
Exercise¶
- Browse three projects from Tier 1-2.
- Run the 10-minute eval on each.
- Pick the most responsive.
- Read CONTRIBUTING.md.
- Clone, install, and run their tests.
- Browse open issues. Pick two candidates. Don't claim yet.
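The clone-install-test step usually looks like this (placeholder repo name; the extras name and test layout vary per project, so check its CONTRIBUTING.md):

```shell
git clone git@github.com:<org>/<repo>.git
cd <repo>
pip install -e ".[dev]"   # editable install with dev extras (name varies)
pytest tests/ -x          # stop at the first failure
```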
What you might wonder¶
"I want to contribute to PyTorch core. Can I?"
Yes, eventually. Start with their docs work or with the smaller modules first. The bar is real, but lower than for the Linux kernel.
"I'm scared of touching ML papers' reference implementations." Don't be. Start with documentation. Reference implementations of papers (DeepMind's work, etc.) often have terse README, scattered hyperparameters, missing examples. Doc PRs are welcomed.
"What about Anthropic / OpenAI / Google work?"
Their research models are mostly closed-source. Their developer tooling (OpenAI's cookbook and openai-python, Anthropic's anthropic-sdk-python) is public and accepts PRs.
Done¶
- Recognize AI-OSS contribution shapes.
- Run a 10-minute evaluation.
- Have specific projects in mind.
Next: Anatomy of an AI OSS project →
14 - Anatomy of an AI OSS Project¶
What this session is¶
Read a real AI OSS repo top to bottom so the next one feels familiar.
Case study: huggingface/peft¶
PEFT (Parameter-Efficient Fine-Tuning) implements LoRA, QLoRA, IA3, prefix-tuning, etc. Small enough to read in a sitting; well-maintained.
Typical top level:
README.md
CONTRIBUTING.md
LICENSE
setup.py / pyproject.toml
src/peft/ # library code
tests/
examples/
docs/
.github/workflows/
What to read, in order¶
1. README.md (5 min)¶
What the project is. Quickstart example. Supported methods.
2. CONTRIBUTING.md (5 min)¶
How to set up the dev environment. Code style. Tests. PR rules.
3. setup.py / pyproject.toml (2 min)¶
Dependencies. Optional extras. Python version.
4. src/peft/ (15 min)¶
The package itself:
src/peft/
├── __init__.py # public API
├── peft_model.py # main PeftModel class
├── config.py # config classes
├── tuners/
│ ├── lora.py # LoRA implementation
│ ├── ia3.py
│ ├── prefix_tuning.py
│ └── ...
└── utils/
Read `__init__.py` first - it shows the public API surface. Then read `lora.py` - LoRA is the most-used technique and the one you already know from fine-tuning.
5. tests/ (10 min)¶
Pick test_lora.py (or similar). See how the team validates that LoRA still works across model architectures.
6. examples/ (10 min)¶
Working notebooks. Reproducible end-to-end runs.
7. .github/workflows/ (5 min)¶
tests.yml - runs pytest matrix. build_docs.yml - builds docs. release.yml - pushes to PyPI.
CI is the spec: whatever it runs, your PR must pass.
What to look for¶
- Where does data flow? For a training-time library: model → tuner wrapper → optimizer → save. For RAG: query → embed → search → context → LLM → response.
- Where's the public API? Usually `__init__.py` or a `models.py`/`api.py`.
- Where are model architectures? Usually `models/` or per-architecture files.
- Where are tests? `tests/`. Match each test to a code file.
- What's "magic"? Decorators that register models (`@register_model`), config classes that auto-load. Read the registration logic once.
Common AI-project patterns¶
- Registry pattern. Models, tuners, integrations registered by string. New addition = add to registry + implement interface.
- Hub integration. Models loaded with `from_pretrained("model-id")`. Look for `_load_pretrained_model` or similar.
- Configuration as a dataclass. `@dataclass class FooConfig`. Serializes to JSON for reproducibility.
- Mixed-precision and device handling. `with torch.cuda.amp.autocast():` blocks; `model.to(device)`.
- Pipeline abstraction. High-level wrapper over tokenizer + model + generation logic.
Once you see these in one project, you see them everywhere.
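The registry and dataclass-config patterns together fit in a few lines. This is a toy sketch with made-up names (`TUNER_REGISTRY`, `register_tuner`, `LoraTuner`), not peft's actual code:

```python
from dataclasses import dataclass, asdict
import json

# Registry pattern: implementations are looked up by string name.
TUNER_REGISTRY = {}

def register_tuner(name):
    def decorator(cls):
        TUNER_REGISTRY[name] = cls
        return cls
    return decorator

# Configuration as a dataclass: serializes to JSON for reproducibility.
@dataclass
class LoraConfig:
    r: int = 8
    lora_alpha: int = 16
    target_modules: tuple = ("q_proj", "v_proj")

    def to_json(self) -> str:
        return json.dumps(asdict(self))

@register_tuner("lora")
class LoraTuner:
    def __init__(self, config: LoraConfig):
        self.config = config

# Adding a new technique = implement the interface + register it.
tuner_cls = TUNER_REGISTRY["lora"]
tuner = tuner_cls(LoraConfig(r=4))
print(tuner.config.to_json())
```

A new tuner class plus one `@register_tuner("...")` line is exactly the "add to registry + implement interface" shape described above.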
Reading the test suite¶
Tests document expected behavior. For peft:
- `test_lora.py::test_lora_save_load` - round-trip preservation.
- `test_lora.py::test_lora_target_modules` - which layers get adapters.
- `test_lora.py::test_lora_merge` - merging LoRA back into base weights.
Each test names the contract. To break the test is to break the contract.
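The round-trip contract is easy to state in code. A toy version (hypothetical `ToyAdapter`; not peft's real test code):

```python
# Toy illustration of a save/load round-trip contract,
# in the spirit of a test like test_lora_save_load.
class ToyAdapter:
    def __init__(self, scale: float, target_modules: list[str]):
        self.scale = scale
        self.target_modules = target_modules

    def state_dict(self) -> dict:
        return {"scale": self.scale, "target_modules": list(self.target_modules)}

    @classmethod
    def from_state_dict(cls, state: dict) -> "ToyAdapter":
        return cls(state["scale"], state["target_modules"])

def test_round_trip():
    original = ToyAdapter(0.5, ["q_proj", "v_proj"])
    restored = ToyAdapter.from_state_dict(original.state_dict())
    # The contract: saving then loading preserves every field.
    assert restored.state_dict() == original.state_dict()

test_round_trip()
```

A change that makes this assertion fail has broken the contract, whatever else it improves.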
Counter-example: pytorch/pytorch¶
Several million lines. C++/CUDA/Python. Build system alone is a project. Don't read top-to-bottom. Instead, find a specific module (torch/optim/, torch/utils/data/) and read just that.
Counter-example: langchain-ai/langchain¶
Monorepo with ~100 packages. Hundreds of integrations. Don't read top-to-bottom. Pick one integration package (e.g., libs/community/langchain_community/llms/anthropic.py) and read just that.
Exercise¶
- Clone `huggingface/peft`.
- Spend 45 minutes reading per the order above.
- After: explain to yourself, out loud:
- What does this project do?
- What's the public API?
- Where would a new technique (e.g., new LoRA variant) be added?
- How is it tested?
- Pick one open `good first issue`. Locate the code it concerns.
What you might wonder¶
"I read it. I don't fully understand it." That's fine. Goal is geography, not mastery. You should know roughly where things live. Mastery comes from changes.
"The code uses techniques I haven't learned (mixin classes, metaclasses, etc.)." Note them. Don't get stuck. Modify a small piece first.
"It uses CUDA / accelerate / DeepSpeed. I can't run on my laptop."
You can still read and contribute. Many PRs are CPU-testable. Look for @require_torch_gpu decorators on tests - those are GPU-only; the rest you can run.
Done¶
- Read a real AI OSS repo with a plan.
- Know the typical layout.
- Have a target issue.
Next: Your first contribution →
15 - Your First Contribution¶
What this session is¶
The whole thing. Walk through an AI OSS contribution end-to-end.
The workflow¶
- Fork on GitHub.
- Clone your fork.
- Add upstream as remote.
- Branch off main.
- Set up the dev environment (install with extras; run tests).
- Change the file(s).
- Run lint + tests locally.
- Push to your fork; open PR.
Step 1: Fork & clone¶
git clone git@github.com:<you>/peft.git
cd peft
git remote add upstream git@github.com:huggingface/peft.git
git fetch upstream
Step 2: Branch¶
Always a fresh branch off main.
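For example (the branch name is just a convention; pick something descriptive):

```shell
git fetch upstream
git checkout -b docs/fix-lora-example upstream/main
```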
Step 3: Set up dev environment¶
For most HF projects:
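A typical editable install with dev extras (the extras name varies; check the project's CONTRIBUTING.md):

```shell
pip install -e ".[dev]"   # or ".[test]", ".[quality]" - see CONTRIBUTING.md
```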
For projects requiring GPU, run only CPU tests first:
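In HF projects, GPU-only tests are usually marked (e.g., with `@require_torch_gpu`) and skip automatically on a CPU-only machine, so a plain run works:

```shell
pytest tests/ -x   # GPU-marked tests skip automatically without a GPU
```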
If anything fails on a fresh clone, fix that first or ask in the issue.
Step 4: Make the change¶
Small. Focused. Tested.
- Docs typo / clarification - edit the `.md` file in `docs/source/`.
- Add an example - add a new file under `examples/`.
- Fix a bug - change the code; add or update a test that proves the fix.
For a first PR, prefer the first two. Bug fixes are great once you know the project.
Step 5: Re-run CI's commands locally¶
Look in .github/workflows/tests.yml. Typical:
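In PEFT's case a Makefile wraps the CI commands; these are representative, not exact - check the workflow file:

```shell
make quality   # lint / formatting checks
make test      # pytest suite
```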
All green? Push. Red? Fix locally first.
Step 6: Commit and push¶
DCO if required (git commit -s).
Step 7: Open the PR¶
On upstream repo, "Compare & pull request."
- Title. Short, descriptive. Conventional-commit style if the project uses it.
- Description. What changed, why, how tested. `Closes #123` references the issue.
- Checklist. Address every item in the PR template.
Submit. CI runs. Fix anything red by pushing more commits.
Worked example: typo in PEFT LoRA docs¶
Suppose you noticed `docs/source/conceptual_guides/lora.md` has an outdated `target_modules=["query_key_value"]` example that no longer applies to current Llama configs.
git clone git@github.com:<you>/peft.git
cd peft
git remote add upstream git@github.com:huggingface/peft.git
git fetch upstream
git checkout -b docs/lora-target-modules-llama
# Edit docs/source/conceptual_guides/lora.md
# Add a note: "For Llama-style models, use ['q_proj','v_proj']."
make quality
make docs
git add docs/source/conceptual_guides/lora.md
git commit -m "docs: clarify LoRA target_modules for Llama-style models"
git push origin docs/lora-target-modules-llama
Open PR. Wait for review.
What review looks like¶
- "LGTM, merging." Done.
- "Could you change these?" Address. Push commits to same branch.
- "Not quite - we already have a section for this." Update or close.
- Silence for a week → polite check-in comment.
HF teams are responsive (usually within days).
After the merge¶
- Update your fork's `main`.
- Delete the branch.
- Take a screenshot.
- Sit with it.
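Syncing your fork's `main` and deleting the merged branch (the first two items above) is a short sequence:

```shell
git checkout main
git fetch upstream
git merge upstream/main   # or: git rebase upstream/main
git push origin main
git branch -d docs/lora-target-modules-llama   # delete the merged branch
```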
After your first PR¶
- Pick another issue. Familiarity compounds - second is much easier.
- After 3-5 PRs in one project, become a regular. Review others' PRs.
- Pick a model architecture you care about. Contribute an integration.
- Move toward research code: paper implementations, training-script improvements.
What you might wonder¶
"PR sits for weeks?" HF responds fast. Other AI projects (research orgs, slower-paced labs) can take longer. Polite check-in after 7-10 days.
"What about PyTorch core?"
Larger surface, more rigorous review. CLA required, RFCs for non-trivial changes. Start with the docs/ tree there.
"What about OpenAI / Anthropic SDKs?"
Yes, they accept PRs to their clients (openai-python, anthropic-sdk-python). Closed-source models, open-source clients.
"Maintainer rude?" Disengage. Try another project. AI OSS has many welcoming homes.
Done with this path¶
You've:
- Installed PyTorch and the AI Python stack.
- Trained a small neural net on MNIST.
- Used Hugging Face for text generation.
- Fine-tuned a model with LoRA.
- Built a small RAG pipeline.
- Evaluated outputs honestly.
- Served a model locally.
- Read a real AI OSS project.
- Submitted a PR.
What you should do next: build a small AI tool you actually want to exist. The technology rewards practice. Pick one problem, build the simplest possible solution, iterate.
Recommended next paths on this site:
- AI Systems Engineering (senior reference) - 24-week deep dive: kernels, distributed training, inference serving, evaluation infrastructure.
- Python from Scratch - if your Python feels shaky.
- Linux from Scratch - the substrate AI runs on.
- Kubernetes from Scratch - where AI serving infra lives.
Congratulations. You are no longer a beginner.