
AI Systems From Scratch (Beginner)

Beginner path: heard-of-ChatGPT → training a small net, fine-tuning with LoRA, building RAG, serving locally, contributing to AI OSS.

Printing this page

Use your browser's Print → Save as PDF. The print stylesheet hides navigation, comments, and other site chrome; pages break cleanly at section boundaries; advanced content stays included regardless of beginner-mode state.


AI Systems From Scratch - Beginner to OSS Contributor

From "I've heard of AI / LLMs" to "I can train a small model, fine-tune a transformer, build a small RAG app, evaluate it honestly, and submit a fix to an AI-adjacent OSS project."

Who this is for

  • You've finished Python From Scratch (or you're comfortable enough in Python to write small programs).
  • You've never trained a model, OR you've copy-pasted some PyTorch / Hugging Face code without really understanding what it does.

Soft prerequisite

Python comfort is mandatory - AI tooling lives in Python. If you can't write a function, walk a list, and read a stack trace, do Python From Scratch first.

You do not need a PhD in math. We use linear algebra at the level of "dot products and matrices"; we explain everything else as it appears.

What you'll need

  • A computer. A GPU helps a lot but isn't required for the first 8 pages. Free options for hands-on GPU work: Google Colab (free tier), Kaggle Notebooks, Lambda Labs (cheap), AWS / GCP (paid).
  • Python ≥3.10 (you set this up in the Python beginner path).
  • A text editor.
  • About 5 hours/week. Path is sized for 4-6 months.

Why AI systems

  • Biggest growth area in software. The job market and OSS activity around LLMs / ML infra is the most active it's ever been.
  • OSS is the heart of the field. PyTorch, Hugging Face, vLLM, llama.cpp, Ollama, LangChain - all open-source and welcoming.
  • The barrier is lower than it looks. Modern tooling lets you fine-tune real models with ~50 lines of code; serve them with another ~30.

How this path works

Same template as the other beginner paths: one concept per page, code first then walkthrough, exercise, Q&A, done recap.

We use PyTorch as the framework throughout - most popular, best ecosystem. Hugging Face Transformers for pre-trained models. vLLM / Ollama / llama.cpp for inference.

The pages

#   Title                                  What you'll know after
00  Introduction                           What we're doing and why
01  Setup                                  Python + PyTorch + CUDA (or CPU) working
02  Tensors                                PyTorch's central data type
03  Linear algebra you actually need       Dot products, matmul, gradients (intuitive)
04  Your first neural network              A small MLP from scratch
05  Training loop                          Loss, optimizer, gradient descent
06  Inference and saving                   Loading a pretrained model, running it
07  Transformers and tokenization          What an LLM actually does
08  Hugging Face Transformers              Pre-trained models in 3 lines
09  Fine-tuning                            Adapt a model to your data (LoRA-friendly)
10  Retrieval-Augmented Generation         Embeddings + vector DB + LLM
11  Evaluation                             The hardest part of ML done seriously
12  Serving models                         vLLM, Ollama, simple HTTP wrappers
13  Picking a project                      AI-OSS candidates
14  Anatomy of an AI OSS project           Case study
15  Your first contribution                Workflow + PR

Start with Introduction.

00 - Introduction

What this session is

A 10-minute read. No code. Sets expectations.

What you're going to be able to do, eventually

By the end, you'll be able to:

  • Manipulate tensors confidently with PyTorch.
  • Build, train, and use a small neural network from scratch.
  • Load a pre-trained transformer from Hugging Face and use it.
  • Fine-tune that transformer on your own data (parameter-efficient, with LoRA).
  • Build a small Retrieval-Augmented Generation (RAG) app.
  • Evaluate model quality the right way (most people get this wrong).
  • Serve a model behind an HTTP API.
  • Clone an AI OSS project, find a small fix, submit a PR.

The deal

  • It's slow on purpose. One concept per page.
  • Python fluency assumed. Read a stack trace, write a function, walk a list.
  • No math PhD required. Linear algebra at the "dot product and matmul" level. We explain everything else inline.
  • GPU is helpful but not mandatory. Pages 01-08 work on CPU. Page 09+ benefits from GPU; Google Colab's free tier suffices.
  • You will be confused. Often. AI has more vocabulary than any other technical area on this site. Don't panic.

A note on hype vs honesty

The AI field has more hype than any other in software. To stay sane:

  • Models are token predictors. They are not "intelligent" in the way the marketing implies. They are very good at pattern completion over enormous corpora. That's an extraordinary thing - and that's all it is.
  • Most "AI products" are wrappers around APIs. The actual engineering: tokenization, retrieval, prompt design, evaluation. The "model" itself is often someone else's pre-trained checkpoint.
  • Evaluation is the hard part. "Looks good" is not evaluation. We'll do this properly in page 11.

This path treats AI as a practical engineering domain - what works, how it's built, how to ship it. We don't speculate about AGI.

What you need

  • A computer (any OS).
  • Python ≥3.10 (set up in Python From Scratch path).
  • A text editor.
  • ~5 hours/week. Path is sized for 4-6 months.
  • A GPU for pages 09+ (or use Google Colab / Kaggle for free).

What you do NOT need

  • A PhD or MS.
  • A formal math background beyond high school algebra + intuitive linear algebra (we cover what you need).
  • A cloud account or paid API. Open-source models run locally; we use them.
  • C++ / CUDA. Those are senior-path material (AI Systems senior reference).

How long this realistically takes

4-6 months at 5 hours/week to "submit a PR."

The slowest pages are 07 (transformers) and 09 (fine-tuning). Plan for one or two re-reads of each.

What success looks like

You'll be able to:

  • Look at a model.py in any HF model and roughly understand what it does.
  • Build a small project end-to-end: load data, train, evaluate, serve.
  • Read a research paper's abstract, introduction, and experiments sections and predict what its code does.
  • Submit a fix to a real AI OSS project.

You will not be able to:

  • Train a frontier LLM. (Multi-million-dollar GPU farms; not in 6 months.)
  • Tell people you're "an ML engineer." (That takes years of work beyond this.)
  • Pass a FAANG ML interview. (Different focus - leetcode plus theory.)

What you'll have: the foundation to keep going. The AI Expert Roadmap is the natural follow-up - 12 months of structured study from here.

One last thing before we start

If a page feels too dense - stop, re-read. Still dense? Skip, come back.

The AI field uses jargon shamelessly. When a word appears you haven't seen, this path defines it inline. If a word slips through without definition, that's a bug - note it.

Ready? Next: Setup →

01 - Setup

What this session is

About 45 minutes. Install PyTorch (with GPU support if you have one), confirm it works, run a tiny tensor program.

Step 1: Pick your Python environment

You should have Python ≥3.10 from the Python beginner path. Create a fresh virtual environment for this path:

mkdir -p ~/code/ai-learning
cd ~/code/ai-learning
python3 -m venv .venv
source .venv/bin/activate            # macOS / Linux
.venv\Scripts\activate                # Windows

You'll work inside this .venv for the whole path. Always activate it before working.

Step 2: Install PyTorch

Go to pytorch.org/get-started/locally. The site has a config-builder that gives you the exact pip install command for your OS / Python / CUDA.

Typical commands:

CPU only (any platform):

pip install torch torchvision

Linux + NVIDIA GPU (CUDA 12):

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

macOS (Apple Silicon - uses MPS, Apple's Metal-based backend):

pip install torch torchvision
PyTorch on macOS automatically uses MPS when available. Works well for small models; not as fast as CUDA.

Verify:

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

Should print a version (e.g., 2.4.0) and True or False depending on whether CUDA is available.

If you have an NVIDIA GPU and cuda.is_available() is False: the install picked the CPU-only wheel. Reinstall with the CUDA URL.

If you have Apple Silicon:

python -c "import torch; print(torch.backends.mps.is_available())"
Should be True.

Step 3: Install supporting libraries

The rest of the path uses these:

pip install jupyter ipykernel numpy pandas matplotlib
pip install transformers datasets accelerate
pip install sentence-transformers faiss-cpu      # for RAG (page 10)
pip install scikit-learn                          # for utilities
pip install httpx                                 # for serving (page 12)

That's a lot at once. Each tool has its own purpose:

  • transformers - Hugging Face's library; loading and using pre-trained models.
  • datasets - Hugging Face's data-loading library.
  • accelerate - multi-GPU + mixed-precision helper.
  • sentence-transformers + faiss - embeddings + vector search for RAG.
  • scikit-learn - classical ML utilities (data splits, metrics).
  • httpx - HTTP client, used to talk to the model server you build in page 12.

If any installation fails, read the error. Often it's a missing system dependency (e.g., libstdc++); search the error message for guidance.

Step 4: First PyTorch program

Create tensor_hello.py:

import torch

# Create a tensor - PyTorch's central data type
x = torch.tensor([1.0, 2.0, 3.0])
print("tensor:", x)
print("shape:", x.shape)
print("dtype:", x.dtype)

# Some math
y = torch.tensor([4.0, 5.0, 6.0])
print("x + y:", x + y)
print("x · y:", torch.dot(x, y))

# A 2D tensor (matrix)
mat = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print("matrix:")
print(mat)
print("matrix shape:", mat.shape)

# Move to GPU if available
if torch.cuda.is_available():
    x_gpu = x.cuda()
    print("x on GPU:", x_gpu.device)
elif torch.backends.mps.is_available():
    x_mps = x.to("mps")
    print("x on MPS:", x_mps.device)
else:
    print("running on CPU")

Run:

python tensor_hello.py

You should see output like:

tensor: tensor([1., 2., 3.])
shape: torch.Size([3])
dtype: torch.float32
x + y: tensor([5., 7., 9.])
x · y: tensor(32.)
matrix:
tensor([[1., 2.],
        [3., 4.]])
matrix shape: torch.Size([2, 2])
running on CPU

(32 because 1·4 + 2·5 + 3·6 = 32. We'll cover dot products properly in page 03.)

Step 5: Jupyter notebooks (optional but valuable)

ML work happens in notebooks more than scripts. They mix code, output, plots, and prose in one document.

jupyter notebook

Opens a browser. Click "New" → "Python 3". You get a cell-based editor. Each cell runs independently; output appears below it. Great for exploration.

Many tutorials are notebook files (.ipynb). You'll meet them. VS Code also has a built-in notebook UI.

Step 6: Google Colab as a backup

For pages 09+ you'll want a GPU. If you don't have one:

colab.research.google.com gives you a free Jupyter notebook with a GPU (~T4-class) for a few hours per session.

!nvidia-smi              # in Colab - shows the GPU you got

For more compute: Kaggle Notebooks (free), Lambda Labs (paid by the minute), or any cloud provider.

We'll note when a page benefits from GPU.

A note on the AI ecosystem's pace

AI libraries change fast. PyTorch APIs are stable across minor versions but the broader ecosystem (Hugging Face, vLLM, training optimizers) iterates monthly. When in doubt, read the library's current docs, not blog posts from 2023.

Exercise

  1. Verify PyTorch installation:

    python -c "import torch; print(torch.__version__, torch.cuda.is_available() or torch.backends.mps.is_available())"
    

  2. Run tensor_hello.py above. Confirm the output.

  3. Modify it:

     • Create a tensor z = torch.arange(10). Print it. (Range from 0 to 9.)
     • Create a random 3×3 matrix with torch.randn(3, 3). Print it.
     • Compute the matrix's sum (m.sum()) and mean (m.mean()).

  4. (Optional) Set up Jupyter:

    jupyter notebook
    
    Create a new notebook. Run the tensor code in cells.

What you might wonder

"Why PyTorch and not TensorFlow / JAX?" PyTorch is the dominant ML framework as of 2026 - research uses it, most OSS uses it, the job market wants it. JAX has its place (Google ecosystem, transformer research); TensorFlow is mostly legacy. Stick with PyTorch.

"What is CUDA?" NVIDIA's parallel computing platform - the way GPUs run general-purpose code. PyTorch built with CUDA support uses your GPU automatically when you .cuda() tensors. AMD GPUs use ROCm; Apple Silicon uses MPS.

"My GPU isn't supported / I don't have one." Use CPU for pages 01-08. They run fine. For pages 09+, use Google Colab's free tier.

"How much disk space will this take?" ~3-5 GB for PyTorch and core deps. Pre-trained models you download in later pages can add 1-10 GB each. Plan for 30-50 GB free if you'll experiment broadly.

Done

You have:

  • A Python venv with PyTorch installed (CPU or GPU).
  • Supporting libraries (transformers, datasets, sentence-transformers, faiss).
  • Verified that PyTorch can create and operate on tensors.
  • (Optional) Jupyter set up; Colab as a backup.

Next: Tensors →

02 - Tensors

What this session is

About 45 minutes. Tensors are PyTorch's central data type - multi-dimensional arrays, with support for GPU acceleration and automatic differentiation. Almost every line of PyTorch code touches tensors.

What a tensor is

A tensor is a generalized array:

  • A 0-dimensional tensor is a single number (scalar): 7.
  • A 1-D tensor is a vector: [1, 2, 3].
  • A 2-D tensor is a matrix: [[1, 2], [3, 4]].
  • A 3-D tensor is a cube of numbers. (Often used for color images: [height, width, channels].)
  • Higher: a batch of images, a batch of token sequences, etc.

Every tensor has a shape (its size per dimension) and a dtype (the type of each element).

Creating tensors

import torch

# From a Python list
a = torch.tensor([1, 2, 3])
print(a.shape, a.dtype)            # torch.Size([3]) torch.int64

# As floats
b = torch.tensor([1.0, 2.0, 3.0])
print(b.shape, b.dtype)            # torch.Size([3]) torch.float32

# Zeros, ones, random
z = torch.zeros(2, 3)              # 2x3 of zeros
o = torch.ones(2, 3)
r = torch.randn(2, 3)              # random normal (mean=0, std=1)
u = torch.rand(2, 3)               # random uniform [0, 1)
i = torch.arange(0, 10)            # 0, 1, 2, ..., 9

# An identity matrix
I = torch.eye(4)

Default float dtype is float32. Default int dtype is int64. You can specify:

x = torch.zeros(2, 3, dtype=torch.float16)
y = torch.tensor([1, 2, 3], dtype=torch.int32)

Shape and reshape

a = torch.arange(12)
print(a.shape)                     # torch.Size([12])

b = a.reshape(3, 4)                # 3x4 matrix
print(b.shape)                     # torch.Size([3, 4])

c = a.reshape(2, 2, 3)             # 2x2x3 tensor
print(c.shape)                     # torch.Size([2, 2, 3])

reshape(...) doesn't copy data when it can avoid it - it just changes the "view" on the underlying buffer.

The -1 placeholder means "infer this dimension":

a = torch.arange(12)
b = a.reshape(-1, 4)               # 3 rows of 4 (12/4)
c = a.reshape(2, -1)               # 2 rows of 6 (12/2)

Useful in functions where you know all but one dimension.

Indexing

Like NumPy, like Python lists, but extended:

m = torch.arange(12).reshape(3, 4)
# m is:
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]])

m[0]              # first row: tensor([0, 1, 2, 3])
m[0, 0]           # first element: tensor(0)
m[:, 0]           # first column: tensor([0, 4, 8])
m[1:, 2:]         # rows 1+, cols 2+: tensor([[6, 7], [10, 11]])
m[0:2, 0:2]       # 2x2 top-left

Slicing returns a view (shares storage). Modifying the slice modifies the original. Use .clone() if you need an independent copy.
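
A quick demonstration of this view behaviour (a minimal sketch; the values are arbitrary):

import torch

m = torch.arange(12).reshape(3, 4)
row = m[0]            # a view into m's storage
row[0] = 99
print(m[0, 0])        # tensor(99) - the original changed too

safe = m[0].clone()   # an independent copy
safe[1] = -1
print(m[0, 1])        # tensor(1) - the original is untouched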

Arithmetic

Element-wise:

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(a + b)               # [5, 7, 9]
print(a * b)               # [4, 10, 18]    element-wise multiply
print(a ** 2)              # [1, 4, 9]
print(torch.exp(a))        # [e^1, e^2, e^3]
print(torch.sin(a))

Reductions (collapse a dimension):

m = torch.randn(3, 4)
m.sum()                    # scalar
m.sum(dim=0)               # column sums (4 values)
m.sum(dim=1)               # row sums (3 values)
m.mean()
m.max()
m.argmax()                 # index of maximum

dim= is the dimension to reduce over. dim=0 collapses the rows; dim=1 collapses the columns. Confusing the first time; you'll internalize it.

Matrix multiplication

The most-used operation in ML. It is not the same as element-wise *.

A = torch.randn(2, 3)      # 2x3
B = torch.randn(3, 4)      # 3x4
C = A @ B                  # 2x4 - matrix multiply
# or: torch.matmul(A, B)

The @ operator is matrix multiply. Two rules:

  • The inner dimensions must match: (2, 3) @ (3, 4) works because both have 3 in the middle.
  • The result has the outer dimensions: (2, 3) @ (3, 4) → (2, 4).

If they don't match, you get an error. Get the dimensions right first; everything else follows.

Broadcasting

When you operate on tensors of different shapes, PyTorch tries to make them match by broadcasting the smaller one along the matching dimensions:

a = torch.tensor([[1, 2, 3], [4, 5, 6]])     # shape (2, 3)
b = torch.tensor([10, 20, 30])                # shape (3,)

print(a + b)
# tensor([[11, 22, 33],
#         [14, 25, 36]])

b was broadcast across the rows of a. Equivalent to adding [10, 20, 30] to each row.

The rules are precise but the intuition is "align from the right; missing dimensions are filled in by repeating":

a.shape = (2, 3)
b.shape = (3,)        becomes (1, 3), then broadcasts to (2, 3)

When in doubt, print shapes. Most "shape mismatch" errors come from this; once you see the shapes, the fix is usually obvious.

Move tensors to GPU

device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
print("using:", device)

a = torch.randn(1000, 1000).to(device)
b = torch.randn(1000, 1000).to(device)
c = a @ b           # runs on GPU if device is cuda/mps

Tensors and operations have to be on the same device. Mixing CPU and GPU tensors raises an error.

Common idiom: define device once at the top of the script; .to(device) every tensor you create.

NumPy interop

PyTorch tensors and NumPy arrays interoperate:

import numpy as np

n = np.array([1, 2, 3])
t = torch.from_numpy(n)            # tensor sharing memory with the array
back = t.numpy()                   # numpy array sharing memory with the tensor

If the tensor is on CPU, this is free (no copy). On GPU, you have to .cpu() first.

NumPy is the older sibling - PyTorch borrows most of its API conventions from NumPy. If you've used NumPy, PyTorch tensors will feel familiar.
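
For the GPU case, a minimal sketch (assumes a CUDA device is available):

import torch

if torch.cuda.is_available():
    t = torch.randn(3, device="cuda")
    # t.numpy() would raise an error - NumPy arrays live in CPU memory
    n = t.cpu().numpy()           # copy back to CPU, then convert
    print(type(n), n.shape)       # <class 'numpy.ndarray'> (3,)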

Exercise

In a new script tensor_practice.py:

  1. Create a 5×3 tensor of random normal values. Print its shape and mean.

  2. Create the same tensor and add 1.0 to every element. (Hint: just tensor + 1.)

  3. Create a 3×3 identity matrix; create another 3×3 matrix with torch.arange(9).reshape(3, 3).float(). Multiply them with @. Result?

  4. Create a = torch.arange(20).reshape(4, 5). Get the third row. Get the second column. Get the bottom-right 2×2 submatrix.

  5. Broadcasting: create a = torch.zeros(3, 4) and b = torch.tensor([1, 2, 3, 4]). Compute a + b. What shape? What values?

  6. GPU (if available): create two 1000×1000 random matrices. Time how long a @ b takes on CPU vs your device. Use time.time() around the multiplications.

What you might wonder

"Why are tensors not just NumPy arrays?" PyTorch tensors add: GPU support, automatic differentiation (page 04), automatic device placement, and a richer API for ML-specific operations. They're NumPy++.

"What's float32 vs float16 vs bfloat16?" Number formats with different precision/memory trade-offs. float32 (FP32) is the default - 4 bytes per number, lots of precision. float16 and bfloat16 are half-precision (2 bytes); used heavily in training large models for memory savings. Modern GPUs (Volta+) have tensor cores that specifically accelerate these.

"Why both reshape and view?" view requires the data to be contiguous in memory. reshape may copy if needed. Prefer reshape; reach for view only when you've measured it matters.

"My tensors are on different devices and I'm confused." Set a device = ... constant at the top of your script. Always .to(device) after creation. This rule alone eliminates 80% of device-mismatch bugs.

Done

  • Create tensors with various constructors.
  • Reshape, index, slice.
  • Use element-wise arithmetic, matrix multiplication.
  • Use broadcasting confidently.
  • Move tensors between CPU and GPU.

Next: Linear algebra you actually need →

03 - Linear Algebra You Actually Need

What this session is

About 30 minutes. The math for neural networks at the intuitive level. No proofs. By the end you'll know what a dot product, matrix multiply, and gradient are - and what they mean in ML code.

Dot product

Two vectors a and b of the same length:

a · b = a[0]*b[0] + a[1]*b[1] + ... + a[n-1]*b[n-1]

A single number. Measures how aligned the vectors are: large when they point the same way; zero when perpendicular; negative when opposite.

import torch
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(torch.dot(a, b))        # 32.0
# (1*4 + 2*5 + 3*6 = 32)

Why it matters: the simplest neuron computes a dot product between its inputs and its weights, adds a bias, applies a nonlinearity. Every neural network is built up from this.

Matrix multiplication

Treat a matrix as a stack of row vectors (or column vectors). Matrix multiplication A @ B:

  • The entry at row i, column j of A @ B is the dot product of row i of A and column j of B.

Shape rule: (m, k) @ (k, n) = (m, n). The inner dimensions match; the outer dimensions become the result's shape.

A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
B = torch.tensor([[5.0, 6.0],
                  [7.0, 8.0]])
print(A @ B)
# tensor([[19., 22.],
#         [43., 50.]])
# 1*5 + 2*7 = 19, 1*6 + 2*8 = 22, etc.

Why it matters: an entire neural network layer is output = input @ weights + bias. Matmul is what GPUs are designed to accelerate; everything else is supporting infrastructure.

Transpose

Swap rows and columns:

A = torch.tensor([[1, 2, 3], [4, 5, 6]])     # shape (2, 3)
print(A.T)                                    # shape (3, 2)
# tensor([[1, 4],
#         [2, 5],
#         [3, 6]])

Often used to make shapes line up for matrix multiplication.

A neuron

A single artificial neuron:

output = activation(input · weights + bias)
  • input and weights are vectors of the same length.
  • bias is a single number.
  • activation is a nonlinear function (relu, sigmoid, tanh, etc.).

A layer of n neurons is just n of these stacked - equivalent to one big matmul:

batch_size = 4
input_dim = 10
output_dim = 5

x = torch.randn(batch_size, input_dim)           # (4, 10)
W = torch.randn(input_dim, output_dim)           # (10, 5)
b = torch.randn(output_dim)                      # (5,)
out = x @ W + b                                  # (4, 5)

x @ W is (4, 10) @ (10, 5) → (4, 5). The bias b broadcasts across the batch.

Welcome - that's what a dense layer (also called a "linear layer" or "fully-connected layer") does. Everything else is variations.

Nonlinearity

Without a nonlinearity between layers, stacking matmuls collapses to one matmul (matrix multiplication is linear). A non-linear function applied element-wise restores the network's power:

import torch.nn.functional as F

x = torch.randn(4, 10)
h = F.relu(x @ W + b)         # ReLU: max(0, x)

Common nonlinearities:

  • ReLU - max(0, x). Cheap, effective, the default for hidden layers.
  • GELU - a smoother ReLU. Used heavily in transformers.
  • Sigmoid - 1 / (1 + exp(-x)). Outputs in (0, 1). Used for binary classification outputs.
  • Softmax - normalizes a vector to sum to 1. Used for multi-class classification outputs.

You'll mostly use ReLU or GELU in hidden layers; softmax in output.
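
To see what these functions do to actual numbers, a quick sketch:

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 3.0])
print(F.relu(x))              # tensor([0., 0., 3.]) - negatives clipped to zero
print(torch.sigmoid(x))       # each value squashed into (0, 1)
print(F.gelu(x))              # like ReLU, but smooth around zero
print(F.softmax(x, dim=0))    # non-negative and sums to 1 - a probability distribution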

Gradient (intuitively)

A gradient is "the slope of a function at a point, in N dimensions." For a single-variable function f(x), the gradient is the derivative f'(x). For a multi-variable function L(w₁, w₂, ..., wₙ), it's a vector - one partial derivative per variable.

Why it matters: training a network is "minimize the loss function." The gradient of the loss with respect to the weights tells you "if I nudge each weight in the direction opposite the gradient, the loss decreases." That's gradient descent.

You don't compute gradients by hand. PyTorch's autograd does it for you - every operation you do on tensors with requires_grad=True is tracked, and .backward() walks the graph computing gradients automatically.

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward()                  # computes dy/dx
print(x.grad)                 # 2*x + 3 = 7

Page 05 uses this in a training loop. For now, just know: gradients let you adjust weights to reduce loss.

Vectors in geometry vs in ML

In math classes, vectors had geometric meaning - points, directions, magnitudes. In ML, a vector is just a list of features. A user's embedding might be 1536 numbers - no geometric interpretation, but the "directions" still capture meaningful similarities (cosine of the angle between two user embeddings = how similar they are in the model's learned space).

The math is the same - the interpretation is "feature-space similarity," not "physical space."

Cosine similarity

The dot product of two normalized vectors (each with length 1):

import torch.nn.functional as F
a = torch.randn(100)
b = torch.randn(100)
sim = F.cosine_similarity(a, b, dim=0)
print(sim)        # between -1 and 1

A standard "how similar are these two embeddings" metric. Used heavily in RAG (page 10).

What you'll never need from a math course

  • Eigenvalues / eigenvectors (occasionally relevant; not for daily work).
  • Singular Value Decomposition (used in LoRA fine-tuning page 09; we'll cover what you need).
  • Convex analysis. Calculus of variations. Differential geometry.

Don't get nerd-sniped by Twitter saying you need to "understand linear algebra before doing ML." You need the operations on this page. The rest is for research, not engineering.

Exercise

  1. Dot product: create two random vectors of length 100. Compute their dot product manually (loop with sum) AND with torch.dot. Verify they match.

  2. Matmul shape check: create A of shape (3, 5) and B of shape (5, 7). What's the shape of A @ B? Verify in code.

  3. A neuron from scratch:

    import torch.nn.functional as F
    x = torch.tensor([1.0, 2.0, 3.0])
    w = torch.tensor([0.1, 0.2, 0.3])
    b = torch.tensor(0.5)
    out = F.relu(torch.dot(x, w) + b)
    print(out)
    
    What's the value? Why?

  4. Batch: create X of shape (8, 3) (batch of 8 inputs, each 3-dim). Create W of shape (3, 5). Compute X @ W and inspect the shape. What does each row represent?

  5. Gradient: define f(x) = x³ - 4x² + 7x - 1. At x = 2.0, compute the gradient using PyTorch. (Hint: build y from a tensor with requires_grad=True, call y.backward(), then read x.grad. The math answer is 3x² - 8x + 7 = 3 at x=2.)

What you might wonder

"I see lots of torch.bmm in code. What's that?" Batched matrix multiplication - when you have a batch dimension. bmm is (B, m, k) @ (B, k, n) → (B, m, n). Common in transformers' attention.

"What's torch.einsum?" Einstein summation notation - a powerful, terse way to express tensor operations. torch.einsum("ij,jk->ik", A, B) is matmul. Worth learning once you've seen the same matmul pattern enough times.

"How does a network 'know' which way to adjust weights?" The gradient gives the direction of steepest increase. Going opposite the gradient (gradient descent) decreases the loss locally. That's all. The magic is that this simple rule works in millions of dimensions.

Done

  • Dot product, matrix multiplication, transpose.
  • A single neuron and a dense layer.
  • Common nonlinearities.
  • What a gradient is and why it matters.
  • Recognizing what's NOT essential math for ML engineering.

Next: Your first neural network →

04 - Your First Neural Network

What this session is

About 45 minutes. Build a small multi-layer perceptron (MLP) in PyTorch - the simplest interesting neural network. You'll see how nn.Module, layers, forward pass, and autograd compose.

The plan

We'll build a network that classifies handwritten digits (MNIST - the "hello world" of ML). 28×28 grayscale images → one of 10 digit classes.

We won't train it yet (page 05 covers training). This page is about defining the model.

nn.Module: PyTorch's central abstraction

Every model in PyTorch is a class extending torch.nn.Module. Define your layers in __init__; define how data flows through them in forward.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)     # input layer: 784 → 128
        self.fc2 = nn.Linear(128, 64)       # hidden:      128 → 64
        self.fc3 = nn.Linear(64, 10)        # output:      64 → 10

    def forward(self, x):
        # x is shape (batch_size, 784)
        x = F.relu(self.fc1(x))             # → (batch, 128)
        x = F.relu(self.fc2(x))             # → (batch, 64)
        x = self.fc3(x)                     # → (batch, 10) - raw logits
        return x

model = MLP()
print(model)

Run this. You'll see:

MLP(
  (fc1): Linear(in_features=784, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=10, bias=True)
)

PyTorch auto-prints the architecture.

What each line does

nn.Linear(in_features, out_features) - a fully-connected layer. Internally: a weight matrix and a bias vector. When you call it with input x, it computes the equivalent of x @ W + b (PyTorch actually stores the weight transposed, with shape (out_features, in_features), but the effect is the same).

F.relu(x) - element-wise ReLU activation. max(0, x). Adds non-linearity (page 03).

The last layer produces logits - raw scores, one per class. We don't apply softmax here; the loss function (page 05) does it more numerically stably.

Run a forward pass

x = torch.randn(4, 784)            # batch of 4 random "images"
out = model(x)
print(out.shape)                   # torch.Size([4, 10])
print(out[0])                      # 10 logits for the first sample

model(x) calls forward(x) under the hood. The output is (batch_size, num_classes) - 10 logits per input.

To turn logits into probabilities:

probs = F.softmax(out, dim=1)
print(probs[0])
print(probs.sum(dim=1))            # each row sums to 1

To get the predicted class (argmax over the classes):

preds = out.argmax(dim=1)
print(preds)                       # tensor of class indices

The model is randomly initialized - predictions are garbage. Training (page 05) fixes that.

Counting parameters

total = sum(p.numel() for p in model.parameters())
print(f"total params: {total:,}")    # ~109,000

.parameters() yields all the learnable tensors (weights + biases). For this MLP:

  • fc1: 784 × 128 weights + 128 biases = 100,480
  • fc2: 128 × 64 + 64 = 8,256
  • fc3: 64 × 10 + 10 = 650
  • Total: 109,386

For comparison: GPT-2 (small) is 124M params. GPT-3 is 175B. Modern open-source LLMs (Llama 3 70B) are 70B. Parameter count is one rough proxy for model capability (and one direct proxy for memory cost).

A more compact form: nn.Sequential

For simple feed-forward stacks:

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

Layers in order. Use Sequential for "just call these in sequence" cases. Use the class form when forward needs anything more than that (skip connections, conditional flow).
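
For example, a residual ("skip") connection can't be written with Sequential alone - the forward pass has to add the input back in. A minimal sketch, with arbitrary layer sizes:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        h = self.fc2(h)
        return F.relu(h + x)       # skip connection: add the input back in

block = ResidualBlock()
print(block(torch.randn(4, 64)).shape)    # torch.Size([4, 64])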

Activations as modules vs functions

Two ways to apply ReLU:

# As a function (no parameters):
x = F.relu(x)

# As a module (in Sequential):
nn.ReLU()

Both work. Functions are cleaner inside forward. Modules are required inside Sequential. Use whichever fits.

Initialization

PyTorch initializes weights with sensible defaults (Kaiming uniform for linear layers). For most cases this is fine. To override:

for p in model.parameters():
    if p.dim() > 1:
        nn.init.kaiming_normal_(p)

Initialization matters for very deep networks; modern architectures (with normalization layers) are robust enough that the default usually works.

Move to GPU

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(4, 784).to(device)
out = model(x)

.to(device) moves all parameters. After that, your input must also be on the same device.

Save and load

# Save just the parameters
torch.save(model.state_dict(), "mlp.pt")

# Load into the same architecture
model2 = MLP()
model2.load_state_dict(torch.load("mlp.pt"))
model2.eval()

state_dict() returns a dict of {name: tensor} for every parameter. Saving the state dict (not the whole model object) is the recommended pattern - portable across code changes.

.eval() switches the model into evaluation mode (matters for layers like dropout and batch norm that behave differently during training vs inference).

Common architectures (very brief preview)

The MLP is the simplest. You'll meet others:

  • CNN (Convolutional Neural Network) - for images. Layers detect local patterns (edges, textures) at multiple scales.
  • RNN / LSTM - for sequences (older approach). Largely replaced by transformers.
  • Transformer - attention-based, the modern default for language and increasingly vision. Page 07 covers it.

For this beginner path: MLPs for the early pages; transformers for the LLM pages.
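
Purely as a preview (you don't need this yet), a tiny CNN for MNIST-shaped input might look like the sketch below; the channel sizes are arbitrary choices:

import torch
import torch.nn as nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),    # (B, 1, 28, 28) → (B, 16, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),                               # → (B, 16, 14, 14)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),   # → (B, 32, 14, 14)
    nn.ReLU(),
    nn.MaxPool2d(2),                               # → (B, 32, 7, 7)
    nn.Flatten(),                                  # → (B, 32 * 7 * 7)
    nn.Linear(32 * 7 * 7, 10),                     # → (B, 10) logits
)

print(tiny_cnn(torch.randn(4, 1, 28, 28)).shape)   # torch.Size([4, 10])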

Exercise

Create mlp_demo.py:

import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    def __init__(self, input_dim=784, hidden=128, num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("device:", device)

    model = MLP().to(device)
    print(model)
    total = sum(p.numel() for p in model.parameters())
    print(f"params: {total:,}")

    # Forward pass on a fake batch
    x = torch.randn(8, 784, device=device)
    logits = model(x)
    print("logits shape:", logits.shape)
    probs = F.softmax(logits, dim=1)
    print("sample probs:", probs[0])
    print("sample sum (should be ~1):", probs[0].sum().item())


if __name__ == "__main__":
    main()

Run it. Verify everything makes sense.

Stretch: rewrite the model as nn.Sequential. Same output, fewer lines.

Bigger stretch: add a dropout layer (nn.Dropout(p=0.2)) between the hidden layers. Dropout randomly zeros some activations during training; an effective regularizer. Print the model to see the new architecture.

What you might wonder

"What's super().__init__() for?" Calls nn.Module's constructor - sets up internal bookkeeping so .parameters() and .to() work on your model. Always call it first in __init__.

"Why F.relu vs nn.ReLU?" Same thing. Functional form is shorter inside forward; module form composes with Sequential and is registered as a child module (so it shows in print(model)).

"What's a 'logit'?" A pre-softmax output. The raw score the model assigns to each class. The class with the largest logit is the prediction. Logits aren't probabilities; softmax converts them.

"Should I worry about backpropagation math?" No - PyTorch's autograd handles it. You define the forward pass; gradients are computed automatically. Page 05 shows the loop.

Done

  • Define a model by extending nn.Module.
  • Use nn.Linear, F.relu, nn.Sequential.
  • Run a forward pass.
  • Count parameters.
  • Save and load state_dict.
  • Move models to GPU.

Next: Training loop →

05 - Training Loop

What this session is

About an hour. Train the MLP from page 04 to recognize MNIST digits. By the end you'll have written a full training loop - the same shape as every PyTorch training loop in existence.

The pattern

Every training loop is:

For each epoch (pass over the data):
    For each batch:
        1. Forward pass - compute predictions
        2. Compute loss - how wrong are we?
        3. Backward pass - compute gradients
        4. Optimizer step - adjust weights

That's it. The rest is bookkeeping.

Load MNIST

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                          # PIL image → tensor
    transforms.Normalize((0.1307,), (0.3081,)),     # standardize: (x - mean) / std
])

train_ds = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_ds  = datasets.MNIST(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_ds, batch_size=512)

Three pieces:

  • Dataset - knows how to load and transform one example.
  • DataLoader - wraps the dataset, batches it, optionally shuffles.
  • Transform - preprocessing applied to each example.

torchvision provides MNIST out of the box. First run downloads ~10MB; subsequent runs use the cached copy.
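
Before training, it's worth pulling one batch out of the loader and checking the shapes - a quick sketch, continuing from the loaders above:

x, y = next(iter(train_loader))
print(x.shape)     # torch.Size([64, 1, 28, 28]) - 64 images, 1 channel, 28x28 pixels
print(y.shape)     # torch.Size([64])            - one integer label (0-9) per image
print(y[:10])      # e.g. tensor([5, 0, 4, 1, 9, 2, 1, 3, 1, 4]) - varies with shuffling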

The full training script

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Reproducibility
torch.manual_seed(42)

# Device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"device: {device}")

# Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
train_ds = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
test_ds  = datasets.MNIST(root="./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader  = DataLoader(test_ds, batch_size=512)

# Model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)             # flatten 28x28 → 784
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

model = MLP().to(device)

# Loss + optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train
for epoch in range(3):
    model.train()
    total_loss = 0
    correct = 0
    n = 0
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)

        # 1. Forward
        logits = model(x)

        # 2. Loss
        loss = criterion(logits, y)

        # 3. Backward
        optimizer.zero_grad()                  # clear gradients from last step
        loss.backward()                        # compute gradients

        # 4. Optimizer step
        optimizer.step()

        total_loss += loss.item() * x.size(0)
        correct += (logits.argmax(dim=1) == y).sum().item()
        n += x.size(0)

    print(f"epoch {epoch}: train loss {total_loss/n:.4f}, acc {correct/n:.4f}")

# Test
model.eval()
correct = 0
n = 0
with torch.no_grad():
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        correct += (logits.argmax(dim=1) == y).sum().item()
        n += x.size(0)

print(f"test accuracy: {correct/n:.4f}")

Run. After ~30 seconds (on CPU) or ~5 seconds (on GPU), you should see ~97% test accuracy. Your first trained model.

What each line is doing

x.view(x.size(0), -1) - flatten the 28x28 images into 784-length vectors. The -1 infers the dimension. x.size(0) is the batch dimension.

nn.CrossEntropyLoss - standard loss for classification. Internally: softmax + negative log-likelihood. Stable and standard.

optimizer = torch.optim.Adam(...) - the optimizer. Adam is the most-used optimizer for modern ML. Other options: SGD (stochastic gradient descent - classic but needs more tuning), AdamW (Adam with corrected weight decay).

optimizer.zero_grad() - clear gradients from the last batch. PyTorch accumulates gradients by default; you must clear them explicitly each step. Forget this and gradients from previous batches pile up into every update.

loss.backward() - autograd walks the computation graph backward, computing gradients for every parameter that participated in computing loss. Stores them in parameter.grad.

optimizer.step() - applies the gradient update. For Adam: complex math; for SGD: param = param - lr * param.grad.

model.train() vs model.eval() - switches the model's internal mode. Affects layers like dropout and batch norm that behave differently during training vs inference. Always set the right mode.

with torch.no_grad(): - disables autograd. Inference doesn't need gradients; this skips the bookkeeping and uses less memory.

What the loss tells you

Training loss going down = model is fitting the training data. Plateauing means we've hit the model's capacity (or the optimizer is stuck - try different hyperparams).

Test accuracy is what you actually care about - performance on data the model didn't see during training. If train loss keeps dropping but test accuracy stops improving, you're overfitting - memorizing the training data.

Mitigations: more data, regularization (dropout, weight decay), smaller model, early stopping.
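
A sketch of what two of those mitigations look like in code - dropout inside the model and weight decay on the optimizer (the values here are arbitrary starting points, not tuned):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),            # (B, 1, 28, 28) → (B, 784)
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),       # randomly zero 20% of activations during training
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 10),
)

# AdamW applies weight decay (a penalty that shrinks weights toward zero)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)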

Hyperparameters

The numbers you set that aren't learned:

  • Learning rate (lr=1e-3) - how big a step the optimizer takes. Too high → unstable. Too low → slow. 1e-3 is a great starting point for Adam.
  • Batch size (64) - larger = smoother gradients, more memory; smaller = noisier gradients, sometimes generalizes better.
  • Epochs - how many passes over the data. More = better fit (until overfitting).
  • Architecture - depth, width, normalizations.

These need tuning. For MNIST, defaults work. For real problems, expect to iterate.

Save the model

torch.save(model.state_dict(), "mnist_mlp.pt")

Load later:

model = MLP().to(device)
model.load_state_dict(torch.load("mnist_mlp.pt"))
model.eval()

What this scales to

The training loop pattern above is identical for huge models - the only thing that changes is the model definition. Add a few wrinkles for big-model training (mixed precision, gradient accumulation, distributed, checkpointing) and you have what a real LLM training script looks like.
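
For a taste of two of those wrinkles, here is a hedged sketch of the inner loop with mixed precision and gradient accumulation on CUDA - not needed for MNIST, and the accumulation factor of 4 is an arbitrary example (model, criterion, optimizer, train_loader, device as above):

scaler = torch.cuda.amp.GradScaler()
accum_steps = 4                                    # effective batch = 4 * batch_size

for step, (x, y) in enumerate(train_loader):
    x, y = x.to(device), y.to(device)

    with torch.cuda.amp.autocast():                # forward + loss in half precision
        loss = criterion(model(x), y) / accum_steps

    scaler.scale(loss).backward()                  # gradients accumulate across steps

    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                     # unscale gradients, apply the update
        scaler.update()
        optimizer.zero_grad()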

Exercise

  1. Run the script. Get ~97% test accuracy.

  2. Tweak hyperparameters:

     • Set lr=1e-2. Does it train? (Probably the loss goes to NaN - too high.)
     • Set lr=1e-5. (Trains, but slowly.)
     • Increase epochs to 10. Does test accuracy improve? Plateau? Degrade (overfit)?

  3. Architecture changes:

     • Add a third hidden layer of size 128.
     • Increase the hidden sizes to 256, 128.
     • Watch how the parameter count and accuracy change.

  4. Visualize: plot the per-epoch training loss with matplotlib. (Or use TensorBoard - pip install tensorboard, then tensorboard --logdir runs/.)

  5. Stretch: instead of an MLP, try a small CNN. A 2-layer CNN beats this MLP at >98% accuracy. Look up nn.Conv2d. Don't worry if it doesn't work on the first try - convolutions take some adjusting.

What you might wonder

"Why Adam over SGD?" Adam adapts the learning rate per-parameter; converges faster on most problems without tuning. SGD with momentum can outperform on specific architectures (vision CNNs) when carefully tuned. For getting started: Adam.

"How big should my batch be?" For GPU work, "as big as fits in memory" is a common heuristic. Common sizes: 32, 64, 128, 256. Larger batches give smoother gradients but you might need more epochs to converge.

"What does loss.item() do?" Extracts a Python float from a 0-dim tensor. Detached from the graph (no gradient tracking). Use when you want a number for logging.

"Why is my loss not going down?" Common causes: - Learning rate too high (loss NaN) or too low (loss flat). - Wrong loss function for your task. - Bug in the model (wrong shapes - print them). - Data not normalized.

Print loss every step initially. If it's not going down within ~100 steps on MNIST, something's wrong.

Done

  • The four-step training loop pattern.
  • DataLoader for batched iteration.
  • nn.CrossEntropyLoss + Adam optimizer.
  • optimizer.zero_grad() / loss.backward() / optimizer.step().
  • model.train() vs model.eval().
  • Save/load weights.

Next: Inference and saving →

06 - Inference and Saving

What this session is

About 30 minutes. The other half of training - using a trained model. Loading saved weights, running predictions, the eval-mode + no-grad pattern.

Inference vs training

Training: forward pass + loss + backward pass + optimizer step. Slow, memory-hungry, uses gradients.

Inference: forward pass only. Fast, cheap, no gradients.

The difference matters because most of a model's lifetime is inference - users sending requests; you predicting. Optimizing inference is its own discipline (page 12).

The basic pattern

import torch

model = MLP()                              # the same class you trained with
model.load_state_dict(torch.load("mnist_mlp.pt"))
model.eval()                               # IMPORTANT: switch to eval mode

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# An input
x = torch.randn(1, 1, 28, 28).to(device)   # one fake "image"

with torch.no_grad():                       # IMPORTANT: skip gradient tracking
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred = logits.argmax(dim=1)

print(f"predicted: {pred.item()}, confidence: {probs.max().item():.4f}")

Three things you must remember:

  1. Load the weights - load_state_dict into a freshly-constructed model with the same architecture.
  2. model.eval() - disables dropout, freezes batch-norm running statistics.
  3. with torch.no_grad(): - disables gradient tracking. Faster, uses less memory.

Forgetting any of these gives subtle bugs.

Predict on a real image

from PIL import Image
from torchvision import transforms

# The same transform you used during training
transform = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

img = Image.open("my_digit.png")
x = transform(img).unsqueeze(0).to(device)    # add batch dim → (1, 1, 28, 28)

with torch.no_grad():
    logits = model(x)
    pred = logits.argmax(dim=1).item()

print(f"predicted digit: {pred}")

Key point: the inference preprocessing must match training. Same resize, same normalize, same color space. Mismatched preprocessing is the #1 silent-bug source in ML - the model still produces a prediction, just a bad one.

Batching for speed

If you have many inputs, predict on them in batches - much faster than one-at-a-time:

images = [transform(Image.open(p)) for p in paths]
batch = torch.stack(images).to(device)         # (N, 1, 28, 28)

with torch.no_grad():
    logits = model(batch)
    preds = logits.argmax(dim=1)

for path, p in zip(paths, preds):
    print(path, p.item())

Batches let the GPU keep busy. For latency-sensitive online inference, you might still process single inputs; for throughput-sensitive batch jobs, batch as much as memory allows.

Save more than weights

state_dict() is just the parameters. For a fully-recoverable training session, save more:

torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": loss,
}, "checkpoint.pt")

# Resume:
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
start_epoch = ckpt["epoch"] + 1

For production deployment, save just the model weights. For "I want to resume training tomorrow," save the full checkpoint.

TorchScript and ONNX (briefly)

For shipping models, two portability formats:

  • TorchScript - torch.jit.script(model) or torch.jit.trace(model, example_input) produces a deployment-friendly version. Can run without Python.
  • ONNX - open standard. torch.onnx.export(model, ...) produces a file readable by many runtimes (ONNX Runtime, TensorRT, browsers).

Beyond beginner scope; mentioned because deployment paths sometimes need them.

For most cases, deploying a PyTorch model directly (page 12) is fine.
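
If you're curious, a minimal sketch of both exports for the MLP from this page (run on CPU; the file names are arbitrary):

import torch

model = model.cpu().eval()
example = torch.randn(1, 1, 28, 28)          # an example input with the right shape

# TorchScript: trace the model by running it once on the example input
traced = torch.jit.trace(model, example)
traced.save("mlp_traced.pt")

# ONNX: export a portable graph that other runtimes can load
torch.onnx.export(model, example, "mlp.onnx",
                  input_names=["image"], output_names=["logits"])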

Inference performance: the gotchas

A few things that catch people:

  • First inference is slow. PyTorch JIT-compiles kernels on first use. Warm up with a dummy forward pass before timing.
  • GPU work is asynchronous. Operations are queued on the device and run later. To time them accurately, call torch.cuda.synchronize() before reading the clock.
  • with torch.no_grad(): matters even for small inferences. Saves memory; can be 2x faster.
  • torch.set_num_threads(1) for CPU inference can speed up small models by avoiding thread overhead.

A complete example

"""
Load a trained MNIST MLP and predict on a single image.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


def main(image_path: str):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = MLP().to(device)
    model.load_state_dict(torch.load("mnist_mlp.pt", map_location=device))
    model.eval()

    transform = transforms.Compose([
        transforms.Grayscale(),
        transforms.Resize((28, 28)),
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])

    img = Image.open(image_path)
    x = transform(img).unsqueeze(0).to(device)

    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)
        pred = logits.argmax(dim=1).item()
        confidence = probs.max().item()

    print(f"predicted: {pred} (confidence: {confidence:.4f})")


if __name__ == "__main__":
    import sys
    main(sys.argv[1])

Run: python infer.py my_digit.png.

Exercise

  1. Train a model from page 05. Save its weights as mnist_mlp.pt.

  2. Write infer.py above. Load the weights. Get a digit image (download one from Google or draw one in Paint, save as PNG). Run prediction.

  3. Measure speed:

    import time

    # Warm up - the first forward passes include one-time setup costs
    for _ in range(3):
        with torch.no_grad():
            model(x)
    if device == "cuda":
        torch.cuda.synchronize()       # wait for queued GPU work before timing

    t0 = time.time()
    for _ in range(1000):
        with torch.no_grad():
            _ = model(x)
    if device == "cuda":
        torch.cuda.synchronize()

    elapsed = time.time() - t0
    print(f"{elapsed * 1000 / 1000:.3f} ms / inference")   # total ms / 1000 runs
    

  4. Stretch: load multiple images at once into one batch. Time forward on the batch vs looping single-images. The batch is much faster per-image - that's why batching matters.

What you might wonder

"Why map_location=device in torch.load?" Loads tensors directly to the target device. Without it, PyTorch tries to load to the device they were saved on, which fails if you trained on GPU but are inferring on CPU.

"What's torch.compile?" PyTorch 2's JIT compiler. model = torch.compile(model) can give 1.5x-3x speedup. Sometimes flaky; experiment carefully. Mentioned for awareness.

"Should I use .half() or .bfloat16() for inference?" For modern GPUs (Volta and newer), yes - half-precision inference is ~2x faster with negligible quality drop for most models. model = model.half() then x = x.half(). Test accuracy afterward; some models tolerate half-precision better than others.

"What about quantization (INT8, INT4)?" Even more aggressive than half-precision. Used heavily for LLM inference (page 12). Beyond beginner scope on this page.

Done

  • Load weights into a model architecture.
  • Use model.eval() + with torch.no_grad():.
  • Preprocess inference inputs the same way as training inputs.
  • Batch inference for throughput.
  • Save/load full training checkpoints.

Next: Transformers and tokenization →

07 - Transformers and Tokenization

What this session is

About an hour. What an LLM actually does - at the level needed to use, fine-tune, and serve them. The architecture, the tokenization step that confuses everyone, the autoregressive generation loop.

This page is dense. Plan to re-read.

The big picture

A language model:

  1. Takes a sequence of tokens (integer IDs representing chunks of text).
  2. Predicts a probability distribution over the next token.
  3. You sample one, append it, and repeat.

That's it. The clever part - what makes LLMs work - is the architecture that produces the next-token prediction. That architecture is the transformer.

Tokenization

Text → token IDs. The model never sees characters or words; it sees integers.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, world!"
ids = tok.encode(text)
print(ids)
# [15496, 11, 995, 0]
print([tok.decode([i]) for i in ids])
# ['Hello', ',', ' world', '!']

Each token is roughly a "subword piece." Common words → single token. Rare words → multiple. The tokenizer learned its vocabulary during the model's training; you can't change it.

Why subwords: vocabulary size matters. Word-level vocabulary needs hundreds of thousands of entries (with new ones constantly appearing). Character-level produces very long sequences. Subword (BPE, WordPiece, SentencePiece) is the compromise: 32k-256k tokens covering nearly any text.

Practical implications:

  • Token counts are not word counts. "I am happy" = 3 tokens; "antidisestablishmentarianism" = many.
  • Prices are per-token; latency is per-token.
  • Different models have different tokenizers; the same text → different token counts.
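
A quick sketch that makes the last point concrete (both tokenizers download on first use, and the exact counts will differ):

from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization is not the same as splitting on spaces."
print(len(gpt2_tok.encode(text)))   # one count for GPT-2's BPE vocabulary...
print(len(bert_tok.encode(text)))   # ...a different count for BERT's WordPiece vocabulary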

What a transformer does

Inside the model, each token ID becomes an embedding - a vector of ~768 to ~12000 dimensions (depending on the model).

A transformer layer processes a sequence of these vectors and produces an updated sequence of the same shape. The crucial operation is attention - every output position is a weighted sum of all input positions (and itself), where the weights are computed from the inputs themselves.

This lets the model "look at" other parts of the sequence when generating each output. Long-range dependencies (across thousands of tokens) become tractable.

Stack many such layers (~12-100), add positional encodings so the model knows token order, end with a linear projection back to the vocabulary, apply softmax - you have the next-token distribution.

The full math is in the AI Systems senior path, Deep Dive 07. For now, the operational view: a transformer transforms a sequence of token embeddings into a probability distribution over the next token.

Causal (autoregressive) vs masked

Two families:

  • Causal / autoregressive models (GPT-family, Llama, Mistral, Gemma) - each position attends only to positions before it. Generates left-to-right. Used for language generation.

  • Masked models (BERT-family) - every position attends to every other. Used for understanding (classification, NER, embeddings).

If you're working with chatbots, code completion, RAG - causal. Embedding for retrieval - masked. Many modern open-source LLMs are causal (the GPT-style architecture won).

The generation loop

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The quick brown fox"
input_ids = tok.encode(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=30,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
    )

print(tok.decode(output[0]))

What's happening:

  1. Tokenize the prompt into IDs.
  2. Call model.generate(...) - a helper that wraps the basic predict-and-append loop.
  3. The model generates 30 more tokens, sampling from each next-token distribution.
  4. Decode the resulting IDs back to text.

Sampling parameters:

  • temperature - how peaked the distribution is. 0 = pick the most likely (greedy). Higher = more diverse, less predictable. 0.7-1.0 is typical.
  • top_p (nucleus) - only consider tokens whose cumulative probability is up to p. Avoids low-probability "weird" tokens.
  • top_k - only consider the k most-likely tokens. Cruder, but works.
  • max_new_tokens - when to stop.
  • stop - explicit stop strings.

Different sampling strategies → different output styles. Greedy is deterministic but often repetitive. Top-p sampling is the modern default.
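
To demystify generate(), here is a minimal sketch of the same loop written by hand, with greedy decoding (uses the tok and model objects loaded above):

import torch

ids = tok.encode("The quick brown fox", return_tensors="pt")

with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits                 # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()           # greedy: most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)   # append and repeat

print(tok.decode(ids[0]))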

What "small" and "big" mean

Some calibration:

  • GPT-2 small: 124M params. Fits in 500MB. Runs on a laptop.
  • Llama 3 8B: 8 billion. 16GB at FP16. Single high-end GPU.
  • Llama 3 70B: 70 billion. 140GB at FP16. Multiple GPUs (or quantized down to 4-bit for a single ~40GB GPU).
  • GPT-4-class frontier models: ~hundreds of billions to trillions (rumors; not public). Many GPUs.

The pattern: 10x bigger → noticeably smarter on hard tasks. Quality scales with parameters + training data + compute (the "scaling laws").

For learning, GPT-2 / Llama 3 8B (or smaller) suffice.

Context length

The maximum sequence length the model can attend over. GPT-2 was 1024. Llama 3 is 8192; Llama 3.1 and later extend to 128k. Frontier models claim 200k+.

Two implications:

  • Attention compute is quadratic in context length (O(n²)). 100k context is 10000x more attention compute than 1k.
  • Practical context isn't the same as advertised context. A model trained on long contexts doesn't necessarily use the middle well - research papers ("Lost in the Middle") show degradation. RAG (page 10) mitigates this.

Embeddings (preview)

Transformer models can also be used to produce embeddings - vector representations of whole pieces of text. These are useful beyond generation: semantic search, classification, clustering.

from sentence_transformers import SentenceTransformer
m = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["a cat sat on a mat", "the dog ran"]
embeddings = m.encode(texts)            # shape (2, 384) for this model

Cosine similarity between two embeddings ≈ semantic similarity. Page 10 builds RAG on this.
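To see that in code (reusing embeddings from the snippet above):

import numpy as np

a, b = embeddings[0], embeddings[1]
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)        # closer to 1.0 = more semantically similar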

Exercise

  1. Run the GPT-2 generation example above. Try different prompts. Vary temperature from 0.1 to 1.5. Note how it changes.

  2. Inspect tokenization:

    for text in ["hello", "antidisestablishmentarianism", "I'm fine.", "🦀"]:
        ids = tok.encode(text)
        print(f"{text!r} -> {ids} -> {[tok.decode([i]) for i in ids]}")
    
    Notice how rare words and emoji become multiple tokens.

  3. Greedy decoding:

    output = model.generate(input_ids, max_new_tokens=30, do_sample=False)
    
    Run twice with the same prompt. Output is identical (deterministic). Then with do_sample=True, different each time.

  4. (Stretch - GPU helpful): try a small open-source LLM. Hugging Face Hub: search gpt2-medium, microsoft/phi-2, meta-llama/Llama-3.2-1B. (Llama gates require accepting a license on HF.) Load and generate. Note the quality difference.

What you might wonder

"Why is the same word sometimes one token and sometimes two?" Subword tokenizers split based on frequency in their training data. " happy" (with leading space) and "happy" (without) are distinct tokens. Case matters too. Don't fight it; understand it.

"What's a 'chat model' vs a 'base model'?" Base models are trained on raw text. Chat models are fine-tuned (page 09) with conversation data + safety training. For "ask a question, get an answer" use chat models. For raw text completion or further fine-tuning, base models.

"What's the actual difference between GPT-2 and modern LLMs architecturally?" Mostly: more parameters, more training data, more compute. Architectural tweaks (rotary positional encoding, grouped-query attention, SwiGLU activations, RMSNorm) are real but modest. The scaling matters more than the architecture changes.

"Should I implement attention from scratch?" Once, for understanding. Andrej Karpathy's "Let's build GPT" video walks you through it. Then use library implementations for production.

Done

  • Understand the token → embedding → transformer → next-token-prob pipeline.
  • Use a tokenizer; understand subword units.
  • Generate text with sampling parameters.
  • Know causal vs masked models.
  • Have a calibration of model sizes.

Next: Hugging Face Transformers →

08 - Hugging Face Transformers

What this session is

About 30 minutes. Hugging Face is the GitHub of AI models. The transformers library makes using thousands of pre-trained models a 3-line operation.

The library

pip install transformers

(You did this in page 01.) The library provides three main classes you'll use:

  • AutoTokenizer - load any model's tokenizer.
  • AutoModel / AutoModelForCausalLM / AutoModelForSequenceClassification / etc. - load a model. The AutoModelFor... variants add task-specific heads.
  • pipeline - a high-level helper that combines tokenization + model + post-processing into one call.

The simplest possible usage: pipeline

from transformers import pipeline

# Text classification
clf = pipeline("sentiment-analysis")
print(clf("I love this!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]
print(clf("This is terrible."))
# [{'label': 'NEGATIVE', 'score': 0.9991}]

# Text generation
gen = pipeline("text-generation", model="gpt2")
print(gen("The future of AI is", max_new_tokens=20))

# Translation
trans = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
print(trans("Hello, how are you?"))

# Question answering
qa = pipeline("question-answering")
print(qa(question="Where do I live?", context="My name is Alice and I live in Lagos."))
# {'answer': 'Lagos', ...}

Each pipeline picks a default model, downloads it (first time), runs it. Useful for prototyping.

Browsing the Hub

huggingface.co hosts hundreds of thousands of models. Filter by task, language, license. Common model names you'll see:

  • gpt2, gpt2-medium - small classical LLMs. Good for learning.
  • microsoft/phi-3-mini-4k-instruct - small + capable + permissive license.
  • meta-llama/Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B - Meta's open weights (gated; accept license).
  • mistralai/Mistral-7B-v0.3 - open-source Mistral.
  • google/gemma-2-2b - small Gemma.
  • sentence-transformers/all-MiniLM-L6-v2 - tiny embedding model. Page 10.
  • distilbert-base-uncased - small BERT-family for classification, embedding.

Each model page on the Hub has a README with usage, license, evaluation, intended use.

Direct usage (not via pipeline)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "microsoft/phi-3-mini-4k-instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Write a haiku about garbage collection:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tok.decode(output[0], skip_special_tokens=True))

Key arguments:

  • torch_dtype=torch.bfloat16 - load weights in bfloat16 instead of float32. Halves memory; minimal quality loss.
  • device_map="auto" - automatically distribute layers across available devices (GPU + CPU fallback).
  • return_tensors="pt" - tokenizer returns PyTorch tensors.
  • skip_special_tokens=True - strip <eos>, <bos>, etc. from the output.

Chat templates

Modern chat-tuned models expect a specific message format. The tokenizer knows it:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of Nigeria?"},
]
inputs = tok.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=50)
response_only = tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response_only)

apply_chat_template formats the messages with the model's expected special tokens (<|user|>, <|assistant|>, etc.). add_generation_prompt=True adds the assistant's turn-start so the model knows it's its turn to speak.

For chat-tuned models, always use the chat template. Raw prompt-completion produces worse results.

Embedding models

For semantic search (used in page 10):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["A dog is running.", "A cat is sleeping.", "I bought milk."]
embeddings = model.encode(texts)
print(embeddings.shape)            # (3, 384) - three 384-dim vectors

# Compute similarities
import numpy as np
sim = embeddings @ embeddings.T
print(sim)        # diagonal is 1.0 (each vector with itself);
                  # off-diagonal close to 0 for unrelated texts

sentence-transformers wraps Hugging Face models and handles the "pool tokens into a sentence vector" step.
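Roughly, that pooling step averages the per-token vectors into one sentence vector. A simplified sketch of mean pooling (details vary by model; some use the CLS token or normalize afterwards):

import torch

def mean_pool(token_embeddings, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1), 1 for real tokens
    summed = (token_embeddings * mask).sum(dim=1)    # sum only the real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)         # how many real tokens per sentence
    return summed / counts                           # (batch, dim) sentence vectors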

Caching

By default, models download to ~/.cache/huggingface/. Big models (gigabytes) live here. To change:

export HF_HOME=/path/to/your/cache

To pre-download a model without using it (useful in Docker):

from huggingface_hub import snapshot_download
snapshot_download(repo_id="microsoft/phi-3-mini-4k-instruct")

Quantized models

LLMs are huge. Loading FP16 needs gigabytes; FP32 needs 2x. Quantization reduces precision further:

  • GPTQ / AWQ - 4-bit quantization, requires specific quantized weights.
  • bitsandbytes - runtime 8-bit / 4-bit quantization for any model:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", quantization_config=bnb)

8B parameters at 4-bit ≈ 4GB. Fits on consumer GPUs.

Quality drops ~1-5% on benchmarks; for many tasks, indistinguishable. Used heavily in production inference (page 12).

Exercise

  1. Run the simplest pipeline:

    from transformers import pipeline
    clf = pipeline("sentiment-analysis")
    print(clf("Containers from scratch is a good path"))
    

  2. Generate text with a small model:

    gen = pipeline("text-generation", model="gpt2")
    print(gen("Once upon a time", max_new_tokens=40)[0]["generated_text"])
    

  3. Direct model usage with the chat-template form above. Use any chat-tuned model that fits your hardware. Try several prompts.

  4. Embeddings: with sentence-transformers, encode 5 sentences (some related, some not). Compute the similarity matrix. Notice which pairs score high.

  5. (Stretch - GPU helpful) Load Llama-3.2-1B (accept license on HF first; small enough for most setups). Compare its outputs to gpt2's.

What you might wonder

"How big a model can I run?" Rule of thumb (FP16): need ~2 bytes per parameter for inference. 1B params = 2GB. 7B = 14GB. 70B = 140GB. With 4-bit quantization, ~0.5 bytes per param. 70B at 4-bit ≈ 40GB.

"Why does the model download so slowly?" HF servers throttle anonymous traffic. Authenticate (huggingface-cli login) for higher limits, especially for gated models.

"What's device_map="auto" actually doing?" Hugging Face's accelerate library partitions the model across available devices (GPU layers; CPU offload for excess). For small models, the whole thing goes on GPU. For huge models, layers spill to CPU (much slower but possible).

"Should I use safetensors or pytorch_model.bin?" Safetensors. Faster loading, safer (no arbitrary code execution risk). All modern HF models ship both.

Done

  • Use pipeline for the quickest possible model usage.
  • Use AutoTokenizer + AutoModelForCausalLM for direct control.
  • Apply chat templates for chat-tuned models.
  • Use sentence-transformers for embeddings.
  • Load quantized models for memory efficiency.

Next: Fine-tuning →

09 - Fine-Tuning

What this session is

About 90 minutes. Adapt a pretrained model to your data. Modern parameter-efficient fine-tuning (LoRA) - feasible on consumer GPUs. By the end you'll have fine-tuned a small LLM on a custom dataset.

This page benefits from a GPU. CPU works but is very slow.

The two modes

  • Full fine-tuning - update all model weights. Best quality; needs massive memory (gradients plus Adam optimizer states add roughly 8-16 bytes per parameter on top of the weights, so a 7B model needs on the order of 100GB of GPU memory). Beyond most beginners' budgets.
  • Parameter-efficient fine-tuning (PEFT) - update only a tiny subset of new parameters. LoRA is the most-used. Comparable quality for many tasks while training well under 1% of the parameters. Runs on a single consumer GPU.

We'll do LoRA.

What LoRA actually does

Each big linear layer in a transformer (the nn.Linear from page 04, scaled up) is a matrix W. Instead of updating W directly, LoRA learns a low-rank update:

W_new = W + (A @ B)

Where A is (d, r) and B is (r, d). The rank r is small (typically 8-64). The original W stays frozen; only A and B train.

Memory savings: instead of d × d parameters per layer (millions), you train d × r + r × d (tens of thousands). For a 7B model with rank-16 LoRA: ~10M trainable parameters instead of 7B.

You don't implement this - peft library handles it.
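If you're curious what the math looks like in code, here's a purely illustrative sketch (not peft's implementation; the class and variable names are made up):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # W stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, d_out))  # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale   # W x + scaled low-rank update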

Setup

pip install transformers datasets accelerate peft trl bitsandbytes

  • peft - parameter-efficient fine-tuning (LoRA, QLoRA, prefix tuning, etc.).
  • trl - Transformers Reinforcement Learning. Includes SFTTrainer, the easiest fine-tuning loop wrapper.
  • bitsandbytes - 4-bit quantization, used by QLoRA.

A complete LoRA fine-tuning

We'll fine-tune a small model on a tiny dataset to make it answer in a specific style.

import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

model_name = "microsoft/Phi-3-mini-4k-instruct"
# Quantized to 4-bit
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA config
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: 6.3M || all params: 3.8B || trainable: 0.16%

# A tiny training dataset (in production: load real data with `datasets`)
examples = [
    {"text": "<|user|>\nWhat's 2+2?<|end|>\n<|assistant|>\nIt's 4, mate.<|end|>"},
    {"text": "<|user|>\nHello!<|end|>\n<|assistant|>\nG'day!<|end|>"},
    {"text": "<|user|>\nWhat's your favorite color?<|end|>\n<|assistant|>\nProbably blue, mate.<|end|>"},
    # ... in a real run, hundreds to thousands of examples ...
] * 50

train_ds = Dataset.from_list(examples)

# Training config
cfg = SFTConfig(
    output_dir="./lora-out",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_seq_length=512,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=train_ds,
    args=cfg,
    dataset_text_field="text",
)

trainer.train()
trainer.save_model("./lora-out/final")

The whole thing - model load, LoRA setup, training loop - fits in ~50 lines. SFTTrainer from trl wraps Hugging Face's Trainer with sensible defaults.

Run time: ~5-15 minutes on a free Colab T4 GPU for the small dataset above.

Use the fine-tuned model

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", device_map="auto", torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = PeftModel.from_pretrained(base, "./lora-out/final")
model.eval()

inputs = tok("<|user|>\nHi there!<|end|>\n<|assistant|>\n", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
# Hopefully responds in the trained style ("G'day mate!")

The fine-tuned model = base model + LoRA adapter. The adapter is small (~30MB for our config); the base is shared.

Merge LoRA into the base (for deployment)

For inference at scale, you may want a single merged model:

merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
tok.save_pretrained("./merged-model")

Result: a standalone model with LoRA's updates baked in. Drops the adapter layer overhead at inference time.

Hyperparameter notes

  • r (LoRA rank) - 8, 16, 32, 64. Higher = more capacity, more memory. Start at 16.
  • lora_alpha - usually 2× r. Acts as a scaling factor.
  • target_modules - which linear layers to LoRA-fy. Common: ["q_proj", "v_proj"] for cheap, ["q_proj", "k_proj", "v_proj", "o_proj"] for fuller coverage. Model-specific naming.
  • learning_rate - much higher than full fine-tuning (because you have fewer params). 1e-4 to 5e-4 typical.
  • per_device_train_batch_size + gradient_accumulation_steps - effective batch is the product. Small batch fits memory; accumulation simulates a bigger batch.

These need experimentation. Start with the defaults above; adjust.

QLoRA - even smaller memory

The BitsAndBytesConfig(load_in_4bit=True, ...) we used is QLoRA - quantize the base model to 4-bit, train LoRA adapters on top in higher precision. Lets you fine-tune 7B models on a 12GB GPU. The standard approach for hobbyist fine-tuning.

What you can / can't fine-tune

LoRA fine-tuning is great for:

  • Style adaptation - "respond in our brand's voice."
  • Domain-specific Q&A - train on your support docs.
  • Output format - JSON conformance, structured outputs.
  • Tool / function calling - train the model to emit specific function calls.

LoRA is bad for:

  • Teaching the model NEW factual knowledge. That requires more data + full fine-tuning, and the model often half-learns and hallucinates the rest. For facts, use RAG (page 10).
  • Reasoning skill upgrades. Generally requires lots of data + more compute than LoRA gives.

Dataset format

Most fine-tuning recipes want a list of conversation strings in the model's chat format. Building one:

  1. Collect 50-1000+ example interactions in the desired style.
  2. Format each as a single string using the model's chat template.
  3. Wrap in a datasets.Dataset.

Real datasets often live on Hugging Face Hub - datasets.load_dataset("squad") etc. Filter / format as needed.
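A minimal sketch of steps 1-3, assuming tok is the tokenizer from the fine-tuning example above (the pairs are made up; using the chat template avoids hand-writing special tokens):

from datasets import Dataset

pairs = [
    {"q": "What's 2+2?", "a": "It's 4, mate."},
    {"q": "Hello!", "a": "G'day!"},
]

def to_text(p):
    messages = [
        {"role": "user", "content": p["q"]},
        {"role": "assistant", "content": p["a"]},
    ]
    return {"text": tok.apply_chat_template(messages, tokenize=False)}

train_ds = Dataset.from_list([to_text(p) for p in pairs])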

Exercise

You need a GPU (or Colab) for this exercise. CPU works but takes hours.

  1. Run the example above. Train for 1 epoch on the toy dataset. Confirm training loss decreases.

  2. Use the trained model. Run a few prompts. Notice the style.

  3. Increase the dataset size. Add 10 more diverse examples. Re-train. Compare outputs.

  4. Tweak r: try r=8 vs r=64. Quality difference? Memory difference?

  5. (Stretch) Use a real dataset from Hugging Face: datasets.load_dataset("squad", split="train[:1000]"). Format the QA pairs into the chat template. Fine-tune. Evaluate by hand.

What you might wonder

"Why is my fine-tuned model worse than the base?" Common causes: dataset too small (under ~100 examples), learning rate too high (model overfits and forgets general knowledge), bad data formatting (model is learning your formatting bugs not your style). Start with a known-good recipe and iterate.

"What's 'catastrophic forgetting'?" The fine-tuned model loses knowledge from its base training. Severe with full fine-tuning; minimal with LoRA (the base weights are frozen). One reason LoRA is the default.

"How do I evaluate the fine-tuned model?" Page 11. Critical and the hardest part of ML.

"DPO? RLHF? PPO? GRPO?" Reinforcement-learning-from-feedback techniques used by frontier labs to align chat models. Beyond beginner; mentioned for awareness.

Done

  • Distinguish full fine-tuning from LoRA / PEFT.
  • Set up peft + trl for a real LoRA training run.
  • Train and save a fine-tuned model.
  • Load and use the trained adapter.
  • Pick reasonable hyperparameters.

Next: Retrieval-Augmented Generation →

10 - Retrieval-Augmented Generation (RAG)

What this session is

About an hour. RAG is the dominant production pattern for LLMs answering questions over your data. Instead of fine-tuning facts in, you retrieve relevant passages at query time and pass them to the model.

Why RAG, not fine-tuning, for facts

Fine-tuning teaches a model patterns, styles, formats. Asking it to memorize facts works poorly: knowledge degrades, the model hallucinates "knowing" things, no clean way to update when facts change.

RAG separates concerns: the LLM is the language interface; the knowledge is in a database. Update facts by updating the database - no retraining.

The architecture

User question
  → embed the question into a vector
  → search a vector DB for similar passages → top-k passages
  → build a prompt: question + retrieved passages
  → LLM generates the answer using both

Five components:

  1. Documents - your knowledge corpus (docs, PDFs, wiki).
  2. Chunker - splits docs into ~200-1000 token passages.
  3. Embedder - a model that turns text into vectors.
  4. Vector store - stores passages + their embeddings; supports nearest-neighbor search.
  5. LLM - generates the final answer.

A complete (minimal) RAG

from sentence_transformers import SentenceTransformer
import numpy as np
import torch
from transformers import pipeline

# 1. Documents - a tiny corpus
docs = [
    "Lagos is the most populous city in Nigeria.",
    "Abuja is the capital of Nigeria.",
    "The Niger River flows through Mali, Niger, and Nigeria.",
    "Python was created by Guido van Rossum in 1991.",
    "Rust was first released in 2010 by Mozilla.",
    "Go was designed at Google starting in 2007.",
]

# 2. Embed all documents (one-time index-build)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)
# shape: (6, 384)

# 3. Search function
def retrieve(query, k=2):
    q_emb = embedder.encode([query], normalize_embeddings=True)
    sims = (doc_embeddings @ q_emb.T).flatten()        # cosine sim because normalized
    topk = np.argsort(-sims)[:k]
    return [docs[i] for i in topk]

# 4. Generate with retrieved context
gen = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct",
               torch_dtype=torch.bfloat16, device_map="auto")

def answer(question):
    context = "\n".join(f"- {p}" for p in retrieve(question))
    prompt = f"""<|user|>
Use the following context to answer the question.

Context:
{context}

Question: {question}<|end|>
<|assistant|>
"""
    out = gen(prompt, max_new_tokens=100, do_sample=False)[0]["generated_text"]
    return out[len(prompt):]    # strip the prompt

print(answer("What is the capital of Nigeria?"))
print(answer("Who created Python?"))

That's a working RAG in ~30 lines. The model answers using the retrieved context, not just its baked-in knowledge.

For real production you'd swap in a proper vector DB (next section); the LLM call stays the same.

Vector databases

For 100 documents, a NumPy dot product is fine. For 1M+ documents, you need a vector database with efficient approximate nearest neighbor search.

Self-hosted:

  • FAISS (Facebook) - library, in-process. Fast. No persistence layer; you build that.
  • Chroma - embedded, easy to start.
  • Qdrant - server-mode, production-grade.
  • Weaviate - feature-rich, server-mode.
  • Milvus - distributed, for very large scale.

Hosted:

  • Pinecone - first popular hosted vector DB.
  • Cloud-native options: AWS OpenSearch with k-NN, Postgres + pgvector, Redis with vector search.

For learning: Chroma or FAISS. For production: depends on scale and existing infra.

A Chroma example:

import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")

collection.add(
    documents=docs,
    embeddings=doc_embeddings.tolist(),
    ids=[f"doc-{i}" for i in range(len(docs))],
)

results = collection.query(
    query_embeddings=embedder.encode(["What is Lagos?"]).tolist(),
    n_results=2,
)
print(results["documents"])

Chunking strategies

Long documents must be split. Naive: split every N characters. Better:

  • Fixed-size with overlap (e.g., 500 chars, 50-char overlap to preserve context across boundaries).
  • Semantic chunks (paragraphs, headings).
  • Recursive chunking - try paragraphs first, fall back to sentences, fall back to words.

langchain.text_splitter.RecursiveCharacterTextSplitter is the popular tool. Try a few; the best chunking is task-dependent.
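A minimal fixed-size-with-overlap chunker, just to show the idea (real splitters like RecursiveCharacterTextSplitter also respect paragraph and sentence boundaries):

def chunk(text, size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])   # take a window
        start += size - overlap                   # step forward, keeping some overlap
    return chunks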

Embedding choices

Bigger embedding model = better retrieval, slower to embed, larger vectors.

Model                              Dim    Speed      Quality
all-MiniLM-L6-v2                   384    very fast  decent
all-mpnet-base-v2                  768    medium     good
BAAI/bge-base-en-v1.5              768    medium     excellent
BAAI/bge-large-en-v1.5             1024   slow       best
text-embedding-3-small (OpenAI)    1536   API        excellent
nomic-embed-text-v1.5              768    medium     excellent, open source

For learning: all-MiniLM-L6-v2. For production: BGE or Nomic embed are strong open options.

Quality knobs

Things that matter, in order of impact:

  1. Chunking strategy. Bad chunks = bad retrieval. Tune first.
  2. Number of retrieved chunks (k). 3-10 typical. Too few = miss relevant info. Too many = context bloat, "lost in the middle."
  3. Re-ranking. Retrieve k=20, then re-rank with a more expensive model down to top-5 (sketched after this list). Improves quality at modest cost.
  4. Hybrid search. Combine semantic (vector) with keyword (BM25). Catches cases where exact word match matters.
  5. Query rewriting. LLM rewrites the user's question into a better search query.
  6. Embedding model. Better embeddings = better retrieval. Worth experimenting.

For a beginner, just fixed-size chunks + top-3 semantic retrieval is a strong baseline.
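As promised, a sketch of knob 3 (re-ranking), assuming the retrieve() function from the minimal RAG above; the cross-encoder model name is one common choice, not a requirement:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query, k_retrieve=20, k_final=5):
    candidates = retrieve(query, k=k_retrieve)                     # cheap, broad retrieval
    scores = reranker.predict([(query, c) for c in candidates])    # expensive, precise scoring
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [c for _, c in ranked[:k_final]]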

When RAG fails

  • User asks a question whose answer requires synthesis across many docs. RAG retrieves each of the top-k passages independently, so cross-document synthesis fails.
  • Question is ambiguous. Retrieval gets the wrong passage; answer is confidently wrong.
  • The corpus genuinely doesn't contain the answer. The LLM hallucinates because the user expects an answer.

Mitigations: explicit "I don't know" in the prompt; structured outputs that include source citations; user-facing transparency about what was retrieved.

Frameworks

Real RAG apps often use:

  • LangChain - most popular framework. Composable chains for retrieval + generation.
  • LlamaIndex - alternative, more retrieval-focused.
  • Haystack - pipeline-oriented, German-engineered.

These wrap the patterns above with batteries included. For learning, building from scratch (like this page) makes the mechanics clear; for production, frameworks save time.

Exercise

  1. Run the minimal RAG above. Confirm the answers use the retrieved context.

  2. Expand the corpus: add 20 more facts. Try a question that's ambiguous between two retrieved docs - see how the model handles it.

  3. Different embedder: swap all-MiniLM-L6-v2 for BAAI/bge-base-en-v1.5. Larger model; do retrieval results improve for tricky questions?

  4. Chunking exercise: download a Markdown doc (your README.md or any project's). Use RecursiveCharacterTextSplitter from langchain to chunk it. Index the chunks. Ask questions.

  5. (Stretch) Try Chroma instead of in-memory NumPy. Same RAG flow with persistent index.

What you might wonder

"What if the LLM ignores the retrieved context?" Happens. Make the prompt clearer: "Answer ONLY using the context above. If the context doesn't contain the answer, say 'I don't know.'" Smaller models ignore instructions more; bigger ones follow.

"Should I do RAG or fine-tuning?" Both, often. RAG for facts; fine-tuning for style + format. Don't pit them against each other.

"What's a 'retriever' vs an 'embedder'?" An embedder produces vectors. A retriever uses the embedder + a vector DB + post-processing to return passages. Same pipeline, different name for different layers.

"How do I evaluate a RAG?" Next page. Hardest part.

Done

  • Build a RAG pipeline end-to-end with embeddings + vector search + LLM.
  • Distinguish from fine-tuning (facts vs style).
  • Recognize vector DB options.
  • Apply basic quality knobs (chunking, k, re-ranking, hybrid search).
  • Know LangChain / LlamaIndex / Haystack exist.

Next: Evaluation →

11 - Evaluation

What this session is

About an hour. The hardest part of building with AI - and the one most beginner tutorials skip. By the end you'll know why "looks good" is not evaluation, and how to do it for real.

Why this page matters

Most "AI products" you'll see are evaluated by their authors clicking around and saying "yep, looks good." That's how launched products go viral with embarrassing failures the moment a user does something unexpected.

Good evaluation is what separates a demo from a system. Most engineers - even experienced ML practitioners - get this wrong. Take this page seriously.

The fundamental rule

You cannot iterate on what you cannot measure.

Without an objective evaluation, every change is a coin flip. Did this prompt change improve things or make them worse? You can't tell. Without a number, you'll convince yourself it's better - because you wrote it.

A measurable eval lets you see real improvement, A/B test prompts, catch regressions when you change models, ship confidently.

Types of evaluation

Different problems need different evals.

Classification - easy

If your output is a class (positive/negative, A/B/C, 0-9):

  • Accuracy - fraction correct.
  • Precision - of predicted positives, what fraction are actually positive.
  • Recall - of actual positives, what fraction did you find.
  • F1 - harmonic mean of precision and recall.
  • Confusion matrix - full breakdown of predicted vs actual.

scikit-learn:

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

Done. Easy.

Free-form text generation - hard

If your output is a paragraph of text (LLM chatbot, summarizer):

  • Exact match - useless unless you're matching against a fixed answer.
  • BLEU, ROUGE, METEOR - n-gram overlap with a reference. Useful for translation; poor for chat (paraphrase = bad score).
  • Embedding similarity - cosine similarity between generated and reference embedding. Better than n-gram.
  • LLM-as-judge - use a strong model to grade outputs. Most-used in practice, with caveats below.
  • Human eval - gold standard, expensive.

For chat / RAG / summarization, LLM-as-judge is the practical default.

Retrieval - medium

If your problem is "did I retrieve the right passages":

  • Recall@K - of all relevant passages, how many appear in your top-K results.
  • MRR (Mean Reciprocal Rank) - average of 1 / rank-of-first-relevant.
  • nDCG - normalized discounted cumulative gain; rewards rankings that put relevant passages near the top.

You need a labeled dataset: each query has a known correct passage. Build this manually for ~100-1000 queries.
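A minimal sketch of Recall@K and MRR, assuming you've collected each query's ranked list of retrieved passage ids plus the single known-correct id (variable names are made up):

def recall_at_k(results, relevant, k=5):
    # results: one ranked id list per query; relevant: the correct id per query
    hits = sum(1 for ranked, gold in zip(results, relevant) if gold in ranked[:k])
    return hits / len(relevant)

def mrr(results, relevant):
    total = 0.0
    for ranked, gold in zip(results, relevant):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)   # 1 / rank of the first relevant hit
    return total / len(relevant)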

LLM-as-judge

You have outputs from your system. You want to know "are these good?"

from openai import OpenAI            # or any LLM client

client = OpenAI()                    # assumes OPENAI_API_KEY is set in your environment

judge_prompt = """You are grading an AI assistant's answer.

Question: {question}
Expected answer: {gold}
AI answer: {generated}

Grade the AI answer on:
- Correctness (0-5): does it match the expected answer in substance?
- Completeness (0-5): does it cover the key points?
- Conciseness (0-5): is it free of fluff?

Respond ONLY with JSON: {{"correctness": N, "completeness": N, "conciseness": N, "rationale": "..."}}
"""

def grade(question, gold, generated):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt.format(
            question=question, gold=gold, generated=generated
        )}],
    )
    import json
    return json.loads(response.choices[0].message.content)

Run your system against a held-out dataset of (question, gold-answer) pairs. Have the judge grade each. Aggregate the scores.

Caveats:

  • Use a more capable model for judging than for generating. Don't have GPT-3.5 grade GPT-3.5; have GPT-4 or Claude do it.
  • Judge bias. LLM judges have biases (preferring longer answers, preferring their own family's models). Keep judge prompts narrow and spot-check the grades.
  • Calibration. Run human-judged grades on a subset; check the LLM judge agrees with humans.
  • Pairwise > absolute. "Which of A and B is better" judgments are more stable than absolute 1-5 scores.

Even with caveats, LLM-as-judge is far better than "looks good to me."

Build an evaluation dataset

For LLM apps, this is the work you'll spend the most time on. Patterns:

  1. Production traces. Sample real user queries from your service logs. Manually label expected answers. ~100-1000 examples.
  2. Adversarial cases. Specifically construct queries that should fail or should succeed. Boundary cases, ambiguous queries, out-of-scope queries.
  3. Public benchmarks. MMLU (multitask), TruthfulQA, HumanEval (coding), GSM8K (math). Useful for "how does my model compare," less useful for "is my prompt better."

A good eval dataset is representative + adversarial + maintained. Production examples + manually-curated edge cases. Refresh as your product evolves.

A real workflow

Pattern that works:

  1. Build a small golden dataset. ~50-200 examples to start.
  2. Run your current system. Score with LLM-as-judge. Get a baseline number.
  3. Make a change (new prompt, new model, new retrieval).
  4. Re-run. Compare. If the number went up materially, ship; if it went down, revert; if it's noise, you didn't change enough.
  5. Expand the eval set as you discover failure modes in production.

This loop is the whole job. Every successful AI product team runs some version of it. Every failed one didn't.

Cost and latency are evaluation criteria

A model that's 5% better but 10× slower might be worse for your product. Track both quality and ops costs:

  • Per-request cost (tokens × model price).
  • p50, p95 latency.
  • Throughput (requests per second).

A useful "is the next model worth it" question: "for every 1% quality improvement, how much do cost/latency change?"

Bias, fairness, safety

Big and important; out of scope for a beginner page. The minimum:

  • Test on diverse inputs. Different demographics, languages, dialects, edge cases.
  • Test refusal behavior. Does it refuse harmful requests? Does it over-refuse benign ones?
  • Test on adversarial prompts (prompt injection).

Production teams have dedicated red-teamers. For your first project, manual sampling is fine.

Specific tools

  • evaluate (Hugging Face) - eval-metric library. Bundles many standard metrics.
  • langsmith (LangChain Labs) - tracing + evaluation platform.
  • promptfoo - open-source eval CLI for LLM prompts.
  • ragas - RAG-specific evaluation metrics (faithfulness, context relevancy).
  • lm-eval-harness (EleutherAI) - runs many academic benchmarks.

For learning: roll your own (the snippet above). For scale: pick one tool.

Exercise

  1. Build a tiny eval dataset. Use the RAG from page 10. Write 10 (question, expected-answer) pairs covering your corpus.

  2. Run the RAG. Score each output manually with a 1-5 score on correctness. Average the scores; that's your baseline.

  3. Make a change. Change the prompt, or k=2 → k=4, or use a bigger embedder. Re-score. Did it improve?

  4. Add LLM-as-judge. Have the model itself score the outputs against expected answers. Compare to your manual scores. How well do they agree?

  5. (Stretch) Use the ragas library on your RAG. Run its faithfulness + answer_relevancy metrics.

What you might wonder

"How big does my eval set need to be?" 50 is the minimum for noisy signal. 500+ is comfortable. 5000+ for academic-paper-strength results. For getting started, start at 50 and expand.

"Can I trust LLM-as-judge?" Mostly. Pair with human spot-checks (10% of examples reviewed by you). When LLM-as-judge says scores went up but you can't see the improvement, something's miscalibrated.

"What about RLHF / DPO / online evaluation?" Real production AI products have ongoing eval pipelines collecting user feedback, A/B testing changes, fine-tuning on preference data. Beyond beginner; mentioned for awareness.

"How does this compare to evaluating a 'normal' classifier?" Classifiers have ground truth + simple metrics. LLM outputs have ambiguity at every step - there's no single correct answer to "summarize this article." Evaluation gets correspondingly fuzzy. The discipline is the same; the metrics are softer.

Done

  • Recognize different eval types (classification, generation, retrieval).
  • Build an evaluation dataset.
  • Use LLM-as-judge correctly (with calibration awareness).
  • Run the build → measure → change → measure loop.
  • Track cost + latency alongside quality.

Next: Serving models →

12 - Serving Models

What this session is

About 45 minutes. How to expose your model as a service users can call. Local options (Ollama, llama.cpp), high-performance serving (vLLM), and rolling your own HTTP API.

The simplest possible serve: Ollama

Ollama is the easiest way to run an LLM locally.

# Install (macOS):
brew install ollama
ollama serve &

# Pull and run a model:
ollama pull llama3.2:3b
ollama run llama3.2:3b

You get an interactive chat. To use programmatically:

import httpx
r = httpx.post("http://localhost:11434/api/chat", json={
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
})
print(r.json()["message"]["content"])

Ollama handles the model loading, quantization, GPU detection. For local dev, it's the easiest start.

llama.cpp - for tighter control

llama.cpp is a C++ inference engine that runs GGUF-quantized models on CPU + GPU. Lower-level than Ollama; faster; more configurable.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Download a quantized model
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir .

# Run
./llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello!" -n 100

Or serve as HTTP:

./llama-server -m model.gguf -c 4096
# OpenAI-compatible API on http://localhost:8080

Many projects use llama.cpp under the hood (Ollama, LM Studio, etc.).

vLLM - high-throughput serving

For production-grade serving of large models with high concurrency: vLLM.

pip install vllm

Run a server:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 1

OpenAI-compatible API:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
r = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(r.choices[0].message.content)

vLLM's killer features:

  • PagedAttention - KV-cache management like virtual memory. Way better GPU utilization than naive serving.
  • Continuous batching - interleaves requests at the token level. Many concurrent users; high throughput.
  • OpenAI-compatible API - drop-in replacement for OpenAI client libraries.

Used by many production LLM deployments. Needs a GPU; doesn't run on CPU usefully.

A simple HTTP wrapper around your own model

If you want full control:

# server.py
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()


class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 100
    temperature: float = 0.7


@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            temperature=req.temperature,
            do_sample=True,
        )
    text = tok.decode(out[0], skip_special_tokens=True)
    return {"text": text[len(req.prompt):]}

Run: uvicorn server:app --host 0.0.0.0 --port 8080. Test:

curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_new_tokens": 30}'

For a small model and a few users, this works fine. For high concurrency, use vLLM.

Streaming responses

Users want to see tokens as they generate, not wait for the whole response. Implementation:

from fastapi.responses import StreamingResponse
from threading import Thread
from transformers import TextIteratorStreamer

@app.post("/stream")
def stream(req: GenerateRequest):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    Thread(target=model.generate, kwargs={
        **inputs, "max_new_tokens": req.max_new_tokens,
        "streamer": streamer, "do_sample": True, "temperature": req.temperature,
    }).start()

    def gen():
        for token in streamer:
            yield token
    return StreamingResponse(gen(), media_type="text/plain")

For production: use vLLM (streaming is built-in) rather than rolling your own threading.

Containerize for deployment

A Dockerfile for the FastAPI server:

FROM python:3.12-slim
WORKDIR /app

# Installs the default PyTorch wheel; for GPU serving, match the wheel / index URL to your CUDA version
RUN pip install --no-cache-dir torch fastapi uvicorn pydantic transformers

COPY server.py .
EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]

Build, push, run on Kubernetes (Containers + Kubernetes paths).

Deployment concerns

  • GPU scheduling. Kubernetes can schedule pods to GPU nodes (nvidia.com/gpu: 1 in resources). NVIDIA's GPU Operator manages drivers.
  • Cold start. Loading a 7B model takes 10-30 seconds. Don't scale to zero unless cold start is acceptable.
  • Model caching. Embed the model weights in the container image (huge), or mount as a PV (faster restarts).
  • Autoscaling. GPU pods are expensive. Scale based on request queue depth or GPU utilization, not CPU.
  • Observability. Latency per request, tokens/sec, GPU memory, queue depth.

Cost / latency calibration

Rough numbers for a single A100 GPU serving Llama-3.1-8B (FP16):

  • ~30-80 tokens/sec generation rate.
  • ~16-20 GB GPU memory (the FP16 weights alone are ~16 GB, plus KV cache).
  • A few concurrent users; vLLM bumps this to dozens.

Quantized (4-bit) on a single consumer 24GB GPU:

  • ~40-100 tokens/sec.
  • Single users comfortable; concurrency lower.

For frontier models (70B+), you need multi-GPU or sharded serving. Beyond beginner.

OpenAI compatibility is a contract

Many tools (LangChain, llama-index, CLI tools) speak the OpenAI HTTP API. vLLM, llama.cpp's server, Ollama (with its /v1/... endpoints) all implement it. Building against the OpenAI API makes you portable across self-hosted and hosted backends.

from openai import OpenAI
# Same code works for openai.com, vLLM, Ollama, llama.cpp:
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

Exercise

  1. Install Ollama. Pull a small model. Chat.
  2. Call it from Python via its HTTP API.
  3. Build a tiny FastAPI server that wraps a small HF model (page 04's MLP works, or a Phi-3-mini for fun). Curl it.
  4. (Stretch - GPU helpful) Install vLLM. Serve microsoft/Phi-3-mini-4k-instruct. Use the OpenAI client to call it.
  5. (Stretch) Containerize your FastAPI server. Build the image. Run via docker run.

What you might wonder

"Should I serve via my own framework or vLLM?" For real production: vLLM (or TGI, Triton). The hand-rolled FastAPI version works fine for a hobby project but doesn't handle concurrency well.

"How do I keep the model warm?" Don't scale to zero. Have at least one replica always running. Health checks must respond fast (≤1s) without invoking the model.

"GPU memory keeps growing - what?" KV-cache (page 07) accumulates as context grows. Limit max_total_tokens in vLLM or chunk old context out.

"Open source vs API providers?" Both have a place. OpenAI/Anthropic/Google APIs are easy and powerful; you pay per token. Self-hosting is cheaper at scale but adds ops complexity. Most production teams use both - APIs for high-quality requests, self-hosted for high-volume cheaper requests.

Done

  • Run an LLM locally with Ollama or llama.cpp.
  • Serve high-throughput with vLLM.
  • Build a custom FastAPI wrapper.
  • Stream responses.
  • Containerize for deployment.

Next: Picking a project →

13 - Picking a Project

What this session is

About 30 minutes plus browsing. AI OSS that accepts first contributions, with specific candidates.

What kinds of AI projects fit beginners

Your toolkit so far: PyTorch, Hugging Face, RAG, evaluation, serving. Good targets:

  • Inference engines & serving (vLLM, llama.cpp, Ollama) - high-quality issue tickets across many difficulty levels.
  • Tokenization / data tooling (HuggingFace tokenizers, datasets).
  • Embedding / RAG libraries (sentence-transformers, llama-index, langchain).
  • Evaluation tools (lm-eval-harness, ragas, promptfoo).
  • Adjacent ML tools (numpy, scipy, scikit-learn).
  • Documentation - every major project has doc-improvement work.

For more research-y work (PyTorch core, training algorithms, model architectures), you'll need deeper expertise. Build on this path first.

10-minute evaluation

Same standard as other beginner paths:

Signal                        Target
Stars                         100-50000
Last commit                   Within a month
Open PRs                      Some, not 300+
Recent PR merge time          Under 14 days
good first issue count        ≥5
CONTRIBUTING.md               yes, readable
Tests pass on fresh clone     yes

Candidates

Tier 1 - friendly, smaller scope

  • huggingface/transformers - yes the big one, BUT they have excellent issue triage. Many docs/examples PRs. Look at good first issue.
  • langchain-ai/langchain - chained LLM workflows. Large but very welcoming; tons of easy-mode integrations to add.
  • run-llama/llama_index - RAG-focused alternative to LangChain.
  • promptfoo/promptfoo - eval tool. Small enough to be approachable; very active.
  • huggingface/tokenizers - tokenizer library. Rust core + Python bindings.

Tier 2 - well-organized

  • vllm-project/vllm - production inference serving. Issues exist at all levels.
  • huggingface/peft - LoRA + friends. Smaller surface; active.
  • huggingface/datasets - data loading. Adding a new dataset adapter is a common first contribution.
  • sentence-transformers/sentence-transformers - embeddings library.
  • unslothai/unsloth - fast fine-tuning. Welcoming.

Tier 3 - bigger, more visible

After 1-2 PRs.

  • pytorch/pytorch - yes eventually. Excellent labels; SIG structure; well-shepherded contributors.
  • huggingface/transformers larger contributions (new model architectures, etc.).
  • triton-lang/triton - GPU programming DSL. Needs Triton + CUDA understanding.

Tier 4 - don't start here

  • Foundation model labs (OpenAI, Anthropic, DeepMind) - closed source.
  • PyTorch internals (autograd, distributed) - deep specialty.

Finding issues

Project's Issues tab. Filter by good first issue / documentation / help wanted.

Many AI projects label specific kinds of work:

  • enhancement: docs
  • models: add
  • integration: add
  • bug: confirmed

Read 5-10 issues; find one with clear repro and contained fix. Comment to claim; wait for maintainer.

What counts

For AI OSS work:

  • Fix a typo in a model's documentation.
  • Add a missing example in a tutorial notebook.
  • Fix a quantization bug for a specific GPU.
  • Add a new embedding model adapter.
  • Improve an evaluation metric's implementation.
  • Add a missing integration to a chain framework.
  • Add support for a new fine-tuning recipe.
  • Translate documentation.

All real. All count.

Specific recommendation: huggingface/transformers docs

For an easy first PR: open transformers/docs/source/en/ in a clone, find a doc page that's missing an example or has a confusing sentence. Submit a fix. The HF team is responsive and welcoming; you'll hear back within days. PR merges fast.

Exercise

  1. Browse three projects from Tier 1-2.
  2. Run the 10-minute eval on each.
  3. Pick the most responsive.
  4. Read CONTRIBUTING.md.
  5. Clone, install, run their tests:
    git clone https://github.com/<owner>/<repo>
    cd <repo>
    pip install -e .[dev]              # or pip install -r dev-requirements.txt
    pytest               # or whatever they use
    
  6. Browse open issues. Pick two candidates. Don't claim yet.

What you might wonder

"I want to contribute to PyTorch core. Can I?" Yes, eventually. Start with their docs/ work or with the smaller modules first. The bar is real but lower than the kernel project.

"I'm scared of touching ML papers' reference implementations." Don't be. Start with documentation. Reference implementations of papers (DeepMind's work, etc.) often have terse README, scattered hyperparameters, missing examples. Doc PRs are welcomed.

"What about Anthropic / OpenAI / Google work?" Their research models are mostly closed-source. Their tools (Anthropic's claudette library, OpenAI's cookbook) are public and accept PRs.

Done

  • Recognize AI-OSS contribution shapes.
  • Run a 10-minute evaluation.
  • Have specific projects in mind.

Next: Anatomy of an AI OSS project →

14 - Anatomy of an AI OSS Project

What this session is

Read a real AI OSS repo top to bottom so the next one feels familiar.

Case study: huggingface/peft

PEFT (Parameter-Efficient Fine-Tuning) implements LoRA, QLoRA, IA3, prefix-tuning, etc. Small enough to read in a sitting; well-maintained.

git clone https://github.com/huggingface/peft
cd peft
ls

Typical top level:

README.md
CONTRIBUTING.md
LICENSE
setup.py / pyproject.toml
src/peft/                # library code
tests/
examples/
docs/
.github/workflows/

What to read, in order

1. README.md (5 min)

What the project is. Quickstart example. Supported methods.

2. CONTRIBUTING.md (5 min)

How to set up the dev environment. Code style. Tests. PR rules.

3. setup.py / pyproject.toml (2 min)

Dependencies. Optional extras. Python version.

4. src/peft/ (15 min)

The package itself:

src/peft/
├── __init__.py             # public API
├── peft_model.py           # main PeftModel class
├── config.py               # config classes
├── tuners/
│   ├── lora.py             # LoRA implementation
│   ├── ia3.py
│   ├── prefix_tuning.py
│   └── ...
└── utils/

Read __init__.py first - it shows the public API surface. Then read lora.py - LoRA is the most-used technique and the one you already understand from page 09.

5. tests/ (10 min)

Pick test_lora.py (or similar). See how the team validates that LoRA still works across model architectures.

6. examples/ (10 min)

Working notebooks. Reproducible end-to-end runs.

7. .github/workflows/ (5 min)

tests.yml - runs pytest matrix. build_docs.yml - builds docs. release.yml - pushes to PyPI.

CI is the spec. What it runs, your PR must pass.

What to look for

  • Where does data flow? For a training-time library: model → tuner wrapper → optimizer → save. For RAG: query → embed → search → context → LLM → response.
  • Where's the public API? Usually __init__.py or a models.py / api.py.
  • Where are model architectures? Usually models/ or per-architecture files.
  • Where are tests? tests/. Match each test to a code file.
  • What's "magic"? Decorators that register models (@register_model), config classes that auto-load. Read the registration logic once.

Common AI-project patterns

  • Registry pattern. Models, tuners, integrations registered by string. New addition = add to registry + implement interface (sketched below).
  • Hub integration. Models loaded from from_pretrained("model-id"). Look for _load_pretrained_model or similar.
  • Configuration as a dataclass. @dataclass class FooConfig. Serializes to JSON for reproducibility.
  • Mixed-precision and device handling. with torch.cuda.amp.autocast(): blocks; model.to(device).
  • Pipeline abstraction. High-level wrapper over tokenizer + model + generation logic.

Once you see these in one project, you see them everywhere.
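A toy sketch of the registry pattern (all names made up; real projects wire this into their from_pretrained-style loading):

TUNER_REGISTRY = {}

def register_tuner(name):
    def wrap(cls):
        TUNER_REGISTRY[name] = cls        # new technique = one decorator + one class
        return cls
    return wrap

@register_tuner("lora")
class LoraTuner:
    ...

tuner_cls = TUNER_REGISTRY["lora"]        # looked up by string, like config-driven loading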

Reading the test suite

Tests document expected behavior. For peft:

  • test_lora.py::test_lora_save_load - round-trip preservation.
  • test_lora.py::test_lora_target_modules - which layers get adapters.
  • test_lora.py::test_lora_merge - merging LoRA back into base weights.

Each test names the contract. To break the test is to break the contract.

Counter-example: pytorch/pytorch

Several million lines. C++/CUDA/Python. Build system alone is a project. Don't read top-to-bottom. Instead, find a specific module (torch/optim/, torch/utils/data/) and read just that.

Counter-example: langchain-ai/langchain

Monorepo with ~100 packages. Hundreds of integrations. Don't read top-to-bottom. Pick one integration package (e.g., libs/community/langchain_community/llms/anthropic.py) and read just that.

Exercise

  1. Clone huggingface/peft.
  2. Spend 45 minutes reading per the order above.
  3. After, explain to yourself, out loud:
       • What does this project do?
       • What's the public API?
       • Where would a new technique (e.g., a new LoRA variant) be added?
       • How is it tested?
  4. Pick one open good first issue. Locate the code it concerns.

What you might wonder

"I read it. I don't fully understand it." That's fine. Goal is geography, not mastery. You should know roughly where things live. Mastery comes from changes.

"The code uses techniques I haven't learned (mixin classes, metaclasses, etc.)." Note them. Don't get stuck. Modify a small piece first.

"It uses CUDA / accelerate / DeepSpeed. I can't run on my laptop." You can still read and contribute. Many PRs are CPU-testable. Look for @require_torch_gpu decorators on tests - those are GPU-only; the rest you can run.

Done

  • Read a real AI OSS repo with a plan.
  • Know the typical layout.
  • Have a target issue.

Next: Your first contribution →

15 - Your First Contribution

What this session is

The whole thing. Walk through an AI OSS contribution end-to-end.

The workflow

  1. Fork on GitHub.
  2. Clone your fork.
  3. Add upstream as remote.
  4. Branch off main.
  5. Set up the dev environment (install with extras; run tests).
  6. Change the file(s).
  7. Run lint + tests locally.
  8. Push to your fork; open PR.

Step 1: Fork & clone

git clone git@github.com:<you>/peft.git
cd peft
git remote add upstream git@github.com:huggingface/peft.git
git fetch upstream

Step 2: Branch

git checkout -b docs/fix-lora-example

Always a fresh branch off main.

Step 3: Set up dev environment

For most HF projects:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,test]"
pytest tests/ -x -q

For projects requiring GPU, run only CPU tests first:

pytest tests/ -x -q -m "not gpu"

If anything fails on a fresh clone, fix that first or ask in the issue.

Step 4: Make the change

Small. Focused. Tested.

  • Docs typo / clarification - edit the .md file in docs/source/.
  • Add an example - add a new file under examples/.
  • Fix a bug - change the code; add or update a test that proves the fix.

For a first PR, prefer the first two. Bug fixes are great once you know the project.

Step 5: Re-run CI's commands locally

Look in .github/workflows/tests.yml. Typical:

make quality              # ruff / black / isort
make test                 # pytest
make docs                 # sphinx build

All green? Push. Red? Fix locally first.

Step 6: Commit and push

git add <files>
git commit -m "docs: fix LoRA target_modules example for Llama"

DCO if required (git commit -s).

git push origin docs/fix-lora-example

Step 7: Open the PR

On upstream repo, "Compare & pull request."

  • Title. Short, descriptive. Conventional-commit style if the project uses it.
  • Description. What changed, why, how tested. Closes #123 references the issue.
  • Checklist. Address every item in the PR template.

Submit. CI runs. Fix anything red by pushing more commits.

Worked example: typo in PEFT LoRA docs

Suppose you noticed docs/source/conceptual_guides/lora.md has an outdated target_modules=["query_key_value"] example that no longer applies to current Llama configs.

git clone git@github.com:<you>/peft.git
cd peft
git remote add upstream git@github.com:huggingface/peft.git
git fetch upstream

git checkout -b docs/lora-target-modules-llama

# Edit docs/source/conceptual_guides/lora.md
# Add a note: "For Llama-style models, use ['q_proj','v_proj']."

make quality
make docs

git add docs/source/conceptual_guides/lora.md
git commit -m "docs: clarify LoRA target_modules for Llama-style models"
git push origin docs/lora-target-modules-llama

Open PR. Wait for review.

What review looks like

  1. "LGTM, merging." Done.
  2. "Could you change these?" Address. Push commits to same branch.
  3. "Not quite - we already have a section for this." Update or close.
  4. Silence for a week → polite check-in comment.

HF teams are responsive (usually within days).

After the merge

  • Update your fork's main:
    git checkout main
    git fetch upstream
    git merge upstream/main
    git push origin main
    
  • Delete the branch.
  • Take a screenshot.
  • Sit with it.

After your first PR

  1. Pick another issue. Familiarity compounds - second is much easier.
  2. After 3-5 PRs in one project, become a regular. Review others' PRs.
  3. Pick a model architecture you care about. Contribute an integration.
  4. Move toward research code: paper implementations, training-script improvements.

What you might wonder

"PR sits for weeks?" HF responds fast. Other AI projects (research orgs, slower-paced labs) can take longer. Polite check-in after 7-10 days.

"What about PyTorch core?" Larger surface, more rigorous review. CLA required, RFCs for non-trivial changes. Start with the docs/ tree there.

"What about OpenAI / Anthropic SDKs?" Yes, they accept PRs to their clients (openai-python, anthropic-sdk-python). Closed-source models, open-source clients.

"Maintainer rude?" Disengage. Try another project. AI OSS has many welcoming homes.

Done with this path

You've:

  • Installed PyTorch and the AI Python stack.
  • Trained a small neural net on MNIST.
  • Used Hugging Face for text generation.
  • Fine-tuned a model with LoRA.
  • Built a small RAG pipeline.
  • Evaluated outputs honestly.
  • Served a model locally.
  • Read a real AI OSS project.
  • Submitted a PR.

What you should do next: build a small AI tool you actually want to exist. The technology rewards practice. Pick one problem, build the simplest possible solution, iterate.

Recommended next paths on this site:

Congratulations. You are no longer a beginner.