03 - Linear Algebra You Actually Need¶
What this session is¶
About 30 minutes. The math for neural networks at the intuitive level. No proofs. By the end you'll know what a dot product, matrix multiply, and gradient are - and what they mean in ML code.
Dot product¶
Take two vectors a and b of the same length, multiply corresponding elements, and sum: a · b = a₁b₁ + a₂b₂ + … + aₙbₙ.
A single number. Measures how aligned the vectors are: large when they point the same way; zero when perpendicular; negative when opposite.
import torch
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
print(torch.dot(a, b)) # 32.0
# (1*4 + 2*5 + 3*6 = 32)
Why it matters: the simplest neuron computes a dot product between its inputs and its weights, adds a bias, applies a nonlinearity. Every neural network is built up from this.
Matrix multiplication¶
Treat a matrix as a stack of row vectors (or column vectors). Matrix multiplication A @ B:
- The entry at row i, column j of A @ B is the dot product of row i of A and column j of B.
Shape rule: (m, k) @ (k, n) = (m, n). The inner dimensions match; the outer dimensions become the result's shape.
A = torch.tensor([[1.0, 2.0],
[3.0, 4.0]])
B = torch.tensor([[5.0, 6.0],
[7.0, 8.0]])
print(A @ B)
# tensor([[19., 22.],
# [43., 50.]])
# 1*5 + 2*7 = 19, 1*6 + 2*8 = 22, etc.
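The 2×2 example hides the shape rule, so here's a quick sketch with distinct dimensions (shapes chosen arbitrarily):
M = torch.randn(2, 3) # (m, k) = (2, 3)
N = torch.randn(3, 4) # (k, n) = (3, 4)
print((M @ N).shape) # torch.Size([2, 4]) - the inner 3s match, the outer dims survive
# torch.randn(2, 3) @ torch.randn(4, 5) would raise a RuntimeError: inner dims 3 and 4 don't match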
Why it matters: an entire neural network layer is output = input @ weights + bias. Matmul is what GPUs are designed to accelerate; everything else is supporting infrastructure.
Transpose¶
Swap rows and columns:
A = torch.tensor([[1, 2, 3], [4, 5, 6]]) # shape (2, 3)
print(A.T) # shape (3, 2)
# tensor([[1, 4],
# [2, 5],
# [3, 6]])
Often used to make shapes line up for matrix multiplication.
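For example (a toy sketch - shapes chosen arbitrarily): two matrices of the same shape can't be multiplied directly, but transposing one makes the inner dimensions line up:
X = torch.randn(2, 3)
Y = torch.randn(2, 3)
# X @ Y fails: (2, 3) @ (2, 3) - inner dimensions 3 and 2 don't match
print((X @ Y.T).shape) # (2, 3) @ (3, 2) → torch.Size([2, 2])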
A neuron¶
A single artificial neuron computes activation(input · weights + bias): input and weights are vectors of the same length, bias is a single number, and activation is a nonlinear function (relu, sigmoid, tanh, etc.).
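In code, a minimal sketch (relu and the specific values are arbitrary choices here):
input = torch.tensor([1.0, 2.0, 3.0])
weights = torch.tensor([0.5, -1.0, 2.0])
bias = 0.5
output = torch.relu(torch.dot(input, weights) + bias) # relu as the activation
print(output) # tensor(5.) - 1*0.5 + 2*(-1.0) + 3*2.0 + 0.5 = 5.0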
A layer of n neurons is just n of these stacked - equivalent to one big matmul:
batch_size = 4
input_dim = 10
output_dim = 5
x = torch.randn(batch_size, input_dim) # (4, 10)
W = torch.randn(input_dim, output_dim) # (10, 5)
b = torch.randn(output_dim) # (5,)
out = x @ W + b # (4, 5)
x @ W is (4, 10) @ (10, 5) → (4, 5). The bias b broadcasts across the batch.
That's all a dense layer (also called a "linear layer" or "fully-connected layer") does. Everything else is variations.
Nonlinearity¶
Without a nonlinearity between layers, stacking matmuls collapses to a single matmul (a composition of linear maps is itself linear). A nonlinear function applied element-wise between layers restores the network's expressive power.
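You can verify the collapse directly - a small sketch with arbitrary random matrices:
x = torch.randn(1, 4)
W1 = torch.randn(4, 4)
W2 = torch.randn(4, 4)
print(torch.allclose((x @ W1) @ W2, x @ (W1 @ W2), atol=1e-5)) # True: two linear layers = one
print(torch.allclose(torch.relu(x @ W1) @ W2, x @ (W1 @ W2), atol=1e-5)) # almost always False: relu breaks the collapse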
Common nonlinearities:
- ReLU - max(0, x). Cheap, effective, the default for hidden layers.
- GELU - smoother ReLU. Used heavily in transformers.
- Sigmoid - 1 / (1 + exp(-x)). Outputs in (0, 1). Used for binary classification outputs.
- Softmax - normalizes a vector to sum to 1. Used for multi-class classification outputs.
You'll mostly use ReLU or GELU in hidden layers; softmax in output.
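Here's each of these applied to the same tensor (values chosen arbitrarily):
import torch.nn.functional as F
x = torch.tensor([-2.0, 0.0, 3.0])
print(torch.relu(x)) # tensor([0., 0., 3.]) - negatives clamped to zero
print(F.gelu(x)) # ≈ tensor([-0.0455, 0.0000, 2.9960]) - like relu, but smooth
print(torch.sigmoid(x)) # tensor([0.1192, 0.5000, 0.9526]) - each element squashed into (0, 1)
print(F.softmax(x, dim=0)) # all elements positive, summing to 1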
Gradient (intuitively)¶
A gradient is "the slope of a function at a point, in N dimensions." For a single-variable function f(x), the gradient is the derivative f'(x). For a multi-variable function L(w₁, w₂, ..., wₙ), it's a vector - one partial derivative per variable.
Why it matters: training a network is "minimize the loss function." The gradient of the loss with respect to the weights tells you "if I nudge each weight in the direction opposite the gradient, the loss decreases." That's gradient descent.
You don't compute gradients by hand. PyTorch's autograd does it for you - every operation you do on tensors with requires_grad=True is tracked, and .backward() walks the graph computing gradients automatically.
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1
y.backward() # computes dy/dx
print(x.grad) # 2*x + 3 = 7
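And to see the "nudge opposite the gradient" claim in action, a sketch continuing from the snippet above (the learning rate 0.1 is an arbitrary choice):
with torch.no_grad():
    x -= 0.1 * x.grad # step opposite the gradient: 2.0 - 0.1 * 7 = 1.3
y = x ** 2 + 3 * x + 1
print(y) # tensor([6.5900], ...) - down from 11.0 at x = 2.0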
Page 05 uses this in a training loop. For now, just know: gradients let you adjust weights to reduce loss.
Vectors in geometry vs in ML¶
In math classes, vectors had geometric meaning - points, directions, magnitudes. In ML, a vector is just a list of features. A user's embedding might be 1536 numbers - no geometric interpretation, but the "directions" still capture meaningful similarities (cosine of the angle between two user embeddings = how similar they are in the model's learned space).
The math is the same - the interpretation is "feature-space similarity," not "physical space."
Cosine similarity¶
The dot product of two normalized vectors (each with length 1):
import torch.nn.functional as F
a = torch.randn(100)
b = torch.randn(100)
sim = F.cosine_similarity(a, b, dim=0)
print(sim) # between -1 and 1
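To connect this back to the definition, continuing from the snippet above - normalizing each vector to unit length and taking the dot product gives the same number:
manual = torch.dot(a / a.norm(), b / b.norm()) # normalize to length 1, then dot
print(torch.allclose(sim, manual)) # True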
A standard "how similar are these two embeddings" metric. Used heavily in RAG (page 10).
What you'll never need from a math course¶
- Eigenvalues / eigenvectors (occasionally relevant; not for daily work).
- Singular Value Decomposition (used in LoRA fine-tuning page 09; we'll cover what you need).
- Convex analysis. Calculus of variations. Differential geometry.
Don't get nerd-sniped by Twitter saying you need to "understand linear algebra before doing ML." You need the operations on this page. The rest is for research, not engineering.
Exercise¶
- Dot product: create two random vectors of length 100. Compute their dot product manually (loop with sum) AND with torch.dot. Verify they match.
- Matmul shape check: create A of shape (3, 5) and B of shape (5, 7). What's the shape of A @ B? Verify in code.
- A neuron from scratch: pick an input vector, weights, and a bias, and compute activation(torch.dot(input, weights) + bias) with relu as the activation. What's the value? Why?
- Batch: create X of shape (8, 3) (a batch of 8 inputs, each 3-dimensional). Create W of shape (3, 5). Compute X @ W and inspect the shape. What does each row represent?
- Gradient: define f(x) = x³ - 4x² + 7x - 1. At x = 2.0, compute the gradient using PyTorch. (Hint: compute y = f(x), call y.backward(), then read x.grad. The math answer is 3x² - 8x + 7 = 3 at x = 2.)
What you might wonder¶
"I see lots of torch.bmm in code. What's that?"
Batched matrix multiplication - when you have a batch dimension. bmm is (B, m, k) @ (B, k, n) → (B, m, n). Common in transformers' attention.
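A quick shape check (batch size and dims chosen arbitrarily):
A = torch.randn(8, 3, 5) # (B, m, k)
B = torch.randn(8, 5, 7) # (B, k, n)
print(torch.bmm(A, B).shape) # torch.Size([8, 3, 7])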
"What's torch.einsum?"
Einstein summation notation - a powerful, terse way to express tensor operations. torch.einsum("ij,jk->ik", A, B) is matmul. Worth learning once you've seen the same matmul pattern enough times.
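A quick sanity check that the einsum spelling matches the operator:
A = torch.randn(3, 4)
B = torch.randn(4, 5)
print(torch.allclose(torch.einsum("ij,jk->ik", A, B), A @ B)) # True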
"How does a network 'know' which way to adjust weights?" The gradient gives the direction of steepest increase. Going opposite the gradient (gradient descent) decreases the loss locally. That's all. The magic is that this simple rule works in millions of dimensions.
Done¶
- Dot product, matrix multiplication, transpose.
- A single neuron and a dense layer.
- Common nonlinearities.
- What a gradient is and why it matters.
- Recognizing what's NOT essential math for ML engineering.
Next: Your first neural network →