01-Linear Algebra

Why this matters in the journey

Every modern ML model-from a linear regression to GPT-4-is, mechanically, a stack of matrix multiplications and elementwise nonlinearities. You don't need to be a mathematician, but you need a visual and computational grasp of vectors, matrices, dot products, projections, and matrix multiplication-as-composition. Without it, attention is a black box, embeddings are mysterious, and you'll plateau as a tinkerer who can't debug. The goal of this sequence is fluent intuition + comfortable computation-not proofs.

The rungs

Rung 01-Vectors as arrows AND as lists

  • What: A vector has two complementary mental models: a geometric arrow with magnitude and direction, and a list of numbers in some basis.
  • Why it earns its place: Embeddings are vectors. Token representations are vectors. You need to fluently switch between geometric ("similarity = cosine of angle") and computational ("similarity = sum of products") views.
  • Resource: 3Blue1Brown-Essence of Linear Algebra, episode 1 ("Vectors, what even are they?"). Search YouTube for "3blue1brown essence of linear algebra".
  • Done when: You can describe a 3D vector as both an arrow and a list, and explain why both views are useful.

Rung 02-Vector operations: addition, scalar multiplication, dot product

  • What: Add vectors tip-to-tail; scale by a number; dot product = sum of elementwise products = ‖a‖‖b‖cosθ.
  • Why it earns its place: Dot products are how attention scores tokens, how embeddings measure similarity, and how a single neuron computes its pre-activation.
  • Resource: 3Blue1Brown episodes 2 (linear combinations, span, basis) and 9 (dot products). Plus Khan Academy "Linear Algebra → Vectors and Spaces" for problems.
  • Done when: Given two vectors [1, 2, 3] and [4, 5, 6] you can compute the dot product by hand AND explain what it means geometrically (a NumPy check follows this list).
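
A minimal NumPy check of both views, using the vectors from the "Done when" item:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Computational view: sum of elementwise products.
dot_as_sum = float(np.sum(a * b))             # 1*4 + 2*5 + 3*6 = 32

# Geometric view: dot = ||a|| ||b|| cos(theta), so cos(theta) falls out of the same numbers.
cos_theta = dot_as_sum / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot_as_sum, np.dot(a, b), cos_theta)    # 32.0, 32.0, ~0.97 (nearly parallel vectors)
```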

Rung 03-Matrices as linear transformations

  • What: A matrix takes a vector and returns a new vector. It rotates, stretches, projects, or reflects space.
  • Why it earns its place: A neural network layer is a matrix multiplication followed by a nonlinearity. Once you see matrices as transformations, network architectures become geometric, not mysterious.
  • Resource: 3Blue1Brown episodes 3 ("Linear transformations and matrices") and 4 ("Matrix multiplication as composition"). These are the episodes that unlock deep learning intuition.
  • Done when: You can explain why a 2×2 matrix represents a linear transformation, and why matrix multiplication is composition of transformations (not "rows times columns" mechanically). A small rotation example is sketched below.
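
One way to make the "matrices transform space" picture concrete is to watch a single arrow move. The 90° rotation here is an arbitrary illustrative choice; any 2×2 matrix would do:

```python
import numpy as np

# A 2x2 rotation by 90 degrees counter-clockwise. Its columns record where the
# basis vectors [1, 0] and [0, 1] land, which is all a linear map needs to specify.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])

v = np.array([2.0, 1.0])
print(R @ v)   # [-1.  2.] -- the arrow (2, 1) rotated a quarter turn
```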

Rung 04-Matrix multiplication, fluently

  • What: C = AB means: apply transformation B, then A. Shape rule: (m×n)(n×p) = (m×p).
  • Why it earns its place: Every neural network forward pass is a chain of matrix multiplications. You need shape arithmetic in your bones to debug "why doesn't this fit."
  • Resource: 3Blue1Brown episode 4. For computational fluency: do 30 problems from Khan Academy "Matrix multiplication."
  • Done when: Given matrix shapes you can predict the output shape without thinking, and you can multiply 2×2 matrices by hand quickly (a shape-arithmetic check is sketched below).
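
A quick shape-arithmetic and composition check in NumPy; the shapes below are arbitrary illustrative choices:

```python
import numpy as np

A = np.random.randn(4, 3)    # shape (m=4, n=3)
B = np.random.randn(3, 5)    # shape (n=3, p=5)
C = A @ B
print(C.shape)               # (4, 5): (m x n)(n x p) -> (m x p)

# Composition: applying B to a vector and then A is the same as applying A @ B once.
x = np.random.randn(5)
print(np.allclose(A @ (B @ x), (A @ B) @ x))   # True (up to floating-point error)
```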

Rung 05-Determinants, inverses, identity, transpose

  • What: Determinant = how much a matrix scales area/volume. Inverse undoes a transformation (when invertible). Transpose swaps rows and columns.
  • Why it earns its place: Transpose appears constantly in backprop and attention (Q · Kᵀ). Inverses come up in classical ML; determinants come up in probability and Jacobians.
  • Resource: 3Blue1Brown episodes 6 ("The determinant") and 7 ("Inverse matrices, column space and null space").
  • Done when: You can explain why (AB)ᵀ = BᵀAᵀ, and why transpose appears in attention (both are checked numerically below).
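
A numerical sanity check of the transpose identity, plus the shape bookkeeping behind Q · Kᵀ (the sequence length and dimension are made up for illustration):

```python
import numpy as np

A = np.random.randn(2, 3)
B = np.random.randn(3, 4)
print(np.allclose((A @ B).T, B.T @ A.T))   # True: (AB)^T = B^T A^T

# Attention-style transpose: Q and K both have shape (seq, d), so Q @ K.T is a
# (seq, seq) table holding the dot product of every query with every key.
Q = np.random.randn(6, 8)
K = np.random.randn(6, 8)
print((Q @ K.T).shape)                     # (6, 6)
```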

Rung 06-Vector spaces, basis, dimension, rank

  • What: A vector space is closed under addition and scaling. A basis is a minimal set of vectors that spans it. Rank = dimension of the column space.
  • Why it earns its place: Embedding dimension = the dimension of the vector space your tokens live in. LoRA fine-tuning adds a low-rank update to the weight matrices (a parameter-count sketch follows this list). Understanding rank is non-negotiable.
  • Resource: 3Blue1Brown episode 2 (revisit) and Gilbert Strang MIT 18.06 lecture 9 (search YouTube "Strang independence basis dimension").
  • Done when: You can explain what "low rank" means and why a low-rank update is cheap.
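
A back-of-the-envelope sketch of why a low-rank update is cheap, using a LoRA-style factorization ΔW = BA; the hidden size and rank below are illustrative assumptions, not values from any particular model:

```python
import numpy as np

d, r = 1024, 8                        # hidden size and a small rank (both illustrative)

full_params = d * d                   # a dense d x d update: 1,048,576 parameters

# Low-rank update: delta_W = B @ A with B of shape (d, r) and A of shape (r, d).
B = np.random.randn(d, r)
A = np.random.randn(r, d)
delta_W = B @ A                       # still a (d, d) matrix, but its rank is at most r
low_rank_params = d * r + r * d       # 16,384 parameters, ~1.6% of the dense update

print(full_params, low_rank_params, np.linalg.matrix_rank(delta_W))   # ..., ..., 8
```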

Rung 07-Eigenvalues and eigenvectors

  • What: An eigenvector of A is a direction that A only stretches (doesn't rotate); the stretch factor is its eigenvalue.
  • Why it earns its place: PCA is eigendecomposition of the covariance matrix. SVD generalizes eigendecomposition to non-square matrices and is used everywhere from dimensionality reduction to model compression.
  • Resource: 3Blue1Brown episode 14 ("Eigenvectors and eigenvalues").
  • Done when: You can explain in one sentence why eigenvectors matter for PCA (a small worked example follows).
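
A small worked example of the PCA connection, on synthetic 2-D data stretched along one axis:

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 2-D points with much more spread along the x-axis than the y-axis.
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.5])

cov = np.cov(X, rowvar=False)             # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh is the right routine for symmetric matrices

top_direction = eigvecs[:, np.argmax(eigvals)]
print(eigvals)          # roughly [0.25, 9.0]: the variances along the two axes
print(top_direction)    # roughly [±1, 0]: the direction of greatest variance
```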

Rung 08-Singular Value Decomposition (SVD)

  • What: Any matrix can be decomposed as A = UΣVᵀ, with U and V orthogonal and Σ diagonal with the singular values.
  • Why it earns its place: SVD is the mathematical heart of low-rank approximation, recommendation systems, and is conceptually behind why LoRA works. Many modern compression / quantization tricks rest on SVD.
  • Resource: Strang MIT 18.06 lectures on SVD (search YouTube "Strang singular value decomposition"). Plus 3Blue1Brown ch. 16 (Abstract vector spaces) for a perspective shift.
  • Done when: You can sketch the geometric picture of SVD (rotate, scale, rotate) and explain what a "rank-k approximation" is (one is computed below).
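
A minimal rank-k approximation with NumPy's SVD; the matrix size and the choice k = 3 are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 10))

U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(S) @ Vt

k = 3                                              # keep only the top-3 singular values
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]        # best rank-k approximation in the least-squares sense

print(np.linalg.matrix_rank(A_k))                  # 3
print(np.linalg.norm(A - A_k), np.linalg.norm(A))  # approximation error vs. total size
```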

Rung 09-Norms, distances, projections

  • What: L1, L2, L∞ norms; Euclidean distance; orthogonal projections.
  • Why it earns its place: L2 regularization, L1 sparsity, cosine distance for embeddings, projection in attention-these terms appear in every paper. You need them automatic.
  • Resource: Khan Academy "Vectors and Spaces → Vector dot and cross products" + the L1/L2 sections in Deep Learning (Goodfellow) chapter 2.
  • Done when: Without notes you can: define cosine similarity, define L2 norm, and explain why L2 regularization "shrinks" weights (the first two are written out in code below).
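
The cosine-similarity and norm definitions from the "Done when" item, written out in NumPy:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])          # same direction as a, twice as long

l1 = np.sum(np.abs(a))                 # L1 norm: sum of absolute values -> 6.0
l2 = np.linalg.norm(a)                 # L2 norm: sqrt of the sum of squares -> ~3.74
linf = np.max(np.abs(a))               # L-infinity norm: largest absolute entry -> 3.0

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(l1, l2, linf, cosine)            # cosine is 1.0: the direction matches even though the lengths differ
```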

Rung 10-Tensors as the multi-dimensional generalization

  • What: A tensor is a multi-dimensional array. Rank-0 = scalar, rank-1 = vector, rank-2 = matrix, rank-3+ = tensor proper. (Here "rank" means the number of axes, not the matrix rank from Rung 06.)
  • Why it earns its place: PyTorch programming is tensor programming. Your batch dimension, your sequence dimension, your hidden dimension-these are tensor axes.
  • Resource: PyTorch official tutorials, "Tensors" section. Search "pytorch tutorials tensors".
  • Done when: Given a tensor of shape (batch, seq, hidden), you can predict the shape after tensor.transpose(0, 1) or tensor.reshape(...) without running it (a few such predictions are checked below).
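
A shape-prediction sketch in PyTorch; the dimension sizes are made up:

```python
import torch

x = torch.randn(32, 128, 768)              # (batch, seq, hidden)

print(x.transpose(0, 1).shape)             # torch.Size([128, 32, 768]): batch and seq swapped
print(x.reshape(32, -1).shape)             # torch.Size([32, 98304]): seq and hidden flattened together
print((x @ torch.randn(768, 64)).shape)    # torch.Size([32, 128, 64]): matmul acts on the last axis
```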

Minimum required to leave this sequence

  • Compute a dot product by hand and explain it geometrically.
  • Multiply two matrices by hand for shapes up to 3×3.
  • Predict output shapes through a chain of 3 matrix multiplications.
  • Explain in one paragraph why attention uses Q · Kᵀ (a minimal sketch follows this list).
  • Explain "low rank" and why LoRA is parameter-efficient.
  • Define cosine similarity and explain when you'd use it instead of dot product.
  • Comfortable with PyTorch tensor reshape / transpose / matmul ops.
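
To tie several of these together, here is a minimal single-head scaled dot-product attention sketch in NumPy. It illustrates the shape flow and why Q · Kᵀ appears; it is not a reference implementation (no masking, no batching, random data, made-up sizes):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)        # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

seq, d = 5, 8                                      # illustrative sizes
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq, d)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)                      # (seq, seq): every query dotted with every key
weights = softmax(scores, axis=-1)                 # each row sums to 1
output = weights @ V                               # (seq, d): a weighted mix of the value vectors

print(scores.shape, weights.shape, output.shape)   # (5, 5) (5, 5) (5, 8)
```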

Going further (only after the minimum)

  • Gilbert Strang-Linear Algebra and Its Applications (book)-the canonical reference; do problems from chapters 1, 2, 5, 6.
  • MIT 18.06-Linear Algebra (full Strang lectures, free on MIT OCW).
  • Mathematics for Machine Learning-Deisenroth, Faisal, Ong (free PDF online)-chapters 2–4 are gold.
  • Computational linear algebra: fast.ai's "Computational Linear Algebra" course (free, Rachel Thomas)-ties theory to NumPy.

How this sequence connects to the year

  • Months 1–3: rungs 01–10 are the prerequisites for every foundation week. Don't skip.
  • Month 3: rungs 04, 05, 09 are what make attention (softmax(QKᵀ/√d)V) make sense.
  • Month 8: rungs 07–08 underpin LoRA / QLoRA fine-tuning.
  • Month 9: rungs 06, 08 underpin parts of distributed-training-and-quantization theory.
