Month 1-Week 1: Vectors, dot products, and your first ML model from scratch¶
Week summary¶
- Goal: Build geometric intuition for vectors and dot products, internalize the cosine identity, and ship NumPy linear regression as your first ML artifact.
- Time: ~9–11 hours over 3 sessions.
- Output: Public repo `ml-from-scratch` containing a working linear regression notebook with a hand-derived loss gradient.
- Sequences relied on: 01-linear-algebra rungs 01–04, 02-calculus rungs 01 and 05, 04-python-for-ml rungs 01–02.
Why this week matters in your AI expert journey¶
Every model you'll ever encounter-from a humble logistic regressor to GPT-class transformers-is, mechanically, layered matrix multiplications and elementwise nonlinearities. The dot product is the atom. A single neuron is a dot product. An attention score is a dot product. Cosine similarity in embeddings is a dot product. If the dot product is automatic for you geometrically and computationally, the rest of the year compounds easily. If it's not, every later session will feel slightly mysterious.
The linear regression artifact is your end-of-week proof: you took a real dataset, defined a loss, derived its gradient on paper, implemented gradient descent in NumPy, and watched it converge. That's the loop all training does. Once it's tactile, you can read training scripts.
Prerequisites¶
- Comfortable Python (lists, dicts, list comprehensions, importing libraries).
- High-school algebra. We will refresh calculus from scratch.
- A working Python environment with NumPy, matplotlib, and Jupyter. If not: install `uv`, then `uv pip install numpy matplotlib jupyter`.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h)
- Session B-Sat morning (~3–4 h)
- Session C-Sun afternoon (~2–3 h)
Session A-Vectors and the dot product, geometrically and algebraically¶
Goal: Build the two-views model for vectors (arrows + lists), internalize the cosine identity, and predict dot product signs without computing.
Pre-flight: None.
Arc: What is a vector → operations on vectors → the cosine identity that ties algebra and geometry → why this is the atom of every neural network.
Part 1-Two views of a vector (45 min)¶
A vector has two complementary mental models:
1. Geometric: an arrow with magnitude (length) and direction. Lives in space.
2. Algebraic: an ordered list of numbers-coordinates in some basis.
These are the same object viewed two ways. Real fluency means switching between them without thinking.
Why this matters for AI
Every embedding-token, sentence, image-is a vector in a high-dimensional space. Geometrically, similar things are nearby. Algebraically, an embedding is just a list of floats. The famous "king − man + woman ≈ queen" is geometry done on the algebraic representation. You cannot do this work without holding both views simultaneously.
Watch
- 3Blue1Brown, Essence of Linear Algebra, Episode 1 ("Vectors, what even are they?")-~10 min.
- Episode 2 ("Linear combinations, span, and basis vectors")-~10 min.
- Search YouTube for "3blue1brown essence of linear algebra".
Worked example
For v = [3, 4]:
- Geometric: arrow from origin pointing into the first quadrant.
- Length: ‖v‖ = √(3² + 4²) = √25 = 5.
- Both views give the same length. Verify: np.linalg.norm([3, 4]) == 5.0.
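If you want the check side by side in code, here is a minimal sketch (variable names are illustrative, not from the plan):

```python
import numpy as np

v = np.array([3, 4])

# Algebraic view: length computed from the coordinates directly
length_by_hand = np.sqrt(np.sum(v ** 2))   # √(3² + 4²)

# Same length via NumPy's norm
length_numpy = np.linalg.norm(v)

print(length_by_hand, length_numpy)        # 5.0 5.0
```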
Self-check (must pass before continuing)
1. The vector [1, 0, 0, 0, 0] lives in what kind of space? What does "5-dimensional" mean intuitively?
2. What's the algebraic representation of an arrow at 60° from the x-axis with length 2?
3. Why are both representations needed? (Hint: think about doing the math vs. building intuition.)
Part 2-Vector operations (60 min)¶
Three operations matter:
1. Addition (a + b): tip-to-tail geometrically; elementwise algebraically. [1,2] + [3,4] = [4,6].
2. Scalar multiplication (c · a): geometric stretch by c; elementwise scaling.
3. Dot product (a · b): produces a scalar. Algebraically: Σᵢ aᵢbᵢ. Geometrically: see Part 3.
Why this matters for AI
Every neural network layer has the form output = σ(Wx + b). The Wx is a stack of dot products: each row of W dotted with x. So a single neuron's pre-activation is a dot product. Once you see this, every architecture diagram becomes literal arithmetic.
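To make that concrete, here is a minimal NumPy sketch (shapes and numbers are illustrative, not from the plan) showing that `W @ x` is just one dot product per row of `W`:

```python
import numpy as np

# A tiny "layer": 3 neurons, 4 inputs (illustrative shapes)
W = np.array([[1., 0., 2., -1.],
              [0., 3., 1.,  1.],
              [2., 1., 0.,  4.]])
x = np.array([1., 2., 0., -1.])

# The matrix-vector product...
Wx = W @ x

# ...equals one dot product per row of W (one per neuron)
row_dots = np.array([W[i] @ x for i in range(W.shape[0])])
print(np.allclose(Wx, row_dots))   # True
```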
Watch
- 3B1B Episode 9 ("Dot products and duality")-the critical episode. ~15 min.
Worked example
```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Three equivalent computations
print(np.dot(a, b))                              # 32
print(a @ b)                                     # 32 (preferred modern syntax)
print(sum(a[i] * b[i] for i in range(len(a))))   # 32
```
Self-check
1. Compute [2, 0] · [0, 3] from geometry. (Hint: angle?) Verify with NumPy.
2. What's [1,1,1,1,1] · [2,2,2,2,2]? Predict before computing.
3. If a · b > 0, what does this say about the angle between them?
Part 3-The cosine identity (the level you need) (50 min)¶
The single equation that ties algebra to geometry:
a · b = ‖a‖ ‖b‖ cos(θ), where θ is the angle between a and b.
Read this carefully. The left side is purely algebraic (sum of products). The right side is purely geometric (lengths and an angle). They are equal. This is the bridge between the two views of a vector.
Implications
- θ = 0 (parallel, same direction): cos(0) = 1 ⇒ a·b = ‖a‖‖b‖ (max possible).
- θ = 90° (perpendicular): cos(90°) = 0 ⇒ a·b = 0.
- θ = 180° (parallel, opposite): cos(180°) = -1 ⇒ a·b = -‖a‖‖b‖ (min).
So: the sign and magnitude of the dot product encode the angle. That's the full meaning of "alignment."
Cosine similarity (the metric you'll see daily)
Normalize both vectors to length 1, and the dot product equals cos(θ): cos_sim(a, b) = (a · b) / (‖a‖ ‖b‖).
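A minimal sketch of the metric (the function name is mine, not a library API):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product after normalizing both vectors to unit length, i.e. cos(θ)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [0, 1]))    # 0.0    (perpendicular)
print(cosine_similarity([1, 0], [1, 1]))    # ≈0.707 (45°)
print(cosine_similarity([1, 0], [-3, 0]))   # -1.0   (opposite direction)
```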
Why this matters for attention
The attention mechanism in transformers computes score(query, key) = (q · k) / √d. The score is high when q and k "point in the same direction." This is what attention "attends to." Without the dot product, no transformer.
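As a sketch of the idea (the dimension and vectors below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                          # embedding dimension (illustrative)
q = rng.normal(size=d)          # a query vector
K = rng.normal(size=(5, d))     # five key vectors

scores = (K @ q) / np.sqrt(d)   # one scaled dot-product score per key (pre-softmax)
print(scores)
```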
Worked example
For a = [1, 0], b = [1, 1]:
- Algebraic: a·b = 1·1 + 0·1 = 1.
- Geometric: ‖a‖=1, ‖b‖=√2, angle = 45°, so 1·√2·cos(45°) = √2 · (1/√2) = 1. ✓
Self-check
1. Compute the cosine similarity of [1,2] and [2,4]. Predict before computing.
2. Two embedding vectors have a dot product of 0. What does that mean about the words?
3. Why does attention divide by √d? (Hint: as dimension grows, dot product magnitudes grow; we want a stable distribution before softmax. See the sketch below.)
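A small experiment (entirely my own sketch) shows the hint from question 3 empirically: for random vectors, the raw dot product's spread grows like √d, and dividing by √d stabilizes it.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (4, 64, 1024):
    q = rng.normal(size=(10_000, d))
    k = rng.normal(size=(10_000, d))
    dots = np.einsum('nd,nd->n', q, k)                 # 10,000 dot products per dimension d
    print(d, dots.std(), (dots / np.sqrt(d)).std())    # raw std ≈ √d; scaled std ≈ 1
```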
Common pitfalls in Session A¶
- Confusing the dot product with elementwise multiplication. In NumPy, `a*b` is elementwise; `a@b` or `np.dot(a, b)` is the dot product.
- Treating "angle between vectors" as only meaningful in 2D/3D. It's well-defined in 768 dimensions too.
- Skipping the geometric view in favor of computation. The intuition is what compounds for the rest of your career.
Output of Session A¶
Append to notes/tutorials.ipynb:
- The cosine identity written down with worked example.
- A small experiment: compute cosine similarity for 4 vector pairs (parallel, perpendicular, opposite, 45°)-verify each matches the formula's prediction.
- Self-check answers in markdown.
Session B-Linear regression from scratch in NumPy¶
Goal: Implement linear regression with gradient descent, deriving the gradient by hand. End with a working 01-linear-regression.ipynb and convergence to known coefficients.
Pre-flight: Session A complete.
Arc: Derivative intuition → derive the loss gradient → implement gradient descent → verify convergence → build sensitivity to learning rate.
Part 1-Derivatives and gradient descent intuition (45 min)¶
A derivative df/dx is the slope of the tangent line at x - the instantaneous rate of change. Gradient descent says: to minimize f, step in the direction of −df/dx.
Watch
- 3B1B Essence of Calculus, Episodes 1, 2, 3-~30 min total.
- 3B1B Neural Networks, Episode 2 ("Gradient descent, how neural networks learn")-~20 min.
Code (warmup)
1D gradient descent on f(x) = (x−3)²:
```python
import numpy as np
import matplotlib.pyplot as plt

x = 0.0
lr = 0.1
trajectory = [x]
for _ in range(50):
    grad = 2 * (x - 3)   # df/dx
    x = x - lr * grad
    trajectory.append(x)

plt.plot(trajectory)
plt.axhline(3, color='r', linestyle='--')
plt.title('Gradient descent on (x-3)²')
plt.xlabel('iteration'); plt.ylabel('x')
```
x → 3 after ~30 iterations.
Part 2-Derive the linear regression loss gradient (60 min)¶
The setup. We have data {(xᵢ, yᵢ)}, i = 1…N. We want to fit ŷᵢ = w·xᵢ + b. Mean squared error loss: L(w, b) = (1/N) Σᵢ (yᵢ − w·xᵢ − b)².
The derivation (do this on paper).
For one term Lᵢ = (yᵢ − w·xᵢ − b)², let eᵢ = yᵢ − w·xᵢ − b (the residual). Then Lᵢ = eᵢ².
Apply the chain rule:
∂Lᵢ/∂w = 2eᵢ · ∂eᵢ/∂w = 2eᵢ · (−xᵢ) = −2 xᵢ eᵢ, and ∂Lᵢ/∂b = 2eᵢ · ∂eᵢ/∂b = 2eᵢ · (−1) = −2 eᵢ.
Average over N:
∂L/∂w = −(2/N) Σᵢ xᵢ eᵢ and ∂L/∂b = −(2/N) Σᵢ eᵢ.
Photograph this derivation and commit it to your repo. This is the artifact that proves you understand backprop's simplest case.
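A quick numerical sanity check of the derivation (my own sketch, not part of the week's required code): compare the analytic gradient with a centered finite-difference estimate at an arbitrary (w, b).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, 10)
y = 2 * X + 3 + rng.normal(0, 1, 10)
w, b = 0.5, -1.0                      # arbitrary point at which to check the gradient

def loss(w, b):
    return np.mean((y - (w * X + b)) ** 2)

# Analytic gradients from the derivation above
e = y - (w * X + b)
grad_w = -2 * np.mean(X * e)
grad_b = -2 * np.mean(e)

# Centered finite-difference estimates
h = 1e-6
num_w = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
num_b = (loss(w, b + h) - loss(w, b - h)) / (2 * h)
print(np.isclose(grad_w, num_w), np.isclose(grad_b, num_b))   # True True
```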
Part 3-Implement and verify (75 min)¶
```python
import numpy as np
import matplotlib.pyplot as plt

# Generate data: y = 2x + 3 + noise
rng = np.random.default_rng(42)
N = 100
X = rng.uniform(-5, 5, N)
y = 2 * X + 3 + rng.normal(0, 1, N)

# Initialize
w, b = 0.0, 0.0
lr = 0.01
losses = []

for step in range(1000):
    y_pred = w * X + b
    error = y - y_pred            # residuals
    loss = np.mean(error ** 2)
    losses.append(loss)
    grad_w = -2 * np.mean(X * error)
    grad_b = -2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b
    if step % 100 == 0:
        print(f"step={step:4d} loss={loss:.4f} w={w:.4f} b={b:.4f}")

print(f"Final: w={w:.4f}, b={b:.4f} (target: 2.0, 3.0)")

plt.subplot(1, 2, 1); plt.plot(losses); plt.title("Loss"); plt.xlabel("iteration")
plt.subplot(1, 2, 2); plt.scatter(X, y, alpha=0.3); plt.plot(X, w*X+b, 'r'); plt.title("Fit")
plt.tight_layout(); plt.show()
```
w → 2, b → 3, loss curve monotonically decreasing.
Sensitivity experiment: Run with lr ∈ {0.001, 0.01, 0.1, 0.5}. Plot loss curves on the same axes. Observe: too small = slow; too large = unstable; sweet spot exists.
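One way to run the sweep (a sketch: it regenerates the same data, resets w and b for each learning rate, and uses a log scale because the largest rate may diverge):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
N = 100
X = rng.uniform(-5, 5, N)
y = 2 * X + 3 + rng.normal(0, 1, N)

for lr in (0.001, 0.01, 0.1, 0.5):
    w, b = 0.0, 0.0                        # fresh start for every learning rate
    losses = []
    for _ in range(100):
        error = y - (w * X + b)
        losses.append(np.mean(error ** 2))
        w -= lr * (-2 * np.mean(X * error))
        b -= lr * (-2 * np.mean(error))
    plt.plot(losses, label=f"lr={lr}")

plt.yscale("log")                          # a diverging run would otherwise flatten the other curves
plt.xlabel("iteration"); plt.ylabel("MSE loss"); plt.legend(); plt.show()
```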
Common pitfalls in Session B¶
- Sign error in gradients. If the loss explodes, your sign is probably flipped-as a quick test, try `+= lr * grad`; if that converges, fix the sign in your gradient derivation.
- Forgetting to average over N. If N is large, unaveraged gradients are huge and you need a tiny `lr`.
- Plotting after one step instead of in a loop. The journey is the diagnostic.
Output of Session B¶
`ml-from-scratch/01-linear-regression.ipynb` with derivation, training, plots, and sensitivity study.
Session C-Polish, push, and consolidate¶
Goal: Public repo with clean README and your first ML notebook. Internalize what you've learned by explaining it back.
Pre-flight: Sessions A and B complete.
Part 1-Repo polish (45 min)¶
```bash
# from the parent directory
mkdir ml-from-scratch && cd ml-from-scratch
git init
# move your notebook in
```
Create README.md:
```markdown
# ml-from-scratch

A weekly journey building ML algorithms from scratch in NumPy as part of a 12-month AI engineer plan.

## Notebooks

- `01-linear-regression.ipynb` - gradient descent on MSE loss; hand-derived gradient.

## Why

Frame each algorithm in its simplest computational form before reaching for a framework.
```
Push to a public GitHub repo.
Part 2-Self-explanation (45 min)¶
Open a fresh markdown cell in the notebook. Without referencing your notes, write 300 words explaining:
1. What the dot product is (both views).
2. Why gradient descent moves opposite to the gradient.
3. Why the linear regression gradient has the form −2·x·error.
Compare your writing to your earlier notes. Where you drifted, re-read. This recall (not re-reading) is what cements knowledge.
Part 3-Forward-look + prep (45 min)¶
- Read M01-W02.md (next week). Note the prerequisites.
- Watch the first 10 minutes of 3B1B Episode 3 ("Linear transformations and matrices") as a primer.
- Update your `LEARNING_LOG.md` with one paragraph: "biggest insight of the week."
Output of Session C¶
- Public GitHub repo `ml-from-scratch` with notebook, README, and derivation photo.
- `LEARNING_LOG.md` started with one weekly entry.
End-of-week artifact¶
Public GitHub repo ml-from-scratch containing:
- [ ] 01-linear-regression.ipynb running end-to-end.
- [ ] Hand-derived gradient (photo or LaTeX) embedded.
- [ ] Sensitivity study with 4 learning rates.
- [ ] Clean README.
- [ ] First entry in LEARNING_LOG.md.
End-of-week self-assessment¶
- I can sketch the geometric meaning of the dot product without notes.
- I can predict the sign of a·b from the angle between a and b.
- I can derive the MSE gradient on a blank piece of paper.
- I can explain why we step in the direction of −∇L.
- My linear regression converges to known coefficients.
Common failure modes for this week¶
- Treating the math as decoration around the code. It isn't. The math IS the model. The code merely automates it.
- Skipping the hand derivation because "I get it." The derivation is the test. If you can't write it cold, you don't get it.
- Pushing a private repo or no repo at all. Public from day 1. The artifact only compounds if it's seen.
What's next (preview of M01-W02)¶
Calculus refresh + logistic regression-your first encounter with cross-entropy, the loss that powers every LLM. You will derive the binary cross-entropy gradient by hand and observe its elegant σ(z) − y form.