Month 3-Week 1: Karpathy makemore-bigrams to MLP, with diagnostic discipline¶
Week summary¶
- Goal: Complete Karpathy Zero to Hero lectures 2, 3, 4 (`makemore` parts 1–3): bigram model, MLP language model, and the diagnostic + initialization deep-dive. Build a character-level LM that produces plausible name-like strings.
- Time: ~10–12 h over 3 sessions (this is intentionally heavy; transformer week is next).
- Output: `transformer-from-scratch` updated with bigram + MLP-LM + initialization-experiments notebooks.
- Sequences relied on: 08-transformers rungs 01, 02; 07-deep-learning rungs 01, 03, 04; 03-probability-statistics rungs 04, 06.
Why this week matters¶
Karpathy's Zero to Hero is the best language-modeling pedagogy that exists. Lectures 2–4 are the runway to lecture 6 (the transformer build). Skipping them and jumping directly to attention causes confusion that cascades for weeks. This week pays the runway tax.
The diagnostic discipline taught in lecture 4 (looking at activations, gradients, and weight statistics during training to catch problems early) is what separates engineers who ship working models from those who train mysteries. You'll need it for every transformer you train.
Prerequisites¶
- M01 complete (manual MLP backprop).
- M02 complete (PyTorch fluency).
- M02-W04 lecture 2 may be done already; otherwise start there.
Recommended cadence¶
- Session A-Tue/Wed evening (~3.5 h): finish lectures 2 + 3
- Session B-Sat morning (~4 h): lecture 4 (initialization deep-dive)
- Session C-Sun afternoon (~3 h): diagnostic experiments + ship
Session A-Bigram + MLP language model¶
Goal: Finish Karpathy lectures 2 and 3; ship a character-level LM that produces non-random samples.
Part 1-Lecture 2 finish (60 min)¶
If you started in M02-W04, finish; if not, do the whole thing now.
Karpathy Zero to Hero lecture 2: makemore part 1, the bigram model.
- A bigram model is P(next_char | current_char): a `vocab × vocab` count matrix turned into probabilities.
- The "model" is just a lookup table. No weights, no training. Yet it captures meaningful structure.
- Why this matters: it sets the bar for what a "real" model needs to beat.
Run the notebook. Sample 20 names. Notice they're better than random but obviously bad.
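To make the lookup-table point concrete, here is a minimal sketch of the count-matrix idea; the tiny `names` list and the '.' start/end token convention are illustrative stand-ins, not the lecture's exact code:

```python
import torch

names = ["emma", "olivia", "ava"]            # tiny stand-in for the real dataset
chars = sorted(set("".join(names)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                                # start/end token
vocab = len(stoi)

# count character pairs
counts = torch.zeros((vocab, vocab), dtype=torch.int32)
for name in names:
    seq = ["."] + list(name) + ["."]
    for a, b in zip(seq, seq[1:]):
        counts[stoi[a], stoi[b]] += 1

# add-one smoothing, then normalize each row into P(next_char | current_char)
P = (counts + 1).float()
P = P / P.sum(dim=1, keepdim=True)
```

Sampling from this "model" is nothing more than repeated row lookups and draws, which is exactly why it sets a useful baseline.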
Part 2-Lecture 3 (90 min)¶
Karpathy lecture 3: makemore part 2, the MLP language model.
- Switch from a bigram to a context window: predict next char from previous N chars (e.g., 3 chars).
- Use embeddings (a learnable lookup table mapping char → vector). This is the first time you use them.
- Concatenate context embeddings → MLP → softmax over vocabulary.
Type along. Train. Sample.
Why embeddings matter (the rung you need to internalize): in a bigram, "a" and "b" are unrelated atoms. With learned embeddings, similar characters get similar vectors. Generalization improves. Token embeddings in transformers are the same idea, scaled up.
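A shape-level sketch of the architecture; the sizes below (vocab 27, context 3, embedding dim 10, hidden 200) are illustrative, not necessarily the lecture's exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, context, emb_dim, hidden, batch = 27, 3, 10, 200, 32

emb = nn.Embedding(vocab, emb_dim)              # learnable char -> vector lookup
mlp = nn.Sequential(
    nn.Linear(context * emb_dim, hidden),
    nn.Tanh(),
    nn.Linear(hidden, vocab),                   # logits over the vocabulary
)

x = torch.randint(0, vocab, (batch, context))   # indices of the previous 3 chars
e = emb(x)                                      # (batch, context, emb_dim) = (32, 3, 10)
logits = mlp(e.view(batch, -1))                 # concat -> (32, 30) -> (32, 27)
y = torch.randint(0, vocab, (batch,))           # next-char targets
loss = F.cross_entropy(logits, y)               # softmax + NLL in one call
```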
Part 3-Sampling experiments (45 min)¶
Add temperature and top-k sampling to your model:
```python
import torch

def sample_with_temp(probs, temperature=1.0, top_k=None):
    # probs: 1-D tensor of next-char probabilities from the model
    logits = torch.log(probs + 1e-12) / temperature
    if top_k is not None:
        # keep only the top-k logits; everything else becomes -inf
        topk_vals, topk_idx = logits.topk(top_k)
        mask = torch.full_like(logits, float('-inf'))
        mask.scatter_(0, topk_idx, topk_vals)
        logits = mask
    return torch.softmax(logits, dim=-1).multinomial(1)
```
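A quick usage sketch, assuming the bigram matrix `P` and the `stoi` mapping from the earlier count-matrix sketch (both illustrative, not the lecture's exact code):

```python
# Hypothetical usage: walk the bigram chain from the '.' token until it
# is sampled again, at several temperatures.
itos = {i: ch for ch, i in stoi.items()}
for t in (0.5, 1.0, 2.0):
    ix, out = 0, []
    while True:
        ix = sample_with_temp(P[ix], temperature=t).item()
        if ix == 0:
            break
        out.append(itos[ix])
    print(t, "".join(out))
```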
Sample 20 names at temperatures 0.5, 1.0, and 2.0. Notice:
- Low temp: repetitive, conservative, often boring.
- Temp 1: balanced.
- High temp: creative, sometimes nonsense.
Output of Session A¶
- `02-bigram.ipynb` and `03-mlp-lm.ipynb` in `transformer-from-scratch`.
- Sample names at multiple temperatures committed.
Session B-Initialization, batch norm, and diagnostic discipline¶
Goal: Watch and implement Karpathy lecture 4. Internalize what to look at during training.
Part 1-Lecture 4 watch + code along (120 min)¶
Karpathy lecture 4: makemore part 3, on activations, gradients, and BatchNorm.
This is one of the most important pedagogical hours on the internet for ML engineers. Pay attention.
Key concepts:
1. Activation distribution: healthy nets have activation magnitudes that don't explode or vanish.
2. Gradient distribution: same-gradients should not blow up or die.
3. Saturation: when tanh/sigmoid activations are stuck near ±1, gradient ≈ 0 → no learning.
4. Initialization scaling: weight standard deviation should scale with 1/√fan_in (Xavier/Glorot) or √(2/fan_in) (Kaiming for ReLU). Otherwise activations explode or saturate as they pass through layers (see the sketch after this list).
5. Batch normalization: re-center and re-scale activations within a mini-batch. Stabilizes training; allows larger learning rates.
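To see the scaling effect directly, a toy forward pass through a deep tanh stack makes the point; the width (200), depth (10), and batch size here are illustrative choices, not values from the lecture:

```python
import torch

torch.manual_seed(0)
fan_in, depth = 200, 10
x = torch.randn(1000, fan_in)

for scale, label in [(1.0, "N(0,1) init"), (fan_in ** -0.5, "1/sqrt(fan_in) init")]:
    h = x
    for _ in range(depth):
        W = torch.randn(fan_in, fan_in) * scale
        h = torch.tanh(h @ W)
    print(f"{label}: activation std={h.std():.3f}, "
          f"saturated={(h.abs() > 0.99).float().mean():.3f}")
```

With the unscaled init almost every unit ends up pinned near ±1; with 1/√fan_in the distribution stays in the responsive range of tanh.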
Type along. Build the diagnostics into your training loop.
Part 2-Run experiments (90 min)¶
Experiment 1: bad init
Initialize all weights from N(0, 1) (no scaling). Train. Watch loss explode or stagnate. Plot activation histograms across layers and observe the saturation.
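A hedged plotting sketch for those histograms, assuming you have collected each layer's post-tanh activations into a list called `activations` (for example via forward hooks):

```python
import torch
import matplotlib.pyplot as plt

# activations: list of per-layer activation tensors captured in one forward pass
for i, a in enumerate(activations):
    hist = torch.histogram(a.detach().float().cpu().flatten(), bins=50, density=True)
    plt.plot(hist.bin_edges[:-1], hist.hist, label=f"layer {i}")
plt.legend()
plt.title("activation distribution per layer")
plt.show()
```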
Experiment 2: scaled init
Re-init with Kaiming. Train. Loss decreases stably. Histograms healthy.
Experiment 3: batch norm
Add `nn.BatchNorm1d` to your MLP. Compare convergence speed to the no-BN version.
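A minimal sketch of where the BN layer slots in, reusing the illustrative widths from the MLP sketch above:

```python
import torch.nn as nn

mlp_bn = nn.Sequential(
    nn.Linear(3 * 10, 200, bias=False),  # BN adds its own shift, so the Linear bias is redundant
    nn.BatchNorm1d(200),
    nn.Tanh(),
    nn.Linear(200, 27),
)
```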
```python
import torch
import torch.nn.functional as F

# Diagnostic during training
def log_diagnostics(model, x, y, step):
    out = model(x)
    # activation stats (assumes forward hooks cached each layer's output
    # on a `_activation` attribute; see the hook sketch below)
    for name, layer in model.named_modules():
        if hasattr(layer, '_activation'):
            a = layer._activation
            print(f"{step} {name}: mean={a.mean():.3f} std={a.std():.3f} "
                  f"saturated={(a.abs() > 0.99).float().mean():.3f}")
    # gradient stats after backward on the actual training loss
    model.zero_grad()
    loss = F.cross_entropy(out, y)
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"  grad {name}: mean={p.grad.mean():.3e} "
                  f"std={p.grad.std():.3e}")
```
Part 3-Reflect (30 min)¶
Write 200 words in your repo: "What I now check first when training feels off."
Likely answers:
- Activation magnitudes per layer.
- Gradient magnitudes per parameter group.
- Initialization scale.
- Loss curve smoothness.
- Learning rate magnitude.
This is the diagnostic toolkit you'll bring to every future training run.
Output of Session B¶
- `04-init-and-batchnorm.ipynb` with three experiments.
- Reusable diagnostic helper function.
- Reflection note.
Session C-Polish, recall, ship¶
Goal: Polish notebooks. Self-test that the lectures stuck. Push.
Part 1-Notebook polish (45 min)¶
Add markdown explaining each cell:
- The goal of the experiment.
- What you observed.
- What you learned.
The notebook should read as a self-contained tutorial a stranger could learn from.
Part 2-Recall test (60 min)¶
No peeking. On paper.
1. Sketch the architecture of a 3-character-context MLP language model. Label every shape.
2. Why does Kaiming init use √(2/fan_in) and not √(1/fan_in)? (Hint: ReLU kills half the activations.)
3. What does batch norm do during training? During eval? (They differ.)
4. Why does softmax + cross-entropy combine elegantly?
Compare to your notes. Where you drifted, re-watch the relevant lecture clip.
Part 3-Push + Q3 transformer prep (45 min)¶
- Push everything to repo. README updated to reflect lectures 2–4 done.
- Pre-read Jay Alammar's Illustrated Transformer. One pass, no notes, just orient. (~30 min read.)
- Skim the Attention Is All You Need abstract plus sections 3.1–3.2 only. Set the expectation that next week is intense.
Output of Session C¶
- Polished notebooks committed.
- Recall test answers.
- Pre-read of next week's material done.
End-of-week artifact¶
- `02-bigram.ipynb`, `03-mlp-lm.ipynb`, `04-init-and-batchnorm.ipynb` complete
- Sample outputs from your character LM in the README
- Reusable diagnostic helper function
- Reflection note on training diagnostics
End-of-week self-assessment¶
- I can sketch an MLP language model from a blank page.
- I can explain why initialization scaling matters.
- I know what to check first when training is unstable.
- I am mentally ready for the transformer build week.
Common failure modes for this week¶
- Watching without typing. The whole point is muscle memory.
- Skipping lecture 4 because "init seems boring." It's the diagnostic education that pays back forever.
- Not sampling from your model. Sampling is the most fun part of LM work; do it.
What's next (preview of M03-W02)¶
The single most important week of your year. Karpathy lecture 6-building GPT from scratch. By Sunday you implement self-attention, multi-head attention, and a working transformer language model. Block your calendar accordingly.