Month 3-Week 1: Karpathy makemore-bigrams to MLP, with diagnostic discipline¶
Week summary¶
- Goal: Complete Karpathy Zero to Hero lectures 2, 3, 4 (`makemore` parts 1–3): bigram model, MLP language model, and the diagnostic + initialization deep-dive. Build a character-level LM that produces plausible name-like strings.
- Time: ~10–12 h over 3 sessions (this is intentionally heavy; transformer week is next).
- Output: `transformer-from-scratch` updated with bigram + MLP-LM + initialization-experiments notebooks.
- Sequences relied on: 08-transformers rungs 01, 02; 07-deep-learning rungs 01, 03, 04; 03-probability-statistics rungs 04, 06.
Why this week matters¶
Karpathy's Zero to Hero is the best language-modeling pedagogy that exists. Lectures 2–4 are the runway to lecture 6 (the transformer build). Skipping them and jumping directly to attention causes confusion that cascades for weeks. This week pays the runway tax.
The diagnostic discipline taught in lecture 4 (looking at activations, gradients, and weight statistics during training to catch problems early) is what separates engineers who ship working models from those who train mysteries. You'll need it for every transformer you train.
Prerequisites¶
- M01 complete (manual MLP backprop).
- M02 complete (PyTorch fluency).
- M02-W04 lecture 2 may be done already; otherwise start there.
Recommended cadence¶
- Session A-Tue/Wed evening (~3.5 h): finish lectures 2 + 3
- Session B-Sat morning (~4 h): lecture 4 (initialization deep-dive)
- Session C-Sun afternoon (~3 h): diagnostic experiments + ship
Session A-Bigram + MLP language model¶
Goal: Finish Karpathy lectures 2 and 3; ship a character-level LM that produces non-random samples.
Part 1-Lecture 2 finish (60 min)¶
If you started in M02-W04, finish; if not, do the whole thing now.
Karpathy Zero to Hero lecture 2: makemore part 1, the bigram model.
- A bigram model is P(next_char | current_char): a `vocab × vocab` count matrix turned into probabilities.
- The "model" is just a lookup table. No weights, no training. Yet it captures meaningful structure.
- Why this matters: it sets the bar for what a "real" model needs to beat.
Run the notebook. Sample 20 names. Notice they're better than random but obviously bad.
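To make the lookup-table point concrete, here is a minimal sketch of the count-matrix idea; the tiny `names` list and the '.' start/end token convention are illustrative stand-ins, not the lecture's exact code:

```python
import torch

names = ["emma", "olivia", "ava"]            # tiny stand-in for the real dataset
chars = sorted(set("".join(names)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                                # start/end token
vocab = len(stoi)

# count character pairs
counts = torch.zeros((vocab, vocab), dtype=torch.int32)
for name in names:
    seq = ["."] + list(name) + ["."]
    for a, b in zip(seq, seq[1:]):
        counts[stoi[a], stoi[b]] += 1

# add-one smoothing, then normalize each row into P(next_char | current_char)
P = (counts + 1).float()
P = P / P.sum(dim=1, keepdim=True)
```

Sampling from this "model" is nothing more than repeated row lookups and draws, which is exactly why it sets a useful baseline.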
Part 2-Lecture 3 (90 min)¶
Karpathy lecture 3: makemore part 2, the MLP language model.
- Switch from a bigram to a context window: predict next char from previous N chars (e.g., 3 chars).
- Use embeddings (a learnable lookup table mapping char → vector). This is the first time you use them.
- Concatenate context embeddings → MLP → softmax over vocabulary.
Type along. Train. Sample.
Why embeddings matter (the rung you need to internalize): in a bigram, "a" and "b" are unrelated atoms. With learned embeddings, similar characters get similar vectors. Generalization improves. Token embeddings in transformers are the same idea, scaled up.
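A shape-level sketch of the architecture; the sizes below (vocab 27, context 3, embedding dim 10, hidden 200) are illustrative, not necessarily the lecture's exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, context, emb_dim, hidden, batch = 27, 3, 10, 200, 32

emb = nn.Embedding(vocab, emb_dim)              # learnable char -> vector lookup
mlp = nn.Sequential(
    nn.Linear(context * emb_dim, hidden),
    nn.Tanh(),
    nn.Linear(hidden, vocab),                   # logits over the vocabulary
)

x = torch.randint(0, vocab, (batch, context))   # indices of the previous 3 chars
e = emb(x)                                      # (batch, context, emb_dim) = (32, 3, 10)
logits = mlp(e.view(batch, -1))                 # concat -> (32, 30) -> (32, 27)
y = torch.randint(0, vocab, (batch,))           # next-char targets
loss = F.cross_entropy(logits, y)               # softmax + NLL in one call
```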
Part 3-Sampling experiments (45 min)¶
Add temperature and top-k sampling to your model:
```python
import torch

def sample_with_temp(probs, temperature=1.0, top_k=None):
    # probs: 1-D tensor of next-char probabilities from the model
    logits = torch.log(probs + 1e-12) / temperature
    if top_k is not None:
        # keep only the top-k logits; everything else becomes -inf
        topk_vals, topk_idx = logits.topk(top_k)
        mask = torch.full_like(logits, float('-inf'))
        mask.scatter_(0, topk_idx, topk_vals)
        logits = mask
    return torch.softmax(logits, dim=-1).multinomial(1)
```
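A quick usage sketch, assuming the bigram matrix `P` and the `stoi` mapping from the earlier count-matrix sketch (both illustrative, not the lecture's exact code):

```python
# Hypothetical usage: walk the bigram chain from the '.' token until it
# is sampled again, at several temperatures.
itos = {i: ch for ch, i in stoi.items()}
for t in (0.5, 1.0, 2.0):
    ix, out = 0, []
    while True:
        ix = sample_with_temp(P[ix], temperature=t).item()
        if ix == 0:
            break
        out.append(itos[ix])
    print(t, "".join(out))
```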
Sample 20 names at temperatures 0.5, 1.0, and 2.0. Notice:
- Low temp: repetitive, conservative, often boring.
- Temp 1: balanced.
- High temp: creative, sometimes nonsense.
Output of Session A¶
- `02-bigram.ipynb` and `03-mlp-lm.ipynb` in `transformer-from-scratch`.
- Sample names at multiple temperatures committed.
Session B-Initialization, batch norm, and diagnostic discipline¶
Goal: Watch and implement Karpathy lecture 4. Internalize what to look at during training.
Part 1-Lecture 4 watch + code along (120 min)¶
Karpathy lecture 4: makemore part 3, on activations, gradients, and BatchNorm.
This is one of the most important pedagogical hours on the internet for ML engineers. Pay attention.
Key concepts:
1. Activation distribution: healthy nets have activation magnitudes that don't explode or vanish.
2. Gradient distribution: same-gradients should not blow up or die.
3. Saturation: when tanh/sigmoid activations are stuck near ±1, gradient ≈ 0 → no learning.
4. Initialization scaling: weight standard deviation should scale with 1/√fan_in (Xavier/Glorot) or √(2/fan_in) (Kaiming for ReLU). Otherwise activations explode or saturate as they pass through layers (see the sketch after this list).
5. Batch normalization: re-center and re-scale activations within a mini-batch. Stabilizes training; allows larger learning rates.
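To see the scaling effect directly, a toy forward pass through a deep tanh stack makes the point; the width (200), depth (10), and batch size here are illustrative choices, not values from the lecture:

```python
import torch

torch.manual_seed(0)
fan_in, depth = 200, 10
x = torch.randn(1000, fan_in)

for scale, label in [(1.0, "N(0,1) init"), (fan_in ** -0.5, "1/sqrt(fan_in) init")]:
    h = x
    for _ in range(depth):
        W = torch.randn(fan_in, fan_in) * scale
        h = torch.tanh(h @ W)
    print(f"{label}: activation std={h.std():.3f}, "
          f"saturated={(h.abs() > 0.99).float().mean():.3f}")
```

With the unscaled init almost every unit ends up pinned near ±1; with 1/√fan_in the distribution stays in the responsive range of tanh.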
Type along. Build the diagnostics into your training loop.
Part 2-Run experiments (90 min)¶
Experiment 1: bad init
Initialize all weights from N(0, 1) (no scaling). Train. Watch loss explode or stagnate. Plot activation histograms across layers and observe the saturation.
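A hedged plotting sketch for those histograms, assuming you have collected each layer's post-tanh activations into a list called `activations` (for example via forward hooks):

```python
import torch
import matplotlib.pyplot as plt

# activations: list of per-layer activation tensors captured in one forward pass
for i, a in enumerate(activations):
    hist = torch.histogram(a.detach().float().cpu().flatten(), bins=50, density=True)
    plt.plot(hist.bin_edges[:-1], hist.hist, label=f"layer {i}")
plt.legend()
plt.title("activation distribution per layer")
plt.show()
```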
Experiment 2: scaled init
Re-init with Kaiming. Train. Loss decreases stably. Histograms healthy.
Experiment 3: batch norm
Add `nn.BatchNorm1d` to your MLP. Compare convergence speed to the no-BN version.
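A minimal sketch of where the BN layer slots in, reusing the illustrative widths from the MLP sketch above:

```python
import torch.nn as nn

mlp_bn = nn.Sequential(
    nn.Linear(3 * 10, 200, bias=False),  # BN adds its own shift, so the Linear bias is redundant
    nn.BatchNorm1d(200),
    nn.Tanh(),
    nn.Linear(200, 27),
)
```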
```python
import torch
import torch.nn.functional as F

# Diagnostic during training
def log_diagnostics(model, x, y, step):
    out = model(x)
    # activation stats (assumes forward hooks cached each layer's output
    # on a `_activation` attribute; see the hook sketch below)
    for name, layer in model.named_modules():
        if hasattr(layer, '_activation'):
            a = layer._activation
            print(f"{step} {name}: mean={a.mean():.3f} std={a.std():.3f} "
                  f"saturated={(a.abs() > 0.99).float().mean():.3f}")
    # gradient stats after backward on the actual training loss
    model.zero_grad()
    loss = F.cross_entropy(out, y)
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"  grad {name}: mean={p.grad.mean():.3e} "
                  f"std={p.grad.std():.3e}")
```
Part 3-Reflect (30 min)¶
Write 200 words in your repo: "What I now check first when training feels off."
Likely answers:
- Activation magnitudes per layer.
- Gradient magnitudes per parameter group.
- Initialization scale.
- Loss curve smoothness.
- Learning rate magnitude.
This is the diagnostic toolkit you'll bring to every future training run.
Output of Session B¶
- `04-init-and-batchnorm.ipynb` with three experiments.
- Reusable diagnostic helper function.
- Reflection note.
Session C-Polish, recall, ship¶
Goal: Polish notebooks. Self-test that the lectures stuck. Push.
Part 1-Notebook polish (45 min)¶
Add markdown explaining each cell:
- The goal of the experiment.
- What you observed.
- What you learned.
The notebook should read as a self-contained tutorial a stranger could learn from.
Part 2-Recall test (60 min)¶
No peeking. On paper.
1. Sketch the architecture of a 3-character-context MLP language model. Label every shape.
2. Why does Kaiming init use √(2/fan_in) and not √(1/fan_in)? (Hint: ReLU kills half the activations.)
3. What does batch norm do during training? During eval? (They differ.)
4. Why does softmax + cross-entropy combine elegantly?
Compare to your notes. Where you drifted, re-watch the relevant lecture clip.
Part 3-Push + Q3 transformer prep (45 min)¶
- Push everything to repo. README updated to reflect lectures 2–4 done.
- Pre-read Jay Alammar's Illustrated Transformer. One pass, no notes, just orient. (~30 min read.)
- Skim the Attention Is All You Need abstract plus sections 3.1–3.2 only. Set the expectation that next week is intense.
Output of Session C¶
- Polished notebooks committed.
- Recall test answers.
- Pre-read of next week's material done.
End-of-week artifact¶
- `02-bigram.ipynb`, `03-mlp-lm.ipynb`, `04-init-and-batchnorm.ipynb` complete
- Sample outputs from your character LM in the README
- Reusable diagnostic helper function
- Reflection note on training diagnostics
End-of-week self-assessment¶
- I can sketch an MLP language model from a blank page.
- I can explain why initialization scaling matters.
- I know what to check first when training is unstable.
- I am mentally ready for the transformer build week.
Common failure modes for this week¶
- Watching without typing. The whole point is muscle memory.
- Skipping lecture 4 because "init seems boring." It's the diagnostic education that pays back forever.
- Not sampling from your model. Sampling is the most fun part of LM work; do it.
What's next (preview of M03-W02)¶
The single most important week of your year. Karpathy lecture 6-building GPT from scratch. By Sunday you implement self-attention, multi-head attention, and a working transformer language model. Block your calendar accordingly.