
Month 3, Week 1: Karpathy makemore (bigrams to MLP), with diagnostic discipline

Week summary

  • Goal: Complete Karpathy Zero to Hero lectures 2, 3, 4 (makemore parts 1–3): bigram model, MLP language model, and the diagnostic + initialization deep-dive. Build a character-level LM that produces plausible name-like strings.
  • Time: ~10–12 h over 3 sessions (this is intentionally heavy; transformer week is next).
  • Output: transformer-from-scratch updated with bigram + MLP-LM + initialization-experiments notebooks.
  • Sequences relied on: 08-transformers rungs 01, 02; 07-deep-learning rungs 01, 03, 04; 03-probability-statistics rungs 04, 06.

Why this week matters

Karpathy's Zero to Hero is the best language-modeling pedagogy that exists. Lectures 2–4 are the runway to lecture 6 (the transformer build). Skipping them and jumping directly to attention causes confusion that cascades for weeks. This week pays the runway tax.

The diagnostic discipline taught in lecture 4 (looking at activations, gradients, and weight statistics during training to catch problems early) is what separates engineers who ship working models from those who train mysteries they can't debug. You'll need it for every transformer you train.

Prerequisites

  • M01 complete (manual MLP backprop).
  • M02 complete (PyTorch fluency).
  • M02-W04 lecture 2 may be done already; otherwise start there.

Schedule

  • Session A (Tue/Wed evening, ~3.5 h): finish lectures 2 + 3.
  • Session B (Sat morning, ~4 h): lecture 4 (initialization deep-dive).
  • Session C (Sun afternoon, ~3 h): diagnostic experiments + ship.

Session A: Bigram + MLP language model

Goal: Finish Karpathy lectures 2 and 3; ship a character-level LM that produces non-random samples.

Part 1: Lecture 2 finish (60 min)

If you started in M02-W04, finish; if not, do the whole thing now.

Karpathy Zero to Hero lecture 2: makemore part 1, the bigram model.

  • A bigram model is P(next_char | current_char): a vocab × vocab count matrix turned into probabilities.
  • The "model" is just a lookup table. No weights, no training. Yet it captures meaningful structure.
  • Why this matters: it sets the bar for what a "real" model needs to beat.

Run the notebook. Sample 20 names. Notice they're better than random but obviously bad.
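
To see how little machinery the bigram "model" needs, here is a minimal sketch in the spirit of the lecture (assuming words is a list of lowercase names, as in the lecture's dataset; details of the notebook's version may differ):

import torch

# Character vocabulary with '.' as a combined start/end token.
chars = ['.'] + sorted(set(''.join(words)))
stoi = {c: i for i, c in enumerate(chars)}

# N[i, j] counts how often character j follows character i.
N = torch.zeros(len(chars), len(chars), dtype=torch.int32)
for w in words:
    padded = ['.'] + list(w) + ['.']
    for c1, c2 in zip(padded, padded[1:]):
        N[stoi[c1], stoi[c2]] += 1

# Each row normalized to a probability distribution over the next character.
P = (N + 1).float()                      # +1 smoothing avoids zero probabilities
P /= P.sum(dim=1, keepdim=True)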

Part 2: Lecture 3 (90 min)

Karpathy lecture 3: makemore part 2, the MLP language model.

  • Switch from a bigram to a context window: predict the next char from the previous N chars (e.g., 3 chars).
  • Use embeddings (a learnable lookup table mapping char → vector). This is the first time you use them.
  • Concatenate context embeddings → MLP → softmax over the vocabulary.

Type along. Train. Sample.

Why embeddings matter (the rung you need to internalize): in a bigram, "a" and "b" are unrelated atoms. With learned embeddings, similar characters get similar vectors. Generalization improves. Token embeddings in transformers are the same idea, scaled up.
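
To fix the shapes in your head, here is a minimal forward-pass sketch with illustrative sizes (27-char vocabulary, 3-char context); the lecture's actual model differs in details but follows the same structure:

import torch
import torch.nn.functional as F

vocab_size, context, emb_dim, hidden = 27, 3, 10, 200    # illustrative sizes

C  = torch.randn(vocab_size, emb_dim)                    # embedding table: char index -> vector
W1 = torch.randn(context * emb_dim, hidden) * 0.1        # hidden layer
b1 = torch.zeros(hidden)
W2 = torch.randn(hidden, vocab_size) * 0.01              # output layer
b2 = torch.zeros(vocab_size)

def forward(X):                          # X: (batch, context) of char indices
    emb = C[X]                           # (batch, context, emb_dim) embedding lookup
    h = torch.tanh(emb.view(X.shape[0], -1) @ W1 + b1)   # concatenate context, then hidden layer
    return h @ W2 + b2                   # logits: (batch, vocab_size)

# Training minimizes cross-entropy over mini-batches (fake batch shown here).
Xb = torch.randint(0, vocab_size, (32, context))
Yb = torch.randint(0, vocab_size, (32,))
loss = F.cross_entropy(forward(Xb), Yb)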

Part 3: Sampling experiments (45 min)

Add temperature and top-k sampling to your model:

import torch

def sample_with_temp(probs, temperature=1.0, top_k=None):
    # Convert probabilities back to logits and rescale by temperature.
    logits = torch.log(probs + 1e-12) / temperature
    if top_k is not None:
        # Keep only the top_k largest logits; everything else gets -inf (zero probability).
        topk_vals, topk_idx = logits.topk(top_k)
        mask = torch.full_like(logits, float('-inf'))
        mask.scatter_(0, topk_idx, topk_vals)
        logits = mask
    # Renormalize and draw one sample index.
    return torch.softmax(logits, dim=-1).multinomial(1)

Sample 20 names at temperature 0.5, 1.0, 2.0. Notice:

  • Low temp: repetitive, conservative, often boring.
  • Temp 1: balanced.
  • High temp: creative, sometimes nonsense.
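
You can also see the temperature effect on a toy distribution before wiring it into your model (the probabilities below are made up, purely illustrative; assumes the sample_with_temp function above):

probs = torch.tensor([0.6, 0.25, 0.1, 0.05])       # toy next-char distribution
for t in (0.5, 1.0, 2.0):
    draws = torch.stack([sample_with_temp(probs, temperature=t) for _ in range(1000)])
    counts = torch.bincount(draws.squeeze(), minlength=len(probs))
    print(f"T={t}: {counts.tolist()}")              # low T concentrates on the argmax, high T flattens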

Output of Session A

  • 02-bigram.ipynb and 03-mlp-lm.ipynb in transformer-from-scratch.
  • Sample names at multiple temperatures committed.

Session B: Initialization, batch norm, and diagnostic discipline

Goal: Watch and implement Karpathy lecture 4. Internalize what to look at during training.

Part 1: Lecture 4 watch + code along (120 min)

Karpathy lecture 4: makemore part 3 (activations & gradients, BatchNorm).

This is one of the most important pedagogical hours on the internet for ML engineers. Pay attention.

Key concepts:

  1. Activation distribution: healthy nets have activation magnitudes that don't explode or vanish.
  2. Gradient distribution: the same story; gradients should not blow up or die.
  3. Saturation: when tanh/sigmoid activations are stuck near ±1, the gradient is ≈ 0, so no learning happens.
  4. Initialization scaling: weight standard deviation should scale with 1/√fan_in (Xavier/Glorot) or √(2/fan_in) (Kaiming, for ReLU). Otherwise activation magnitudes explode or collapse as they pass through layers (see the sketch after this list).
  5. Batch normalization: re-center and re-scale activations within a mini-batch. Stabilizes training and allows larger learning rates.
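
As a quick sanity check of point 4 (a minimal sketch, not from the lecture; sizes are arbitrary), compare what a single layer does to the activation scale under each initialization:

import torch

fan_in, fan_out = 200, 200
x = torch.randn(1000, fan_in)                     # unit-variance inputs

inits = {
    'unscaled N(0,1)':        torch.randn(fan_in, fan_out),                        # too large
    'xavier 1/sqrt(fan_in)':  torch.randn(fan_in, fan_out) / fan_in**0.5,          # for tanh-like nets
    'kaiming sqrt(2/fan_in)': torch.randn(fan_in, fan_out) * (2 / fan_in)**0.5,    # for ReLU nets
}
for name, W in inits.items():
    h = torch.relu(x @ W)
    print(f"{name:24s} activation std after one ReLU layer: {h.std():.2f}")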

Type along. Build the diagnostics into your training loop.

Part 2: Run experiments (90 min)

Experiment 1: bad init. Initialize all weights from N(0, 1) with no scaling. Train. Watch the loss explode or stagnate. Plot activation histograms across layers and observe the saturation.

Experiment 2: scaled init. Re-initialize with Kaiming scaling. Train. The loss should decrease stably and the histograms should look healthy.

Experiment 3: batch norm. Add nn.BatchNorm1d to your MLP. Compare convergence speed against the no-BN version.
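
For experiment 3, a minimal sketch of where nn.BatchNorm1d sits in the stack (layer sizes are illustrative, not the lecture's exact model):

import torch.nn as nn

mlp_bn = nn.Sequential(
    nn.Linear(30, 200, bias=False),   # bias is redundant right before BatchNorm
    nn.BatchNorm1d(200),              # normalizes each hidden unit across the batch
    nn.Tanh(),
    nn.Linear(200, 27),
)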

# Diagnostic during training
import torch
import torch.nn.functional as F

def log_diagnostics(model, x, y, step):
    # Assumes each layer of interest stashes its last output in `_activation`
    # (e.g. via a forward hook; see the sketch below).
    model.zero_grad()
    out = model(x)
    # activation stats
    for name, layer in model.named_modules():
        if hasattr(layer, '_activation'):
            a = layer._activation
            print(f"{step} {name}: mean={a.mean():.3f} std={a.std():.3f} "
                  f"saturated={(a.abs() > 0.99).float().mean():.3f}")
    # gradient stats after backward
    loss = F.cross_entropy(out, y)
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"  grad {name}: mean={p.grad.mean():.3e} "
                  f"std={p.grad.std():.3e}")

Part 3: Reflect (30 min)

Write 200 words in your repo: "What I now check first when training feels off."

Likely answers:

  • Activation magnitudes per layer.
  • Gradient magnitudes per parameter group.
  • Initialization scale.
  • Loss curve smoothness.
  • Learning rate magnitude.

This is the diagnostic toolkit you'll bring to every future training run.

Output of Session B

  • 04-init-and-batchnorm.ipynb with three experiments.
  • Reusable diagnostic helper function.
  • Reflection note.

Session C: Polish, recall, ship

Goal: Polish notebooks. Self-test that the lectures stuck. Push.

Part 1: Notebook polish (45 min)

Add markdown explaining each cell:

  • The goal of the experiment.
  • What you observed.
  • What you learned.

The notebook should read as a self-contained tutorial a stranger could learn from.

Part 2: Recall test (60 min)

No peeking. On paper:

  1. Sketch the architecture of a 3-character-context MLP language model. Label every shape.
  2. Why does Kaiming init use √(2/fan_in) and not √(1/fan_in)? (Hint: ReLU kills half the activations.)
  3. What does batch norm do during training? During eval? (They differ.)
  4. Why do softmax + cross-entropy combine elegantly?

Compare to your notes. Where you drifted, re-watch the relevant lecture clip.

Part 3: Push + Q3 transformer prep (45 min)

  • Push everything to repo. README updated to reflect lectures 2–4 done.
  • Pre-read Jay Alammar's Illustrated Transformer. One pass, no notes, just to orient. (~30 min read.)
  • Skim Attention Is All You Need abstract + sections 3.1–3.2 only. Set expectation that next week is intense.

Output of Session C

  • Polished notebooks committed.
  • Recall test answers.
  • Pre-read of next week's material done.

End-of-week artifact

  • 02-bigram.ipynb, 03-mlp-lm.ipynb, 04-init-and-batchnorm.ipynb complete
  • Sample outputs from your character LM in README
  • Reusable diagnostic helper function
  • Reflection note on training diagnostics

End-of-week self-assessment

  • I can sketch an MLP language model from a blank page.
  • I can explain why initialization scaling matters.
  • I know what to check first when training is unstable.
  • I am mentally ready for the transformer build week.

Common failure modes for this week

  • Watching without typing. The whole point is muscle memory.
  • Skipping lecture 4 because "init seems boring." It's the diagnostic education that pays back forever.
  • Not sampling from your model. Sampling is the most fun part of LM work; do it.

What's next (preview of M03-W02)

The single most important week of your year. Karpathy lecture 6: building GPT from scratch. By Sunday you implement self-attention, multi-head attention, and a working transformer language model. Block your calendar accordingly.
