Month 3-Week 3: nanoGPT, BPE, sampling strategies¶
Week summary¶
- Goal: Reproduce nanoGPT end-to-end on TinyStories. Watch Karpathy's tokenizer lecture and implement BPE on a small corpus. Implement and compare top-k, top-p (nucleus), and temperature sampling.
- Time: ~10 h over 3 sessions.
- Output: Trained nanoGPT on TinyStories with samples; from-scratch BPE; sampling experiments.
- Sequences relied on: 08-transformers rungs 01, 08, 09; 05-pytorch rungs 06, 09; 03-probability-statistics rung 08.
Why this week matters¶
nanoGPT is the production-style transformer reference implementation that thousands of researchers use as their starting point. Reading it teaches you research-grade PyTorch: efficient masking, weight tying, FlashAttention integration, distributed training hooks. Reading research code is a skill that compounds for years.
The tokenizer lecture is one of those "secret unlocks"-it explains why LLMs fail at character-counting, why some prompts produce surprising outputs, and why "tokens" are not "characters." Most engineers never learn this; you will.
Prerequisites¶
- M03-W02 complete (transformer from scratch).
- Self-attention implementable from blank file.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): nanoGPT walkthrough + train
- Session B-Sat morning (~3.5 h): tokenizer lecture + BPE
- Session C-Sun afternoon (~3 h): sampling strategies + ship
Session A-nanoGPT: read and reproduce¶
Goal: Read nanoGPT source code carefully. Train it on TinyStories.
Part 1-Read nanoGPT (60 min)¶
Read these files in this order, line-by-line:
1. model.py - the architecture. Compare to your W02 implementation.
2. train.py - the training loop. Note: gradient accumulation, AMP, distributed support.
3. data/shakespeare_char/prepare.py - tokenization pipeline.
Take notes on what nanoGPT does that your W02 implementation didn't:
- Weight tying (input embedding ≡ output projection)?
- FlashAttention via F.scaled_dot_product_attention?
- Mixed precision via torch.amp.autocast?
- Gradient accumulation for effective large batches?
These are the production differences.
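If any of these are new to you, here is a minimal sketch of two of them, weight tying and the FlashAttention path (illustrative names and shapes, not nanoGPT verbatim):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedLM(nn.Module):
    # Weight tying: the input embedding and the output projection share one
    # matrix, so gradients from both roles accumulate in the same tensor.
    def __init__(self, vocab_size, n_embd):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight

# FlashAttention path: is_causal=True applies the causal mask inside the fused
# kernel, so no (seq, seq) mask tensor is ever materialized.
q = k = v = torch.randn(1, 6, 256, 64)  # (batch, n_head, seq_len, head_dim)
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)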
Part 2-Set up TinyStories (30 min)¶
TinyStories is a synthetic dataset of children's stories. Crucially, even small models trained on it generate coherent text, which makes it perfect for a reproduction with limited compute.
from datasets import load_dataset
ds = load_dataset("roneneldan/TinyStories")
# Save text as a single file for nanoGPT's prepare script
with open('input.txt', 'w') as f:
    for item in ds['train']:
        f.write(item['text'] + '\n')
Part 3-Train nanoGPT (90 min)¶
Configure for a small run that fits in under 2 hours on a single GPU (a Colab T4 is fine):
- 6 layers, 6 heads, 384 embedding dim
- block_size=256
- ~30M parameters
# config/train_tinystories.py
out_dir = 'out-tinystories'
n_layer = 6
n_head = 6
n_embd = 384
block_size = 256
batch_size = 64
max_iters = 5000
learning_rate = 6e-4
Run:
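Assuming nanoGPT's standard CLI, where train.py takes the config file as its first argument:

python train.py config/train_tinystories.py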
Watch loss go down. After training, sample:
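Assuming the stock sample.py interface, pointed at your checkpoint directory:

python sample.py --out_dir=out-tinystories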
Are the stories coherent? At this scale they should be simple but coherent; TinyStories is designed for exactly this.
Output of Session A¶
- Trained nanoGPT model on TinyStories.
- Sample stories committed to README.
- Notes on what nanoGPT does differently from your W02 build.
Session B-Karpathy tokenizer lecture + BPE from scratch¶
Goal: Watch lecture 7. Implement BPE on a small corpus. Understand why "How many e's in 'cheese'?" is hard for LLMs.
Part 1-Watch lecture 7 (~120 min)¶
Karpathy Zero to Hero Lecture 7: "Let's build the GPT Tokenizer".
Key concepts:
1. Why tokens, not characters: shorter sequences, faster training, better generalization.
2. Why tokens, not words: handle any input, including unseen words.
3. The BPE algorithm: iteratively merge the most frequent pair of adjacent tokens.
4. Special tokens (BOS, EOS, PAD).
5. Common pitfalls: Unicode handling, byte-level vs. char-level, GPT-4 vs. GPT-2 tokenizer differences.
Part 2-Implement BPE (75 min)¶
Type along with Karpathy's mini-implementation, then apply it to a small corpus (a paragraph of Wikipedia, or your own writing).
# BPE training, following the lecture's mini-implementation
def get_stats(ids):
    # count frequencies of adjacent token pairs
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` with the single token `new_id`
    new = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i+1]) == pair:
            new.append(new_id); i += 2
        else:
            new.append(ids[i]); i += 1
    return new

# train: start from raw bytes (ids 0-255) and merge upward
ids = list(text.encode('utf-8'))  # `text` is your small corpus string
vocab_size = 300
n_merges = vocab_size - 256
merges = {}
for i in range(n_merges):
    stats = get_stats(ids)
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    new_id = 256 + i
    ids = merge(ids, best, new_id)
    merges[best] = new_id
Apply two trained tokenizers (vocab 256 vs 4096) to the same string and compare the resulting token counts; an encode helper for this is sketched below.
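To encode new text with a trained tokenizer, replay the learned merges in training order. A minimal sketch reusing get_stats and merge from above (this mirrors the lecture's approach):

def encode(text, merges):
    ids = list(text.encode('utf-8'))
    while len(ids) >= 2:
        stats = get_stats(ids)
        # among pairs present in the text, prefer the one learned earliest
        pair = min(stats, key=lambda p: merges.get(p, float('inf')))
        if pair not in merges:
            break  # no learned merge applies anymore
        ids = merge(ids, pair, merges[pair])
    return ids

With an empty merges dict this degenerates to raw bytes (the vocab-256 case), so the same function covers both tokenizers.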
Part 3-Why LLMs can't count characters (15 min)¶
Take the string "strawberry". Tokenize with tiktoken (the GPT-4 tokenizer):
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
tokens = enc.encode("strawberry")
print(tokens, len(tokens))  # a few subword tokens, far fewer than the 10 characters
The model never sees individual characters; it sees subword units. Asking "how many 'r's in strawberry?" is asking the model to perform arithmetic on character composition that's hidden from it.
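To see those units directly, decode each token id on its own (continuing the snippet above):

for t in tokens:
    print(t, repr(enc.decode([t])))  # each id maps to a multi-character chunk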
This is the explanation for many "weird LLM behaviors."
Output of Session B¶
- Implemented BPE on a small corpus.
- Comparison of vocab=256 vs vocab=4096 tokenization.
- One-paragraph note: "Why LLMs fail at character counting."
Session C-Sampling strategies + ship¶
Goal: Implement and compare top-k, top-p, temperature sampling. Polish notebooks. Push.
Part 1-Sampling implementations (75 min)¶
Temperature. Divide logits by T before softmax. T → 0 = argmax (deterministic); T → ∞ = uniform.
Top-k. Keep the k highest-probability tokens; renormalize the probability mass over them.
Top-p (nucleus). Keep the smallest set of tokens whose cumulative probability exceeds p; renormalize over that set.
import torch

def sample_top_p(logits, p=0.9, temperature=1.0):
    logits = logits / temperature
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # drop tokens outside the nucleus; the shift keeps the token that crosses p
    mask = cumulative > p
    mask[..., 1:] = mask[..., :-1].clone()
    mask[..., 0] = False
    sorted_probs[mask] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    sample = torch.multinomial(sorted_probs, 1)
    return sorted_idx.gather(-1, sample)
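For comparison, a matching top-k sketch (a hypothetical helper mirroring the signature above, not from any library):

def sample_top_k(logits, k=50, temperature=1.0):
    logits = logits / temperature
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # keep the k largest logits
    probs = torch.softmax(topk_vals, dim=-1)      # renormalize over the survivors
    sample = torch.multinomial(probs, 1)
    return topk_idx.gather(-1, sample)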
Read Hugging Face blog "How to generate text" (search "huggingface how to generate text"). Note: the same techniques apply to your tiny model and to GPT-4.
Part 2-Sampling experiments (45 min)¶
For your trained nanoGPT, sample with:
- Temperature ∈ {0.5, 0.8, 1.0, 1.2}
- Top-k ∈ {10, 50, 200}
- Top-p ∈ {0.7, 0.9, 0.95}
Report 5 samples per config in a table; a sweep sketch follows this list. Qualitative observations to look for:
- High temp, low top-k: incoherent.
- Low temp, high top-k: repetitive.
- Top-p is more adaptive (the kept set grows and shrinks with output entropy).
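A minimal sweep sketch for the temperature axis, assuming nanoGPT's GPT.generate(idx, max_new_tokens, temperature=..., top_k=...) method; model, idx, and decode are placeholder names from your Session A setup, and top-p should route through your sample_top_p from Part 1:

# model: trained nanoGPT; idx: prompt token ids of shape (1, t); decode: ids -> text
for t in (0.5, 0.8, 1.0, 1.2):
    for i in range(5):
        out = model.generate(idx, max_new_tokens=200, temperature=t)
        print(f'T={t}, sample {i}:', decode(out[0].tolist()))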
Part 3-Polish + push (60 min)¶
- Notebook polish for nanoGPT, BPE, sampling.
- README with sample stories, BPE comparison, sampling table.
- Push.
Read M03-W04.md to prep the Q1 capstone week.
Output of Session C¶
- Top-k and top-p sampling implementations.
- Sampling experiment table.
- Repo updated.
End-of-week artifact¶
- Trained nanoGPT on TinyStories with coherent samples
- BPE implementation on a small corpus
- Top-k and top-p sampling implemented and compared
- One-paragraph note on character-counting failure mode
End-of-week self-assessment¶
- I can read research-grade PyTorch (e.g., nanoGPT) without confusion.
- I can implement BPE training in Python.
- I can implement top-p sampling.
- I can explain LLM character-counting failures.
Common failure modes for this week¶
- Skipping the source-reading. nanoGPT is small; read every file.
- Tokenizer lecture skipped because "I'll never need this." You will. Tokenization is the substrate of every LLM-related bug.
- Sampling parameters as folklore. Pick by data, not by tradition.
What's next (preview of M03-W04)¶
Q1 capstone-modify the transformer architecturally (RMSNorm, RoPE, SwiGLU, or GQA), compare to baseline with seed variance, and publish a long-form Q1 retrospective post.