Appendix B: Build-From-Scratch Reference
A working AI systems engineer should have implemented each of the following at least once. These are the building blocks from which every modern transformer is assembled.
B.1 Tokenizer (BPE)
When: foundational; everything downstream assumes one.
Design:
- Train a byte-pair-encoding (BPE) tokenizer on a small corpus (TinyShakespeare or a Wikipedia subset).
- Implement train, encode, decode. Save/load merges.
- Test round-trip on held-out text.
Reference: Karpathy's minbpe repo. ~300 lines of Python, well-commented.
Lab outcome: internalize the input layer of every modern LLM. Understand why context-length and tokenization are coupled.
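A minimal training/encode/decode sketch in the spirit of minbpe (not its exact code); the byte-level starting vocabulary and `num_merges` are illustrative choices:

```python
# Byte-level BPE sketch: greedy pair merging over raw UTF-8 bytes.
from collections import Counter

def merge(ids, pair, new_id):
    # replace every adjacent occurrence of `pair` with `new_id`
    out, j = [], 0
    while j < len(ids):
        if j + 1 < len(ids) and (ids[j], ids[j + 1]) == pair:
            out.append(new_id); j += 2
        else:
            out.append(ids[j]); j += 1
    return out

def train_bpe(text: str, num_merges: int) -> dict:
    ids = list(text.encode("utf-8"))            # base vocabulary = 256 byte values
    merges = {}                                  # (left_id, right_id) -> new token id
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))       # count adjacent token pairs
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]        # greedily merge the most frequent pair
        merges[pair] = 256 + i
        ids = merge(ids, pair, 256 + i)
    return merges

def encode(text: str, merges: dict):
    ids = list(text.encode("utf-8"))
    for pair, new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        ids = merge(ids, pair, new_id)           # apply merges in the order they were learned
    return ids

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

The round-trip test is `decode(encode(s, merges), merges) == s` on held-out text.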
B.2 Multi-Head Attention
When: the heart of the transformer.
Design:
- Standard Q, K, V = x @ W_qkv; O = softmax(Q K^T / √d) V; out = O @ W_out.
- With causal mask for autoregressive use.
- With GQA (grouped-query attention: fewer K/V heads, shared across groups of query heads).
- With KV-cache for decode.
Reference: nanoGPT's model.py. PyTorch reference; ~50 lines.
Lab outcome: understand the operation that defines this era. Read every variant (sliding window, ALiBi, RoPE) against this baseline.
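A minimal causal implementation in PyTorch, close in spirit to nanoGPT's attention module but not its exact code; GQA and the KV-cache are additions to this skeleton:

```python
# Minimal causal multi-head self-attention (sketch; dimension names are illustrative).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_qkv = nn.Linear(d_model, 3 * d_model, bias=False)   # fused Q, K, V projection
        self.w_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                                # x: (batch, seq, d_model)
        B, T, C = x.shape
        q, k, v = self.w_qkv(x).split(C, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)   # (B, H, T, T)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        att = att.masked_fill(mask, float("-inf"))                 # causal mask
        o = F.softmax(att, dim=-1) @ v                              # (B, H, T, d_head)
        return self.w_out(o.transpose(1, 2).reshape(B, T, C))
```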
B.3 RMSNorm
When: replaces LayerNorm in modern LLMs (Llama, Qwen, etc.).
Design:
    def rmsnorm(x, weight, eps=1e-6):
        # scale x by the reciprocal RMS over the last dim, then by a learned per-channel weight
        rms = (x.pow(2).mean(-1, keepdim=True) + eps).rsqrt()
        return x * rms * weight
B.4 Rotary Position Embedding (RoPE)
When: every modern open LLM since 2022.
Design:
- Precompute cos/sin tables for all positions × half head dim.
- Apply to Q and K after the projection, before attention.
- Context extension: position interpolation (linear), NTK-aware scaling, and YaRN variants.
Reference: Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2021).
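A sketch of the precompute/apply pattern, using the interleaved even/odd pairing (conventions differ between implementations; names here are illustrative):

```python
# RoPE sketch (PyTorch), rotating each (even, odd) dimension pair; base=10000 as in the paper.
import torch

def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    # one frequency per pair of dimensions -> tables of shape (seq_len, head_dim // 2)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (..., seq, head_dim); rotate each pair by its position-dependent angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Apply it to Q and K per head, after the projection and before computing attention scores.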
B.5 Mixture-of-Experts (MoE) Layer
When: increasingly used (Mixtral, DeepSeek-V3, GPT-MoE).
Design:
- Top-k gating: gate_logits = x @ W_gate; topk_weights, topk_idx = topk(gate_logits).
- Dispatch each token to its top-k experts (typically k=2 of 8 experts).
- Combine outputs weighted by gate.
- Load balancing loss to prevent expert collapse.
Reference: Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017). Fedus et al., Switch Transformers (2021).
Lab outcome: MoE is the primary architectural lever for parameter scaling without proportional FLOP scaling.
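A readability-first sketch that loops over experts instead of doing a fused dispatch; the auxiliary loss shown is the Switch Transformer-style balance term, and all dimension names are illustrative:

```python
# Top-k MoE layer sketch (PyTorch); written for clarity, not speed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k, self.n_experts = k, n_experts
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (tokens, d_model)
        gate_logits = self.gate(x)                           # (tokens, n_experts)
        probs = F.softmax(gate_logits, dim=-1)
        topk_w, topk_idx = probs.topk(self.k, dim=-1)        # (tokens, k)
        topk_w = topk_w / topk_w.sum(-1, keepdim=True)       # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e in range(self.n_experts):
            mask = (topk_idx == e)                           # which tokens route to expert e
            if not mask.any():
                continue
            tok_idx, slot = mask.nonzero(as_tuple=True)
            out[tok_idx] += topk_w[tok_idx, slot, None] * self.experts[e](x[tok_idx])
        # Switch-style load-balancing loss: fraction of tokens per expert x mean gate prob
        frac = F.one_hot(topk_idx[:, 0], self.n_experts).float().mean(0)
        aux_loss = self.n_experts * (frac * probs.mean(0)).sum()
        return out, aux_loss
```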
B.6 FlashAttention (Tiled Attention)
When: any context-length-bound workload.
Design (sketch):
- Tile Q in blocks of B_q.
- For each Q-tile, stream K and V in blocks of B_k.
- Maintain running max, running sum, and running output (online softmax).
- Write out the output block when the K/V sweep finishes.
Implementation: in Triton, ~150 lines. Beating FA-2 / FA-3 is hard; matching them on common shapes is achievable.
Reference: Dao et al., FlashAttention (2022) and FlashAttention-2 (2023).
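Before touching Triton, it helps to have the online-softmax recurrence as a plain-PyTorch numerical reference; this sketch is non-causal, single-head, and makes no performance claims:

```python
# Tiled attention with online softmax: a correctness reference for the FlashAttention recurrence.
import torch

def tiled_attention(q, k, v, block_q: int = 64, block_k: int = 64):
    # q, k, v: (seq, head_dim)
    T, d = q.shape
    out = torch.zeros_like(q)
    for qs in range(0, T, block_q):
        q_blk = q[qs:qs + block_q] * d ** -0.5
        m = torch.full((q_blk.shape[0],), float("-inf"), device=q.device)  # running row max
        l = torch.zeros(q_blk.shape[0], device=q.device)                   # running softmax denominator
        acc = torch.zeros(q_blk.shape[0], d, device=q.device)              # running unnormalized output
        for ks in range(0, T, block_k):
            s = q_blk @ k[ks:ks + block_k].T                # scores for this K/V tile
            m_new = torch.maximum(m, s.max(-1).values)
            p = torch.exp(s - m_new[:, None])
            correction = torch.exp(m - m_new)               # rescale old stats to the new max
            l = l * correction + p.sum(-1)
            acc = acc * correction[:, None] + p @ v[ks:ks + block_k]
            m = m_new
        out[qs:qs + block_q] = acc / l[:, None]
    return out
```

torch.allclose against the naive softmax(Q Kᵀ / √d) V on random inputs is the unit test; the Triton kernel keeps the same recurrence but holds each Q-tile's running statistics on-chip.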
B.7 Optimizer (AdamW)
When: foundational.
Design:
    m = β1*m + (1-β1)*g
    v = β2*v + (1-β2)*g²
    m_hat = m / (1 - β1^t)
    v_hat = v / (1 - β2^t)
    p = p - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * p)
foreach and fused variants in modern PyTorch fuse the per-parameter loops.
Variants worth understanding: Lion, AdEMAMix, Sophia, Muon (the Distributed Shampoo lineage). Most aim to halve optimizer-state memory or improve convergence.
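The update above, written as a single-tensor step in PyTorch (hyperparameter defaults are illustrative):

```python
# Minimal AdamW step for one parameter tensor; a sketch of the update rule above.
import torch

@torch.no_grad()
def adamw_step(p, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    # p: parameter, g: gradient, (m, v): first/second-moment state, t: 1-based step count
    m.mul_(beta1).add_(g, alpha=1 - beta1)            # m = β1·m + (1-β1)·g
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)     # v = β2·v + (1-β2)·g²
    m_hat = m / (1 - beta1 ** t)                      # bias correction
    v_hat = v / (1 - beta2 ** t)
    p.add_(m_hat / (v_hat.sqrt() + eps) + weight_decay * p, alpha=-lr)
    return p, m, v
```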
B.8 Data Loader
When: every training run.
Design:
- Memory-mapped binary token files (one giant np.uint16 or np.uint32 array).
- Random sampling: pick a position uniformly, slice context_length tokens.
- Cross-worker / cross-rank determinism: identical seed → identical sample order.
- Multi-process via torch.utils.data.DataLoader(num_workers=N).
Reference: nanoGPT's data prep + loader. Production scale: Mosaic StreamingDataset, FFCV.
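A sketch of the memmap-plus-uniform-sampling pattern; the file layout, dtype, and seeding scheme here are assumptions, not nanoGPT's exact code:

```python
# Memory-mapped token sampling sketch: one flat uint16 token file, uniform random windows.
import numpy as np
import torch

def get_batch(bin_path: str, batch_size: int, context_length: int, seed: int, step: int):
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")        # one giant token array on disk
    rng = np.random.default_rng(seed + step)                      # deterministic per (seed, step)
    starts = rng.integers(0, len(data) - context_length - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[s:s + context_length].astype(np.int64)) for s in starts])
    y = torch.stack([torch.from_numpy(data[s + 1:s + 1 + context_length].astype(np.int64)) for s in starts])
    return x, y                                                   # inputs and next-token targets
```

Offsetting the seed by worker/rank as well as by step keeps sampling deterministic without every rank drawing identical batches.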
B.9 Paged KV-Cache (Mini)
When: serving track capstone.
Design:
- Block pool: [num_blocks, num_heads, block_size, head_dim] for K and same for V.
- Free-block stack.
- Per-request page table: list of physical block indices.
- On decode step, gather indexed blocks for attention.
- On request completion, return blocks to free pool.
Reference: vLLM's core/block_manager.py. ~400 lines.
Lab outcome: this is the mini-vLLM capstone's foundation.
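A bookkeeping-only sketch of the block pool and per-request page tables (attention over the gathered blocks is left out; all shapes and names are illustrative, not vLLM's API):

```python
# Mini paged KV-cache: a block pool, a free-block stack, and per-request page tables.
import torch

class BlockPool:
    def __init__(self, num_blocks, num_heads, block_size, head_dim):
        self.k = torch.zeros(num_blocks, num_heads, block_size, head_dim)
        self.v = torch.zeros_like(self.k)
        self.block_size = block_size
        self.free = list(range(num_blocks))            # free-block stack
        self.page_tables = {}                          # request id -> list of physical block ids

    def append_token(self, req_id, k_vec, v_vec, pos):
        # allocate a new physical block when the request crosses a block boundary
        table = self.page_tables.setdefault(req_id, [])
        if pos % self.block_size == 0:
            table.append(self.free.pop())
        block, slot = table[pos // self.block_size], pos % self.block_size
        self.k[block, :, slot] = k_vec                 # k_vec, v_vec: (num_heads, head_dim)
        self.v[block, :, slot] = v_vec

    def gather(self, req_id, length):
        # materialize this request's K/V for one decode step's attention
        table = self.page_tables[req_id]
        k = torch.cat([self.k[b] for b in table], dim=1)[:, :length]   # (heads, length, head_dim)
        v = torch.cat([self.v[b] for b in table], dim=1)[:, :length]
        return k, v

    def release(self, req_id):
        self.free.extend(self.page_tables.pop(req_id))  # return blocks to the free pool
```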
B.10 Continuous Batching Scheduler (Mini)
When: serving track.
Design: iteration-level scheduling loop, admitting and retiring requests on every decode step:

    while True:
        batch = pick_runnable_requests()       # admit requests while the KV-block budget allows
        logits = model.forward(batch)          # one decode step for the whole batch
        tokens = sample(logits)
        for req, tok in zip(batch, tokens):
            req.append(tok)
            if req.is_done():
                finalize(req); free_blocks(req)
Reference: Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022).
B.11 Speculative Decoding Loop
When: high-priority inference latency wins.
Design:
    while not done:
        drafts = draft_model.generate_k(state, k=K)                  # K cheap tokens from the draft model
        target_logits = target_model.parallel_verify(state, drafts)  # one target forward over the K drafts
        accepted = longest_match(target_logits, drafts)              # accept the agreeing prefix
        state.append(accepted)
        if len(accepted) < K:
            state.append(sample(target_logits[len(accepted)]))       # resample at the first rejected position
Reference: Leviathan et al., Fast Inference from Transformers via Speculative Decoding (ICML 2023).
B.12 NCCL Collective From Scratch (Bonus)
When: training-systems track capstone.
Design:
- Implement ring-allreduce yourself using point-to-point send/recv (NCCL or MPI).
- Compare bandwidth to dist.all_reduce.
Lab outcome: internalize why ring-allreduce achieves bandwidth-optimal scaling.
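A sketch using torch.distributed point-to-point ops, assuming an initialized process group and a tensor whose length divides evenly by world size; the two phases are reduce-scatter, then allgather:

```python
# Ring-allreduce sketch with torch.distributed P2P ops (batched isend/irecv for NCCL safety).
import torch
import torch.distributed as dist

def ring_allreduce(tensor: torch.Tensor) -> torch.Tensor:
    world, rank = dist.get_world_size(), dist.get_rank()
    left, right = (rank - 1) % world, (rank + 1) % world
    chunks = list(tensor.chunk(world))                  # views into `tensor`, one chunk per rank
    recv = torch.empty_like(chunks[0])

    def exchange(send_buf):
        # post send-to-right and recv-from-left together, then wait on both
        ops = [dist.P2POp(dist.isend, send_buf.contiguous(), right),
               dist.P2POp(dist.irecv, recv, left)]
        for req in dist.batch_isend_irecv(ops):
            req.wait()

    # Phase 1: reduce-scatter. After world-1 steps, chunk (rank+1) % world holds the full sum.
    for step in range(world - 1):
        send_idx = (rank - step) % world
        recv_idx = (rank - step - 1) % world
        exchange(chunks[send_idx])
        chunks[recv_idx] += recv

    # Phase 2: allgather. Circulate the fully reduced chunks around the ring.
    for step in range(world - 1):
        send_idx = (rank + 1 - step) % world
        recv_idx = (rank - step) % world
        exchange(chunks[send_idx])
        chunks[recv_idx].copy_(recv)

    return tensor
```

Each rank sends and receives 2·(N-1)/N of the tensor in total, independent of N, which is why the ring is bandwidth-optimal.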
Difficulty Ranking
| Tier | Builds |
|---|---|
| Warmup | Tokenizer (BPE), AdamW, RMSNorm |
| Intermediate | Multi-head attention with KV-cache, RoPE, Data loader, MoE |
| Advanced | FlashAttention in Triton, Paged KV-cache, Continuous batching scheduler |
| Expert | NCCL ring-allreduce, mini-vLLM end-to-end, FP8 training stable for >100k steps |
Pick at least one from each tier. Ship with benchmarks, profiling artifacts, and a writeup.