Appendix B: Build-From-Scratch Reference
A working AI systems engineer should have implemented each of the following at least once. These are the building blocks from which every modern transformer is assembled.
B.1 Tokenizer (BPE)
When: foundational; everything downstream assumes one.
Design:
- Train a byte-pair-encoding (BPE) tokenizer on a small corpus (TinyShakespeare or a Wikipedia subset).
- Implement train, encode, decode. Save/load merges.
- Test round-trip on held-out text.
Reference: Karpathy's minbpe repo. ~300 lines of Python, well-commented.
Lab outcome: internalize the input layer of every modern LLM. Understand why context-length and tokenization are coupled.
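A minimal training/encode/decode sketch in the spirit of minbpe (not its exact code); the byte-level starting vocabulary and `num_merges` are illustrative choices:

```python
# Byte-level BPE sketch: greedy pair merging over raw UTF-8 bytes.
from collections import Counter

def merge(ids, pair, new_id):
    # replace every adjacent occurrence of `pair` with `new_id`
    out, j = [], 0
    while j < len(ids):
        if j + 1 < len(ids) and (ids[j], ids[j + 1]) == pair:
            out.append(new_id); j += 2
        else:
            out.append(ids[j]); j += 1
    return out

def train_bpe(text: str, num_merges: int) -> dict:
    ids = list(text.encode("utf-8"))            # base vocabulary = 256 byte values
    merges = {}                                  # (left_id, right_id) -> new token id
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))       # count adjacent token pairs
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]        # greedily merge the most frequent pair
        merges[pair] = 256 + i
        ids = merge(ids, pair, 256 + i)
    return merges

def encode(text: str, merges: dict):
    ids = list(text.encode("utf-8"))
    for pair, new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        ids = merge(ids, pair, new_id)           # apply merges in the order they were learned
    return ids

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

The round-trip test is `decode(encode(s, merges), merges) == s` on held-out text.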
B.2 Multi-Head Attention
When: the heart of the transformer.
Design:
- Standard Q, K, V = x @ W_qkv; O = softmax(Q K^T / √d) V; out = O @ W_out.
- With causal mask for autoregressive use.
- With GQA (grouped-query attention: fewer K/V heads, shared across groups of query heads).
- With KV-cache for decode.
Reference: nanoGPT's model.py. PyTorch reference; ~50 lines.
Lab outcome: understand the operation that defines this era. Read every variant (sliding window, ALiBi, RoPE) against this baseline.
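A minimal causal implementation in PyTorch, close in spirit to nanoGPT's attention module but not its exact code; GQA and the KV-cache are additions to this skeleton:

```python
# Minimal causal multi-head self-attention (sketch; dimension names are illustrative).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_qkv = nn.Linear(d_model, 3 * d_model, bias=False)   # fused Q, K, V projection
        self.w_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                                # x: (batch, seq, d_model)
        B, T, C = x.shape
        q, k, v = self.w_qkv(x).split(C, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)   # (B, H, T, T)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        att = att.masked_fill(mask, float("-inf"))                 # causal mask
        o = F.softmax(att, dim=-1) @ v                              # (B, H, T, d_head)
        return self.w_out(o.transpose(1, 2).reshape(B, T, C))
```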
B.3 RMSNorm
When: replaces LayerNorm in modern LLMs (Llama, Qwen, etc.).
Design:
    def rmsnorm(x, weight, eps=1e-6):
        # scale x by the reciprocal RMS over the last dim, then by a learned per-channel weight
        rms = (x.pow(2).mean(-1, keepdim=True) + eps).rsqrt()
        return x * rms * weight
B.4 Rotary Position Embedding (RoPE)
When: every modern open LLM since 2022.
Design:
- Precompute cos/sin tables for all positions × half head dim.
- Apply to Q and K after the projection, before attention.
- Context extension: position interpolation (linear), NTK-aware scaling, and YaRN variants.
Reference: Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2021).
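A sketch of the precompute/apply pattern, using the interleaved even/odd pairing (conventions differ between implementations; names here are illustrative):

```python
# RoPE sketch (PyTorch), rotating each (even, odd) dimension pair; base=10000 as in the paper.
import torch

def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    # one frequency per pair of dimensions -> tables of shape (seq_len, head_dim // 2)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (..., seq, head_dim); rotate each pair by its position-dependent angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Apply it to Q and K per head, after the projection and before computing attention scores.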
B.5 Mixture-of-Experts (MoE) Layer
When: increasingly used (Mixtral, DeepSeek-V3, GPT-MoE).
Design:
- Top-k gating: gate_logits = x @ W_gate; topk_weights, topk_idx = topk(gate_logits).
- Dispatch each token to its top-k experts (typically k=2 of 8 experts).
- Combine outputs weighted by gate.
- Load balancing loss to prevent expert collapse.
Reference: Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017). Fedus et al., Switch Transformers (2021).
Lab outcome: MoE is the primary architectural lever for parameter scaling without proportional FLOP scaling.
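A readability-first sketch that loops over experts instead of doing a fused dispatch; the auxiliary loss shown is the Switch Transformer-style balance term, and all dimension names are illustrative:

```python
# Top-k MoE layer sketch (PyTorch); written for clarity, not speed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k, self.n_experts = k, n_experts
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (tokens, d_model)
        gate_logits = self.gate(x)                           # (tokens, n_experts)
        probs = F.softmax(gate_logits, dim=-1)
        topk_w, topk_idx = probs.topk(self.k, dim=-1)        # (tokens, k)
        topk_w = topk_w / topk_w.sum(-1, keepdim=True)       # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e in range(self.n_experts):
            mask = (topk_idx == e)                           # which tokens route to expert e
            if not mask.any():
                continue
            tok_idx, slot = mask.nonzero(as_tuple=True)
            out[tok_idx] += topk_w[tok_idx, slot, None] * self.experts[e](x[tok_idx])
        # Switch-style load-balancing loss: fraction of tokens per expert x mean gate prob
        frac = F.one_hot(topk_idx[:, 0], self.n_experts).float().mean(0)
        aux_loss = self.n_experts * (frac * probs.mean(0)).sum()
        return out, aux_loss
```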
B.6 FlashAttention (Tiled Attention)
When: any context-length-bound workload.
Design (sketch):
- Tile Q in blocks of B_q.
- For each Q-tile, stream K and V in blocks of B_k.
- Maintain running max, running sum, and running output (online softmax).
- Write out the output block when the K/V sweep finishes.
Implementation: in Triton, ~150 lines. Beating FA-2 / FA-3 is hard; matching them on common shapes is achievable.
Reference: Dao et al., FlashAttention (2022) and FlashAttention-2 (2023).
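Before touching Triton, it helps to have the online-softmax recurrence as a plain-PyTorch numerical reference; this sketch is non-causal, single-head, and makes no performance claims:

```python
# Tiled attention with online softmax: a correctness reference for the FlashAttention recurrence.
import torch

def tiled_attention(q, k, v, block_q: int = 64, block_k: int = 64):
    # q, k, v: (seq, head_dim)
    T, d = q.shape
    out = torch.zeros_like(q)
    for qs in range(0, T, block_q):
        q_blk = q[qs:qs + block_q] * d ** -0.5
        m = torch.full((q_blk.shape[0],), float("-inf"), device=q.device)  # running row max
        l = torch.zeros(q_blk.shape[0], device=q.device)                   # running softmax denominator
        acc = torch.zeros(q_blk.shape[0], d, device=q.device)              # running unnormalized output
        for ks in range(0, T, block_k):
            s = q_blk @ k[ks:ks + block_k].T                # scores for this K/V tile
            m_new = torch.maximum(m, s.max(-1).values)
            p = torch.exp(s - m_new[:, None])
            correction = torch.exp(m - m_new)               # rescale old stats to the new max
            l = l * correction + p.sum(-1)
            acc = acc * correction[:, None] + p @ v[ks:ks + block_k]
            m = m_new
        out[qs:qs + block_q] = acc / l[:, None]
    return out
```

torch.allclose against the naive softmax(Q Kᵀ / √d) V on random inputs is the unit test; the Triton kernel keeps the same recurrence but holds each Q-tile's running statistics on-chip.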
B.7 Optimizer (AdamW)
When: foundational.
Design:
    m = β1*m + (1-β1)*g
    v = β2*v + (1-β2)*g²
    m_hat = m / (1 - β1^t)
    v_hat = v / (1 - β2^t)
    p = p - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * p)
foreach and fused variants in modern PyTorch fuse the per-parameter loops.
Variants worth understanding: Lion, AdEMAMix, Sophia, Muon (the Distributed Shampoo lineage). Most aim to halve optimizer-state memory or improve convergence.
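The update above, written as a single-tensor step in PyTorch (hyperparameter defaults are illustrative):

```python
# Minimal AdamW step for one parameter tensor; a sketch of the update rule above.
import torch

@torch.no_grad()
def adamw_step(p, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    # p: parameter, g: gradient, (m, v): first/second-moment state, t: 1-based step count
    m.mul_(beta1).add_(g, alpha=1 - beta1)            # m = β1·m + (1-β1)·g
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)     # v = β2·v + (1-β2)·g²
    m_hat = m / (1 - beta1 ** t)                      # bias correction
    v_hat = v / (1 - beta2 ** t)
    p.add_(m_hat / (v_hat.sqrt() + eps) + weight_decay * p, alpha=-lr)
    return p, m, v
```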
B.8 Data Loader
When: every training run.
Design:
- Memory-mapped binary token files (one giant np.uint16 or np.uint32 array).
- Random sampling: pick a position uniformly, slice context_length tokens.
- Cross-worker / cross-rank determinism: identical seed → identical sample order.
- Multi-process via torch.utils.data.DataLoader(num_workers=N).
Reference: nanoGPT's data prep + loader. Production scale: Mosaic StreamingDataset, FFCV.
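A sketch of the memmap-plus-uniform-sampling pattern; the file layout, dtype, and seeding scheme here are assumptions, not nanoGPT's exact code:

```python
# Memory-mapped token sampling sketch: one flat uint16 token file, uniform random windows.
import numpy as np
import torch

def get_batch(bin_path: str, batch_size: int, context_length: int, seed: int, step: int):
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")        # one giant token array on disk
    rng = np.random.default_rng(seed + step)                      # deterministic per (seed, step)
    starts = rng.integers(0, len(data) - context_length - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[s:s + context_length].astype(np.int64)) for s in starts])
    y = torch.stack([torch.from_numpy(data[s + 1:s + 1 + context_length].astype(np.int64)) for s in starts])
    return x, y                                                   # inputs and next-token targets
```

Offsetting the seed by worker/rank as well as by step keeps sampling deterministic without every rank drawing identical batches.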
B.9 Paged KV-Cache (Mini)
When: serving track capstone.
Design:
- Block pool: [num_blocks, num_heads, block_size, head_dim] for K and same for V.
- Free-block stack.
- Per-request page table: list of physical block indices.
- On decode step, gather indexed blocks for attention.
- On request completion, return blocks to free pool.
Reference: vLLM's core/block_manager.py. ~400 lines.
Lab outcome: this is the mini-vLLM capstone's foundation.
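A bookkeeping-only sketch of the block pool and per-request page tables (attention over the gathered blocks is left out; all shapes and names are illustrative, not vLLM's API):

```python
# Mini paged KV-cache: a block pool, a free-block stack, and per-request page tables.
import torch

class BlockPool:
    def __init__(self, num_blocks, num_heads, block_size, head_dim):
        self.k = torch.zeros(num_blocks, num_heads, block_size, head_dim)
        self.v = torch.zeros_like(self.k)
        self.block_size = block_size
        self.free = list(range(num_blocks))            # free-block stack
        self.page_tables = {}                          # request id -> list of physical block ids

    def append_token(self, req_id, k_vec, v_vec, pos):
        # allocate a new physical block when the request crosses a block boundary
        table = self.page_tables.setdefault(req_id, [])
        if pos % self.block_size == 0:
            table.append(self.free.pop())
        block, slot = table[pos // self.block_size], pos % self.block_size
        self.k[block, :, slot] = k_vec                 # k_vec, v_vec: (num_heads, head_dim)
        self.v[block, :, slot] = v_vec

    def gather(self, req_id, length):
        # materialize this request's K/V for one decode step's attention
        table = self.page_tables[req_id]
        k = torch.cat([self.k[b] for b in table], dim=1)[:, :length]   # (heads, length, head_dim)
        v = torch.cat([self.v[b] for b in table], dim=1)[:, :length]
        return k, v

    def release(self, req_id):
        self.free.extend(self.page_tables.pop(req_id))  # return blocks to the free pool
```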
B.10 Continuous Batching Scheduler (Mini)
When: serving track.
Design: iteration-level scheduling loop, admitting and retiring requests on every decode step:

    while True:
        batch = pick_runnable_requests()       # admit requests while the KV-block budget allows
        logits = model.forward(batch)          # one decode step for the whole batch
        tokens = sample(logits)
        for req, tok in zip(batch, tokens):
            req.append(tok)
            if req.is_done():
                finalize(req); free_blocks(req)
Reference: Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022).
B.11 Speculative Decoding Loop
When: high-priority inference latency wins.
Design:
    while not done:
        drafts = draft_model.generate_k(state, k=K)                  # K cheap tokens from the draft model
        target_logits = target_model.parallel_verify(state, drafts)  # one target forward over the K drafts
        accepted = longest_match(target_logits, drafts)              # accept the agreeing prefix
        state.append(accepted)
        if len(accepted) < K:
            state.append(sample(target_logits[len(accepted)]))       # resample at the first rejected position
Reference: Leviathan et al., Fast Inference from Transformers via Speculative Decoding (ICML 2023).
B.12 NCCL Collective From Scratch (Bonus)
When: training-systems track capstone.
Design:
- Implement ring-allreduce yourself using point-to-point send/recv (NCCL or MPI).
- Compare bandwidth to dist.all_reduce.
Lab outcome: internalize why ring-allreduce achieves bandwidth-optimal scaling.
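A sketch using torch.distributed point-to-point ops, assuming an initialized process group and a tensor whose length divides evenly by world size; the two phases are reduce-scatter, then allgather:

```python
# Ring-allreduce sketch with torch.distributed P2P ops (batched isend/irecv for NCCL safety).
import torch
import torch.distributed as dist

def ring_allreduce(tensor: torch.Tensor) -> torch.Tensor:
    world, rank = dist.get_world_size(), dist.get_rank()
    left, right = (rank - 1) % world, (rank + 1) % world
    chunks = list(tensor.chunk(world))                  # views into `tensor`, one chunk per rank
    recv = torch.empty_like(chunks[0])

    def exchange(send_buf):
        # post send-to-right and recv-from-left together, then wait on both
        ops = [dist.P2POp(dist.isend, send_buf.contiguous(), right),
               dist.P2POp(dist.irecv, recv, left)]
        for req in dist.batch_isend_irecv(ops):
            req.wait()

    # Phase 1: reduce-scatter. After world-1 steps, chunk (rank+1) % world holds the full sum.
    for step in range(world - 1):
        send_idx = (rank - step) % world
        recv_idx = (rank - step - 1) % world
        exchange(chunks[send_idx])
        chunks[recv_idx] += recv

    # Phase 2: allgather. Circulate the fully reduced chunks around the ring.
    for step in range(world - 1):
        send_idx = (rank + 1 - step) % world
        recv_idx = (rank - step) % world
        exchange(chunks[send_idx])
        chunks[recv_idx].copy_(recv)

    return tensor
```

Each rank sends and receives 2·(N-1)/N of the tensor in total, independent of N, which is why the ring is bandwidth-optimal.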
Difficulty Ranking
| Tier | Builds |
|---|---|
| Warmup | Tokenizer (BPE), AdamW, RMSNorm |
| Intermediate | Multi-head attention with KV-cache, RoPE, Data loader, MoE |
| Advanced | FlashAttention in Triton, Paged KV-cache, Continuous batching scheduler |
| Expert | NCCL ring-allreduce, mini-vLLM end-to-end, FP8 training stable for >100k steps |
Pick at least one from each tier. Ship with benchmarks, profiling artifacts, and a writeup.