Month 2-Week 2: Experiment tracking, ablations, and seed variance¶
Week summary¶
- Goal: Continue the course. Add Weights & Biases experiment tracking. Run an ablation study with 3 variants × 3 seeds and report results with proper variance estimates.
- Time: ~9 h over 3 sessions.
- Output: Course week 3 notebooks; a documented ablation study with seed-variance bars.
- Sequences relied on: 06-classical-ml rungs 04–08; 03-probability-statistics rung 09; 05-pytorch rung 06.
Why this week matters¶
Most ML papers and most engineering teams ship noise. They run a model once, see a number better than baseline, and claim improvement. With seed variance often as large as the "improvement," the claim is meaningless. Building the discipline of seed-variance-aware comparison is what separates rigorous AI engineers from the rest. You'll apply the same discipline in Q2 to LLM evaluation, where it matters even more: LLM outputs are noisy, judges are noisy, and "this prompt is better" without variance bars is folklore.
Experiment tracking (W&B or MLflow) is also the cheapest engineering improvement you'll make this year. Once it's a habit, you'll never lose a result again.
Prerequisites¶
- M01 + M02-W01 complete.
- Course path locked.
Recommended cadence¶
- Session A (Tue/Wed evening, ~3 h): course + W&B setup
- Session B (Sat morning, ~3.5 h): ablation study (3 × 3)
- Session C (Sun afternoon, ~2.5 h): analysis + write-up
Session A: Course week 3 + Weights & Biases¶
Goal: Continue course. Set up W&B and integrate it into your training loop.
Part 1: Course material (90 min)¶
fast.ai path: Lesson 3, "Neural net foundations":
- Watch the lecture.
- Key concepts: SGD from scratch, the learning rate finder, fine-tuning vs. feature extraction.
- Work through the "from scratch" notebook.
Ng path: Course 1 weeks 2–3 (multivariate regression, regularization).
Part 2: W&B setup (45 min)¶
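If this is your first time using W&B, do the one-time install and login before touching the training loop. A minimal sketch, assuming a free wandb.ai account and `pip install wandb`:

```python
import wandb

# First call prompts for your API key (from your wandb.ai account), then caches it locally.
wandb.login()
```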
Integrate into training:
```python
import wandb

wandb.init(project="classical-ml", name="baseline-run-1",
           config={"lr": 0.01, "batch_size": 64, "model": "resnet34"})

for epoch in range(epochs):
    # ... train ...
    wandb.log({"epoch": epoch, "train_loss": train_loss,
               "val_loss": val_loss, "val_acc": val_acc})
```
Run a baseline. One full training run with logging. Verify charts appear in the W&B UI.
Part 3: Read the course content actively (45 min)¶
Whichever course: pick one concept from this week's lecture you didn't fully grasp. Read about it from a second source (Goodfellow chapter, blog post, paper). Synthesize a 200-word note in your repo.
Output of Session A¶
- Course week 3 notebook in repo.
- One W&B baseline run.
- Synthesis note on a single concept.
Session B: Ablation study (3 variants × 3 seeds)¶
Goal: Run 9 training runs (3 variants × 3 seeds) and capture results in W&B.
Part 1: Pick the dataset and the variants (30 min)¶
Dataset: Whatever your course week is using (CIFAR-10 / Fashion-MNIST / Pets).
Variants:
1. Baseline: vanilla architecture, no augmentation, no regularization.
2. + Data augmentation: add RandomCrop and RandomHorizontalFlip.
3. + Dropout: add Dropout(p=0.2) to the model head.
(or another set of variants relevant to the course material).
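To make the variants concrete, here is a minimal sketch of what a `build_model(variant)` helper (used in the script below) could look like, plus a hypothetical `build_train_transform` for the augmentation variants. It assumes a torchvision ResNet18 with a 10-class head, CIFAR-10-style 32×32 RGB inputs, and cumulative variants (dropout builds on top of augmentation); adjust all of these to match your course setup.

```python
import torch.nn as nn
from torchvision import models, transforms

def build_model(variant: str) -> nn.Module:
    # Hypothetical helper: small ResNet18 with a 10-class head (CIFAR-10-style data).
    model = models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 10)
    if variant == "dropout":
        # Variant 3: Dropout(p=0.2) in front of the classification head.
        model.fc = nn.Sequential(nn.Dropout(p=0.2), model.fc)
    return model

def build_train_transform(variant: str) -> transforms.Compose:
    # Variants 2 and 3 add RandomCrop + RandomHorizontalFlip; the baseline gets neither.
    augment = ([transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip()]
               if variant in ("augment", "dropout") else [])
    return transforms.Compose(augment + [transforms.ToTensor()])
```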
Part 2: Run the 9 experiments (120 min)¶
Write a single script with seed control:
```python
import random

import numpy as np
import torch
import wandb

def set_seed(s):
    # Seed every RNG the run touches so each (variant, seed) pair is reproducible.
    random.seed(s)
    np.random.seed(s)
    torch.manual_seed(s)
    torch.cuda.manual_seed_all(s)  # no-op on CPU-only machines

variants = ["baseline", "augment", "dropout"]
seeds = [0, 1, 2]

for variant in variants:
    for seed in seeds:
        set_seed(seed)
        wandb.init(project="ablation", name=f"{variant}-seed{seed}",
                   config={"variant": variant, "seed": seed})
        model = build_model(variant)  # your model/variant helper (see sketch above)
        train(model)                  # your course training loop
        wandb.finish()
```
Estimated time depends on your dataset and hardware. Pick a dataset and model small enough that all 9 runs finish in roughly 90 minutes, leaving the rest of the session for setup and debugging.
Part 3: Aggregate results (30 min)¶
Pull the per-run final accuracies from W&B (CSV export or `wandb.Api()`) and answer the question: is the gap between augment and baseline larger than the seed-to-seed variance? Rule of thumb: if mean_diff > 2 · combined_std, the difference is plausibly significant. A sketch of both steps follows below.
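A minimal aggregation sketch using the public `wandb.Api()`. Here `your-entity/ablation` is a placeholder for your own W&B entity and project, it assumes each run logged `val_acc` into its summary, and it reads "combined_std" as the square root of the summed per-variant variances:

```python
from collections import defaultdict

import numpy as np
import wandb

api = wandb.Api()
accs = defaultdict(list)
for run in api.runs("your-entity/ablation"):  # the 9 ablation runs
    accs[run.config["variant"]].append(run.summary["val_acc"])

for variant, vals in accs.items():
    print(f"{variant}: mean={np.mean(vals):.4f}  std={np.std(vals, ddof=1):.4f}")

# Rule-of-thumb check: is the augment-vs-baseline gap bigger than the seed noise?
diff = np.mean(accs["augment"]) - np.mean(accs["baseline"])
combined_std = np.sqrt(np.var(accs["augment"], ddof=1) + np.var(accs["baseline"], ddof=1))
print("plausibly significant" if diff > 2 * combined_std else "within seed noise")
```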
Output of Session B¶
- 9 W&B runs.
- A summary table with mean and std per variant.
Session C: Bootstrap CIs, write-up, push¶
Goal: Compute proper bootstrap confidence intervals. Write up the analysis honestly.
Part 1: Bootstrap CIs (60 min)¶
Bootstrap CI = "what would the mean look like if we re-sampled?" It is a non-parametric way to estimate uncertainty.
```python
import numpy as np

def bootstrap_ci(samples, n=10000, alpha=0.05):
    # Percentile bootstrap: resample with replacement, take means,
    # then read off the (alpha/2, 1 - alpha/2) percentiles.
    samples = np.array(samples)
    boot_means = [np.random.choice(samples, len(samples), replace=True).mean()
                  for _ in range(n)]
    return np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

baseline_accs = [0.842, 0.838, 0.851]
ci_low, ci_high = bootstrap_ci(baseline_accs)
print(f"baseline: 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```
Apply to all 3 variants. Plot a bar chart with error bars.
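One way to draw that chart, continuing from the snippet above (`bootstrap_ci` and `baseline_accs` are defined there; `augment_accs` and `dropout_accs` stand in for the per-seed accuracies you collected):

```python
import matplotlib.pyplot as plt
import numpy as np

results = {"baseline": baseline_accs, "augment": augment_accs, "dropout": dropout_accs}

names = list(results)
means = [np.mean(results[v]) for v in names]
# Asymmetric error bars: distance from the mean down to the CI low and up to the CI high.
cis = [bootstrap_ci(results[v]) for v in names]
yerr = np.array([[m - lo, hi - m] for m, (lo, hi) in zip(means, cis)]).T

plt.bar(names, means, yerr=yerr, capsize=6)
plt.ylabel("Validation accuracy")
plt.title("Ablation, 3 seeds: mean accuracy with 95% bootstrap CI")
plt.savefig("ablation_ci.png", dpi=150)
```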
Part 2: Write up the result honestly (60 min)¶
Add a results.md to your repo:
# Ablation: data augmentation and dropout on Fashion-MNIST
## Setup
- Architecture: ResNet18 fine-tuned
- Optimizer: SGD lr=0.01
- 3 seeds per variant
## Results (3 seeds, 95% bootstrap CI)
| Variant | Mean | 95% CI |
|---|---|---|
| Baseline | 0.842 | [0.838, 0.851] |
| + Augment | 0.861 | [0.852, 0.870] |
| + Dropout | 0.849 | [0.842, 0.858] |
## Conclusion
Augmentation produced a real lift (the baseline and augment CIs are disjoint, if only barely). Dropout's effect was within seed noise; we could not conclude it helped at this scale.
Key discipline: be willing to say "the data does not support a difference." This is what rigor looks like.
Part 3: Push + forward look (30 min)¶
- Push results to repo.
- Update `LEARNING_LOG.md` with one paragraph: "Why I now distrust 1-seed result claims."
- Read M02-W03.md.
Output of Session C¶
- `results.md` in repo with bootstrap CIs.
- Bar chart with error bars committed.
End-of-week artifact¶
- 9 W&B runs in a project
- `results.md` with bootstrap CIs
- Bar chart visualization
- Course week 3 notebook
End-of-week self-assessment¶
- I can write the seed-control boilerplate from memory.
- I can compute a bootstrap CI from scratch.
- I can read a paper and ask "did they report seed variance?"
Common failure modes for this week¶
- Running 1 seed and claiming improvement. This is the failure mode the week is designed against.
- Skipping bootstrap because t-tests feel cleaner. Bootstrap is more robust and doesn't assume normality.
- Hiding the negative result. Reporting "no effect" is more valuable than a fake positive.
What's next (preview of M02-W03)¶
Tabular ML and gradient-boosted trees. You'll train an XGBoost model and beat a simple neural net on tabular data: a maturity marker that separates real practitioners from hipsters.