Month 2-Week 2: Experiment tracking, ablations, and seed variance¶
Week summary¶
- Goal: Continue the course. Add Weights & Biases experiment tracking. Run an ablation study with 3 variants × 3 seeds and report results with proper variance estimates.
- Time: ~9 h over 3 sessions.
- Output: Course week 3 notebooks; a documented ablation study with seed-variance bars.
- Sequences relied on: 06-classical-ml rungs 04–08; 03-probability-statistics rung 09; 05-pytorch rung 06.
Why this week matters¶
Most ML papers and most engineering teams ship noise. They run a model once, see a number better than baseline, and claim improvement. With seed variance often as large as the "improvement," the claim is meaningless. Building the discipline of seed-variance-aware comparison is what separates rigorous AI engineers from the rest. You'll apply the same discipline in Q2 to LLM evaluation, where it matters even more: LLM outputs are noisy, judges are noisy, and "this prompt is better" without variance bars is folklore.
Experiment tracking (W&B or MLflow) is also the cheapest engineering improvement you'll make this year. Once it's a habit, you'll never lose a result again.
Prerequisites¶
- M01 + M02-W01 complete.
- Course path locked.
Recommended cadence¶
- Session A (Tue/Wed evening, ~3 h): course + W&B setup
- Session B (Sat morning, ~3.5 h): ablation study (3 × 3)
- Session C (Sun afternoon, ~2.5 h): analysis + write-up
Session A: Course week 3 + Weights & Biases¶
Goal: Continue course. Set up W&B and integrate it into your training loop.
Part 1: Course material (90 min)¶
fast.ai path: Lesson 3, "Neural net foundations":
- Watch the lecture.
- Key concepts: SGD from scratch, the learning rate finder, fine-tuning vs. feature extraction.
- Work through the "from scratch" notebook.
Ng path: Course 1 weeks 2–3 (multivariate regression, regularization).
Part 2: W&B setup (45 min)¶
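If this is your first time using W&B, do the one-time install and login before touching the training loop. A minimal sketch, assuming a free wandb.ai account and `pip install wandb`:

```python
import wandb

# First call prompts for your API key (from your wandb.ai account), then caches it locally.
wandb.login()
```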
Integrate into training:
```python
import wandb

wandb.init(project="classical-ml", name="baseline-run-1",
           config={"lr": 0.01, "batch_size": 64, "model": "resnet34"})

for epoch in range(epochs):
    # ... train ...
    wandb.log({"epoch": epoch, "train_loss": train_loss,
               "val_loss": val_loss, "val_acc": val_acc})
```
Run a baseline. One full training run with logging. Verify charts appear in the W&B UI.
Part 3: Read the course content actively (45 min)¶
Whichever course: pick one concept from this week's lecture you didn't fully grasp. Read about it from a second source (Goodfellow chapter, blog post, paper). Synthesize a 200-word note in your repo.
Output of Session A¶
- Course week 3 notebook in repo.
- One W&B baseline run.
- Synthesis note on a single concept.
Session B: Ablation study (3 variants × 3 seeds)¶
Goal: Run 9 training runs (3 variants × 3 seeds) and capture results in W&B.
Part 1: Pick the dataset and the variants (30 min)¶
Dataset: Whatever your course week is using (CIFAR-10 / Fashion-MNIST / Pets).
Variants:
1. Baseline: vanilla architecture, no augmentation, no regularization.
2. + Data augmentation: add RandomCrop and RandomHorizontalFlip.
3. + Dropout: add Dropout(p=0.2) to the model head.
(or another set of variants relevant to the course material).
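To make the variants concrete, here is a minimal sketch of what a `build_model(variant)` helper (used in the script below) could look like, plus a hypothetical `build_train_transform` for the augmentation variants. It assumes a torchvision ResNet18 with a 10-class head, CIFAR-10-style 32×32 RGB inputs, and cumulative variants (dropout builds on top of augmentation); adjust all of these to match your course setup.

```python
import torch.nn as nn
from torchvision import models, transforms

def build_model(variant: str) -> nn.Module:
    # Hypothetical helper: small ResNet18 with a 10-class head (CIFAR-10-style data).
    model = models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 10)
    if variant == "dropout":
        # Variant 3: Dropout(p=0.2) in front of the classification head.
        model.fc = nn.Sequential(nn.Dropout(p=0.2), model.fc)
    return model

def build_train_transform(variant: str) -> transforms.Compose:
    # Variants 2 and 3 add RandomCrop + RandomHorizontalFlip; the baseline gets neither.
    augment = ([transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip()]
               if variant in ("augment", "dropout") else [])
    return transforms.Compose(augment + [transforms.ToTensor()])
```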
Part 2: Run the 9 experiments (120 min)¶
Write a single script with seed control:
```python
import random

import numpy as np
import torch
import wandb

def set_seed(s):
    # Seed every RNG the run touches so each (variant, seed) pair is reproducible.
    random.seed(s)
    np.random.seed(s)
    torch.manual_seed(s)
    torch.cuda.manual_seed_all(s)  # no-op on CPU-only machines

variants = ["baseline", "augment", "dropout"]
seeds = [0, 1, 2]

for variant in variants:
    for seed in seeds:
        set_seed(seed)
        wandb.init(project="ablation", name=f"{variant}-seed{seed}",
                   config={"variant": variant, "seed": seed})
        model = build_model(variant)  # your model/variant helper (see sketch above)
        train(model)                  # your course training loop
        wandb.finish()
```
Estimated time depends on your dataset and hardware. Pick a dataset and model small enough that all 9 runs finish in roughly 90 minutes, leaving the rest of the session for setup and debugging.
Part 3: Aggregate results (30 min)¶
Pull the per-run final accuracies from W&B (CSV export or `wandb.Api()`) and answer the question: is the gap between augment and baseline larger than the seed-to-seed variance? Rule of thumb: if mean_diff > 2 · combined_std, the difference is plausibly significant. A sketch of both steps follows below.
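A minimal aggregation sketch using the public `wandb.Api()`. Here `your-entity/ablation` is a placeholder for your own W&B entity and project, it assumes each run logged `val_acc` into its summary, and it reads "combined_std" as the square root of the summed per-variant variances:

```python
from collections import defaultdict

import numpy as np
import wandb

api = wandb.Api()
accs = defaultdict(list)
for run in api.runs("your-entity/ablation"):  # the 9 ablation runs
    accs[run.config["variant"]].append(run.summary["val_acc"])

for variant, vals in accs.items():
    print(f"{variant}: mean={np.mean(vals):.4f}  std={np.std(vals, ddof=1):.4f}")

# Rule-of-thumb check: is the augment-vs-baseline gap bigger than the seed noise?
diff = np.mean(accs["augment"]) - np.mean(accs["baseline"])
combined_std = np.sqrt(np.var(accs["augment"], ddof=1) + np.var(accs["baseline"], ddof=1))
print("plausibly significant" if diff > 2 * combined_std else "within seed noise")
```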
Output of Session B¶
- 9 W&B runs.
- A summary table with mean and std per variant.
Session C: Bootstrap CIs, write-up, push¶
Goal: Compute proper bootstrap confidence intervals. Write up the analysis honestly.
Part 1: Bootstrap CIs (60 min)¶
Bootstrap CI = "what would the mean look like if we re-sampled?" It is a non-parametric way to estimate uncertainty.
```python
import numpy as np

def bootstrap_ci(samples, n=10000, alpha=0.05):
    # Percentile bootstrap: resample with replacement, take means,
    # then read off the (alpha/2, 1 - alpha/2) percentiles.
    samples = np.array(samples)
    boot_means = [np.random.choice(samples, len(samples), replace=True).mean()
                  for _ in range(n)]
    return np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

baseline_accs = [0.842, 0.838, 0.851]
ci_low, ci_high = bootstrap_ci(baseline_accs)
print(f"baseline: 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```
Apply to all 3 variants. Plot a bar chart with error bars.
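One way to draw that chart, continuing from the snippet above (`bootstrap_ci` and `baseline_accs` are defined there; `augment_accs` and `dropout_accs` stand in for the per-seed accuracies you collected):

```python
import matplotlib.pyplot as plt
import numpy as np

results = {"baseline": baseline_accs, "augment": augment_accs, "dropout": dropout_accs}

names = list(results)
means = [np.mean(results[v]) for v in names]
# Asymmetric error bars: distance from the mean down to the CI low and up to the CI high.
cis = [bootstrap_ci(results[v]) for v in names]
yerr = np.array([[m - lo, hi - m] for m, (lo, hi) in zip(means, cis)]).T

plt.bar(names, means, yerr=yerr, capsize=6)
plt.ylabel("Validation accuracy")
plt.title("Ablation, 3 seeds: mean accuracy with 95% bootstrap CI")
plt.savefig("ablation_ci.png", dpi=150)
```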
Part 2: Write up the result honestly (60 min)¶
Add a results.md to your repo:
# Ablation: data augmentation and dropout on Fashion-MNIST
## Setup
- Architecture: ResNet18 fine-tuned
- Optimizer: SGD lr=0.01
- 3 seeds per variant
## Results (3 seeds, 95% bootstrap CI)
| Variant | Mean | 95% CI |
|---|---|---|
| Baseline | 0.842 | [0.838, 0.851] |
| + Augment | 0.861 | [0.852, 0.870] |
| + Dropout | 0.849 | [0.842, 0.858] |
## Conclusion
Augmentation produced a real lift (the baseline and augment CIs are disjoint, if only barely). Dropout's effect was within seed noise; we could not conclude it helped at this scale.
Key discipline: be willing to say "the data does not support a difference." This is what rigor looks like.
Part 3: Push + forward look (30 min)¶
- Push results to repo.
- Update `LEARNING_LOG.md` with one paragraph: "Why I now distrust 1-seed result claims."
- Read M02-W03.md.
Output of Session C¶
- `results.md` in repo with bootstrap CIs.
- Bar chart with error bars committed.
End-of-week artifact¶
- 9 W&B runs in a project
- `results.md` with bootstrap CIs
- Bar chart visualization
- Course week 3 notebook
End-of-week self-assessment¶
- I can write the seed-control boilerplate from memory.
- I can compute a bootstrap CI from scratch.
- I can read a paper and ask "did they report seed variance?"
Common failure modes for this week¶
- Running 1 seed and claiming improvement. This is the failure mode the week is designed against.
- Skipping bootstrap because t-tests feel cleaner. Bootstrap is more robust and doesn't assume normality.
- Hiding the negative result. Reporting "no effect" is more valuable than a fake positive.
What's next (preview of M02-W03)¶
Tabular ML and gradient-boosted trees. You'll train an XGBoost model and beat a simple neural net on tabular data: a maturity marker that separates real practitioners from hipsters.