# 06 – Classical ML
## Why this matters in the journey
You can do modern AI without classical ML, but you can't do it well. The discipline of train/val/test splits, baseline-first thinking, regularization, the bias-variance tradeoff, ROC curves, and feature engineering shows up again in eval design, fine-tuning data curation, and reading the literature. Skipping classical ML is the most common reason "AI engineers" plateau as prompt tinkerers.
## The rungs
### Rung 01 – Supervised learning framing
- What: Given `(x, y)` pairs, learn `f(x) → y`. Classification vs regression. Train/val/test split.
- Why it earns its place: This is the framing under every model, including LLMs (next-token prediction is supervised learning on shifted text).
- Resource: Andrew Ng's ML Specialization Course 1 (Coursera) or fast.ai lessons 1–2.
- Done when: You can explain the difference between train, val, and test, and why each exists.
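The three-way split can be sketched in a few lines of NumPy (sizes and seed here are illustrative, not prescriptive):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 examples, 3 features
y = rng.integers(0, 2, size=100)     # binary labels

# Shuffle once, then carve out 70% train, 15% val, 15% test.
idx = rng.permutation(len(X))
train, val, test = idx[:70], idx[70:85], idx[85:]

X_train, y_train = X[train], y[train]   # fit parameters here
X_val, y_val = X[val], y[val]           # tune hyperparameters here
X_test, y_test = X[test], y[test]       # touch only once, at the very end
```

The point of the third split is that anything you tune against leaks into your estimate; the test set stays untouched so its score is an honest forecast.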
### Rung 02 – Linear regression
- What: Fit a line (or hyperplane) `y = wᵀx + b` minimizing mean squared error.
- Why it earns its place: The simplest possible model; everything else is a generalization of it. It's also the perfect exercise for tying together calculus, linear algebra, and code.
- Resource: Implement from scratch in NumPy. Reference: Andrew Ng course 1, week 1.
- Done when: You can derive the closed-form solution `w = (XᵀX)⁻¹Xᵀy` AND implement gradient-descent linear regression.
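Both halves of that "done when" can be checked against each other; a minimal NumPy sketch (synthetic data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.01 * rng.normal(size=200)

# Absorb the bias b into w by appending a column of ones.
Xb = np.hstack([X, np.ones((200, 1))])

# Closed form: w = (XᵀX)⁻¹Xᵀy  (solve() is more stable than an explicit inverse)
w_closed = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Gradient descent on MSE: gradient = (2/n) Xᵀ(Xw − y)
w = np.zeros(3)
for _ in range(2000):
    w -= 0.1 * (2 / len(y)) * Xb.T @ (Xb @ w - y)
```

If the two solutions don't agree to several decimal places, one of your derivations is wrong.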
### Rung 03 – Logistic regression and the sigmoid
- What: A linear model squashed through a sigmoid for binary classification, trained with cross-entropy.
- Why it earns its place: First encounter with cross-entropy as MLE. The "neuron" you'll see in NN intros is a logistic regression.
- Resource: Andrew Ng course 1, week 3. Implement from scratch.
- Done when: You can derive the gradient of the binary cross-entropy w.r.t. weights.
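A reliable way to confirm your derivation is a finite-difference check of the analytic gradient; a minimal sketch with illustrative synthetic data:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(w, X, y):
    # Mean binary cross-entropy
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # Analytic gradient of mean BCE: Xᵀ(σ(Xw) − y) / n
    return X.T @ (sigmoid(X @ w) - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)

# Central finite differences along each coordinate axis
eps = 1e-6
num = np.array([(bce(w + eps * e, X, y) - bce(w - eps * e, X, y)) / (2 * eps)
                for e in np.eye(3)])
```

The same gradient-checking habit pays off later when you hand-derive backprop.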
### Rung 04 – Softmax regression (multiclass)
- What: Logistic regression generalized to K classes. Softmax converts logits to a probability distribution.
- Why it earns its place: Softmax is the output of every transformer LM. Cross-entropy + softmax is the LLM training objective.
- Resource: Andrew Ng + the "softmax + cross-entropy" derivation in Goodfellow chapter 6.
- Done when: You can implement softmax + cross-entropy in NumPy and explain why their combined gradient is `softmax(z) − y`.
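A quick numerical check of that combined gradient, with illustrative values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y_onehot):
    # CE of the softmax distribution against a one-hot target
    return -np.sum(y_onehot * np.log(softmax(z)))

z = np.array([2.0, 1.0, -1.0])   # logits
y = np.array([0.0, 1.0, 0.0])    # true class is index 1

analytic = softmax(z) - y        # the combined gradient w.r.t. the logits

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([(cross_entropy(z + eps * e, y) - cross_entropy(z - eps * e, y)) / (2 * eps)
                    for e in np.eye(3)])
```

The simplicity of `softmax(z) − y` is why the two operations are always fused in practice.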
### Rung 05 – Bias-variance tradeoff and overfitting
- What: A model can fail by being too simple (high bias) or too complex (high variance). Train vs val gap diagnoses which.
- Why it earns its place: Diagnosing training is 80% of ML practice. Eval gap analysis is the same skill applied to LLM evals.
- Resource: Andrew Ng's ML diagnostic lectures. Plus The Hundred-Page Machine Learning Book (Burkov) chapter 5.
- Done when: Given a learning curve, you can diagnose under- vs over-fitting.
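A toy version of the diagnosis, assuming polynomial regression on noisy data (degrees, sample sizes, and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
y = np.sin(3 * x) + 0.1 * rng.normal(size=40)
x_tr, y_tr, x_va, y_va = x[:30], y[:30], x[30:], y[30:]

def train_val_mse(deg):
    # Least-squares polynomial fit of the given degree on the training split
    p = np.polynomial.Polynomial.fit(x_tr, y_tr, deg)
    return np.mean((p(x_tr) - y_tr) ** 2), np.mean((p(x_va) - y_va) ** 2)

tr1, va1 = train_val_mse(1)     # underfit: both errors high, small gap
tr15, va15 = train_val_mse(15)  # overfit: tiny train error, larger val error
```

High bias shows up as both errors plateauing high; high variance shows up as the train/val gap opening wide.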
### Rung 06 – Regularization (L1, L2, dropout, early stopping)
- What: Techniques that limit effective model capacity to prevent overfitting: penalties on weights (L1/L2), randomly zeroing activations (dropout), and halting when validation loss stops improving (early stopping).
- Why it earns its place: Every transformer training run uses weight decay (equivalent to L2 under plain SGD, a decoupled cousin in AdamW). Dropout is in many architectures. Early stopping is a default discipline.
- Resource: Deep Learning (Goodfellow) chapter 7.
- Done when: You can explain why L2 regularization is equivalent to a Gaussian prior on weights.
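The Gaussian-prior equivalence in that "done when" can be sketched as a MAP derivation:

```latex
% MAP estimation with a Gaussian prior w ~ N(0, sigma^2 I)
\hat{w}_{\mathrm{MAP}}
  = \arg\max_w \; \log p(\mathcal{D} \mid w) + \log p(w)
  = \arg\min_w \; \mathrm{NLL}(w) + \frac{1}{2\sigma^2} \lVert w \rVert_2^2 + \text{const}
```

So minimizing NLL plus an L2 penalty with λ = 1/(2σ²) is exactly MAP under that prior: stronger regularization corresponds to a tighter prior around zero.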
### Rung 07 – Decision trees, random forests, gradient boosting
- What: Tree-based models, frequently still state of the art on tabular data.
- Why it earns its place: You'll be tempted to use neural nets for everything. Knowing when XGBoost is the right answer is a maturity marker.
- Resource: Andrew Ng course 2 + the XGBoost paper (arxiv.org/abs/1603.02754) skim.
- Done when: You can train an XGBoost model on a tabular dataset and beat a simple neural net on it.
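A minimal sketch of the workflow, using scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (same gradient-boosted-trees family, no extra install; dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                   learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

Swapping in `xgboost.XGBClassifier` is a near drop-in change once you're comparing against a neural baseline.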
### Rung 08 – Evaluation metrics
- What: Accuracy, precision, recall, F1, AUC, log loss, calibration. Each measures something different.
- Why it earns its place: Picking the wrong metric is how teams optimize for the wrong thing. LLM evals are metric design problems in disguise.
- Resource: scikit-learn metrics docs. Plus The Hundred-Page ML Book eval chapter.
- Done when: Given a class-imbalanced problem, you can choose an appropriate metric and justify it.
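The classic imbalanced-data trap, sketched with scikit-learn's metric functions (the 95/5 class ratio is illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 95 negatives, 5 positives; a degenerate "always predict negative" model
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)                 # 0.95: looks great
rec = recall_score(y_true, y_pred, zero_division=0)  # 0.0: finds no positives
f1 = f1_score(y_true, y_pred, zero_division=0)       # 0.0: useless model exposed
```

Accuracy rewards the majority class; recall and F1 immediately expose that the model never catches a positive.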
### Rung 09 – Cross-validation and statistical comparison
- What: k-fold CV for stable estimates. Paired t-tests or bootstrap to compare models meaningfully.
- Why it earns its place: Comparing two LLM prompts on 50 examples and declaring a winner is how teams ship noise. CV discipline transfers directly.
- Resource: scikit-learn cross-validation docs. Plus Sebastian Raschka's "Model Evaluation, Model Selection, and Algorithm Selection" paper (arxiv.org/abs/1811.12808).
- Done when: You can defend a model comparison with proper variance estimates.
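A sketch of the matched-folds comparison, assuming scikit-learn and SciPy (the two models and fold count are illustrative):

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # identical folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired t-test over per-fold score differences (folds are matched, hence "paired")
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
```

Reusing the same folds for both models is what makes the test paired; comparing means from unmatched folds throws away variance information.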
### Rung 10 – Feature engineering and data quality
- What: Cleaning, normalizing, encoding categoricals, dealing with missing values, leakage.
- Why it earns its place: "Better data > better model" applies at every level, including LLM fine-tuning datasets and RAG corpora.
- Resource: Kaggle's "Intermediate Machine Learning" course. Plus the AI Engineering (Huyen) chapter on data.
- Done when: You can identify a data leakage bug in a contrived dataset.
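A contrived leakage bug of exactly the kind that "done when" asks for, sketched on an illustrative synthetic dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(int)  # genuinely noisy target

# Leaky feature: computed FROM the label, e.g. a post-outcome column
# that accidentally slipped into the training table.
leak = y + 0.01 * rng.normal(size=n)
X_leaky = np.hstack([X, leak[:, None]])

def holdout_acc(features):
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

honest = holdout_acc(X)        # modest accuracy: the task is genuinely hard
leaky = holdout_acc(X_leaky)   # near-perfect: a red flag, not a breakthrough
```

The diagnostic instinct to build: a score that looks too good usually means a feature is carrying information that won't exist at prediction time.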
## Minimum required to leave this sequence
- Implement linear, logistic, and softmax regression from scratch.
- Diagnose under- vs overfitting from a learning curve.
- Train an XGBoost model on a tabular dataset.
- Pick a metric appropriate to a problem and defend the choice.
- Set up k-fold CV with a statistical comparison.
## Going further
- The Elements of Statistical Learning (Hastie, Tibshirani, Friedman; free PDF): the canonical reference. Hard but rewarding.
- Pattern Recognition and Machine Learning (Bishop): older but still gold.
- Hands-On Machine Learning (Géron): practical, sklearn + Keras.
## How this sequence connects to the year
- Month 2: rungs 01–06 are the build-from-scratch month.
- Month 6: rungs 08–09 are the foundation for LLM eval rigor.
- Month 8: rung 10 (data quality) is the make-or-break of fine-tuning.