
Deep Dive 03-Classical ML Rigor

The discipline LLM engineers keep skipping, and what it costs them


0. Why this chapter exists

If you are a backend or SRE engineer pivoting to applied AI, there is a tempting story you can tell yourself:

"Classical ML is for the previous decade. We just call LLMs now. I do not need to know about logistic regression, ROC curves, or Brier scores. I need to learn prompt engineering, retrieval, evals, and agents."

This story is false in a specific and dangerous way. It is false because every one of the things you actually do with LLMs in production is a classical-ML problem wearing a different coat.

Consider what your real workload looks like once an LLM feature ships:

  1. You build an LLM-as-judge to score model outputs at scale. That judge is a classifier. Possibly multi-class, possibly ordinal. Everything that classical ML knows about classifiers-calibration, precision/recall trade-offs, class imbalance, threshold tuning-applies to it.
  2. You compute calibration on that judge's confidence. A judge that says "9/10" but is actually right 60% of the time is worse than useless: it actively misranks systems. Calibration is a 1990s topic. It is also the central topic of LLM evaluation in 2026.
  3. You detect drift on incoming traffic-distribution shift in user prompts, in retrieved documents, in agent action sequences. Drift detection is classical statistics applied to features. The features happen to be embeddings now, but the math is unchanged.
  4. You A/B-test an LLM feature against a baseline. Sample-size formulas, multiple-comparisons corrections, and confidence intervals come straight out of frequentist hypothesis testing. The fact that the "treatment" is "GPT-class model with retrieval" does not change the statistics.
  5. You build honest baselines for new features. The right baseline for "smart semantic search" is BM25 plus a small reranker. The right baseline for "AI classification" is a logistic regression on embeddings. If you cannot build those baselines, you cannot defend your LLM feature against a skeptical engineering manager.

So this chapter is not nostalgia. It is the foundation under everything you will be paid to do as an AI engineer. The reader should leave able to: derive the loss functions they reach for; compute calibration error by hand on a worked example; defend an A/B test result with a confidence interval; and recognize when an LLM "win" is actually a baseline they forgot to run.

We will go in roughly this order: data discipline, loss derivations, regularization, bias-variance, cross-validation, calibration, evaluation metrics, class imbalance, the classifier zoo, feature engineering, the classical-to-LLM bridge, statistical testing, A/B testing, baselines, and finally six worked exercises.

The math is plain Unicode. Where derivation matters, we derive. Where there is a numerical example to ground a metric, we run it.


1. Train / val / test discipline

The single most consequential thing you do in any ML project-classical or LLM-era-is split your data correctly. Almost every published "win" that fails to reproduce comes back to a split error. Almost every production model that degraded faster than expected comes back to a split error.

Why three splits, not two

You need three disjoint sets:

  • Train-the model fits its parameters here.
  • Validation (dev)-you make modeling decisions here: hyperparameters, model class, prompt template, retrieval strategy, judge rubric.
  • Test-you touch this once, at the end, to report a number. If you tune anything based on test, the test set is contaminated and you have effectively turned it into a validation set.

The reason for three is straightforward: every time you select among options based on a metric, you are fitting to that metric's noise. If you select hyperparameters on the test set, your test number becomes optimistically biased by an amount proportional to how many things you compared. After picking among 50 prompts on the same eval set, the best one looks roughly σ·√(2·ln 50) better than its true mean by chance alone, where σ is the metric's per-eval standard deviation. The validation set absorbs that selection bias so the test set does not.

Typical splits: 60/20/20 or 70/15/15 with thousands of examples; 80/10/10 with tens of thousands; 90/5/5 or even 98/1/1 once you are at hundreds of thousands and the test set is statistically large enough to detect the lift you care about.

Stratification

When the label distribution is not roughly uniform, random splitting will give you splits whose label proportions differ. If 5% of your data is positive, a random 1,000-example test set will contain anywhere from roughly 36 to 64 positives (a ±2σ range around the expected 50), and that variance dominates your metric noise.

Stratified split: partition by label first, then split each stratum proportionally. The validation and test sets then have the same class balance as the population.
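
A minimal sketch of the three-way stratified split with scikit-learn-the dataset, the 60/20/20 ratios, and the 5% positive rate are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))             # stand-in features
y = (rng.random(1000) < 0.05).astype(int)  # 5% positives, as in the text

# Carve off 40% first, then split that into equal val/test halves,
# stratifying at each step so all three splits share the class balance.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)

print(y_train.mean(), y_val.mean(), y_test.mean())  # all ≈ 0.05
```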

For LLM-as-judge work, stratify by the thing you care about distinguishing. If you are evaluating refusal behavior, stratify by whether the prompt is harmful. If you are evaluating retrieval, stratify by query type.

Temporal splits

Random splitting is wrong whenever there is time-ordered structure and the deployed model will see future data. Two cases dominate:

  1. Recommender systems and any model whose features are user-history-derived. Random splitting causes future user behavior to leak into training features. The model learns to predict the past from the future. It looks fantastic in eval and degrades brutally in production. The fix: split by time-train on data before T₁, validate on (T₁, T₂], test on (T₂, T₃].
  2. LLM evaluation, especially when prompts come from real users. User behavior drifts. Topics come and go. If you randomly split a prompt log into train/test, the test set may contain prompts whose topic appears 30 times in train. A temporal split-first 80% by timestamp for train/dev, last 20% for test-is a more honest estimate of how the system will perform on tomorrow's traffic.

The leakage failure modes that will bite you

These are the patterns that cause "92% accuracy in eval, 71% in production":

  • Target leakage in features. A feature is computed using information that would not be available at prediction time. Classic example: "average past purchase value" computed including the current purchase. Subtle example for LLMs: retrieving from a corpus that includes the gold answer document.
  • Group leakage. Multiple rows from the same entity (user, document, conversation) split across train and test. The model memorizes the entity. Fix: split by group, not by row.
  • Duplicate leakage. Near-duplicates across splits-paraphrases, the same document with different timestamps, scraped pages with boilerplate text. With LLM data this is endemic. Use exact-hash, MinHash, or embedding-similarity dedup before splitting.
  • Pre-processing leakage. You fit a scaler, vocabulary, or imputer on the whole dataset before splitting. Now the test set has influenced the train set. Fix: fit pre-processing on train only, then transform val and test (see the sketch after this list).
  • Hyperparameter leakage on the test set. The most common one. You ran 200 prompt variants and reported the best one's test score. That score is biased upward.
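
Here is a sketch of the pre-processing fix, assuming scikit-learn and synthetic data: putting the scaler inside a Pipeline means it is refit on each training fold, so validation folds never influence it.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

# WRONG: StandardScaler().fit(X) on the full dataset before splitting lets
# test-set statistics leak into training. RIGHT: put the scaler inside the
# pipeline, so cross_val_score refits it on the training portion of each fold.
model = make_pipeline(StandardScaler(), LogisticRegression())
print(cross_val_score(model, X, y, cv=5).mean())
```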

For LLM-as-judge work specifically: leakage of the judge's training data into your eval set is real. If your eval prompts came from a public benchmark, the judge has seen them. Use private, recently created eval data for any number you care about defending.

This is why classical ML rigor matters before you do anything fancy: every advanced technique is built on top of these splits, and if the splits are wrong, no advanced technique can save the result.


2. Loss functions, derived

Loss functions are not arbitrary. Almost every loss you see in deep learning is the negative log-likelihood under some assumed noise model. Once you see this, you stop memorizing and start choosing.

2.1 MSE from Gaussian noise

Assume y = f(x; θ) + ε where ε ~ N(0, σ²). Then

p(y | x, θ) = (1 / √(2πσ²)) · exp( -(y - f(x; θ))² / (2σ²) )

The log-likelihood of the dataset is

log L(θ) = Σᵢ [ -½ log(2πσ²) - (yᵢ - f(xᵢ; θ))² / (2σ²) ]

Maximizing log-likelihood with respect to θ is equivalent to minimizing

L_MSE(θ) = (1/n) Σᵢ (yᵢ - f(xᵢ; θ))²

The factor 1/(2σ²) and the constant drop out because they do not depend on θ. So mean squared error is the maximum-likelihood loss when you believe noise is Gaussian with constant variance. If you have heteroscedastic noise, you should weight each squared error by 1/σᵢ²-that is exactly what weighted regression does.

2.2 MAE from Laplace noise

Assume ε ~ Laplace(0, b). The Laplace density is

p(y | x, θ) = (1/(2b)) · exp( -|y - f(x; θ)| / b )

Negative log-likelihood is, up to constants,

L_MAE(θ) = (1/n) Σᵢ |yᵢ - f(xᵢ; θ)|

So mean absolute error is the MLE under Laplace noise. The Laplace distribution has fatter tails than Gaussian, so MAE is robust to outliers: an outlier contributes linearly rather than quadratically. The price is non-differentiability at zero (use subgradients) and the fact that MAE optimizes for the conditional median, not the conditional mean.

Pick MAE when the noise model genuinely has heavy tails or when a single bad label should not pull the entire prediction. Pick MSE when noise is roughly symmetric and bounded and you want the conditional mean.

2.3 Cross-entropy from categorical MLE

For multiclass classification, model p(y = k | x; θ) = softmaxₖ(z(x; θ)) where z is the logit vector. The likelihood of one example with one-hot label y is

p(y | x, θ) = Πₖ p(y = k | x; θ)^{y_k}

Negative log-likelihood is

- log p(y | x, θ) = - Σₖ y_k · log p(y = k | x; θ)

Summed over the dataset and divided by n, this is the categorical cross-entropy:

L_CE(θ) = -(1/n) Σᵢ Σₖ y_{i,k} · log p_{i,k}

Equivalently, cross-entropy is the KL divergence between the empirical label distribution and the model distribution, plus the entropy of the empirical distribution (which is constant in θ):

KL(q || p) = Σ_k q_k · log(q_k / p_k) = Σ q_k log q_k - Σ q_k log p_k
                                       = -H(q) + CE(q, p)

So minimizing cross-entropy = minimizing KL divergence from the data to the model. This is the deep reason cross-entropy is the right loss for classification: it is the unique loss that is consistent with the categorical noise model, and it is a proper scoring rule-it is uniquely minimized when p matches the true label distribution.

2.4 Binary cross-entropy

The two-class special case. With p = σ(z) and label y ∈ {0, 1}:

L_BCE = -[ y · log p + (1 - y) · log(1 - p) ]

This is simply the multi-class cross-entropy with K = 2. Notice it penalizes a confident wrong answer extremely heavily: as p → 0 and y = 1, L → ∞. This is desired behavior-it says "do not be confidently wrong"-and it is also why a single label-flip in your training data can blow up the gradient.

2.5 Hinge loss

The classical SVM loss. For y ∈ {-1, +1} and decision function f(x):

L_hinge = max(0, 1 - y · f(x))

The intuition: as long as the example is correctly classified with margin ≥ 1, the loss is zero. Inside the margin, the loss grows linearly. Hinge is the loss that gives SVMs their large-margin geometry. It is non-probabilistic-there is no maximum-likelihood interpretation-and it is rarely the right choice once you want calibrated probabilities for downstream ranking. We mention it because you will see it in older codebases and because its margin idea reappears in contrastive learning losses.

The headline: the loss tells you what noise model you are committing to. Pick deliberately.
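
To make that concrete, a small numpy sketch evaluating the losses above on toy values (numbers are illustrative):

```python
import numpy as np

# MSE vs MAE on data with one outlier: quadratic vs linear penalty.
y_true = np.array([1.0, 2.0, 3.0, 100.0])   # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.0])
print(np.mean((y_true - y_pred) ** 2))      # ≈ 2352: the outlier dominates
print(np.mean(np.abs(y_true - y_pred)))     # ≈ 24.4: linear contribution

# BCE: a confidently wrong prediction (p = 0.01, y = 1) blows up the loss.
p = np.array([0.9, 0.6, 0.01])
y = np.array([1, 1, 1])
print(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))  # ≈ 1.74

# Hinge: zero loss once the margin y·f(x) reaches 1.
margins = np.array([2.0, 0.5, -1.0])
print(np.maximum(0, 1 - margins))           # [0.  0.5 2. ]
```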


3. Regularization

Regularization adds a penalty to the loss that biases the model toward simpler solutions. From a Bayesian perspective, regularization is a prior on parameters.

3.1 L2 (weight decay) as a Gaussian prior

With prior θ ~ N(0, τ²I), the log-prior is

log p(θ) = -‖θ‖² / (2τ²) + const

The maximum a posteriori (MAP) estimate maximizes log p(θ | data) = log p(data | θ) + log p(θ). With Gaussian-noise likelihood, this becomes

L(θ) = MSE(θ) + (1 / (2τ²·n)) · ‖θ‖²

which, letting λ = 1/(2τ²·n), is MSE + λ‖θ‖² (taking unit noise variance and absorbing constant factors into λ). So L2 regularization is MAP estimation under a Gaussian prior on weights. Smaller τ² (tighter prior) means larger λ (stronger pull to zero).

For deep networks, L2 has additional consequences: it bounds the Lipschitz constant of each layer, which improves stability and generalization in ways the MAP story alone does not capture.

3.2 L1 (lasso) as a Laplace prior

With prior θⱼ ~ Laplace(0, b), the log-prior is -‖θ‖₁ / b + const. MAP gives

L(θ) = MSE(θ) + λ · ‖θ‖₁

L1 has a key geometric property: the ‖θ‖₁ ball has corners on the axes, so the MAP solution often lands on a corner-meaning some θⱼ = 0 exactly. This is sparsity: L1 selects features. L2 shrinks coefficients but rarely drives them to zero.

3.3 Elastic net

A convex combination:

L = MSE + λ₁ · ‖θ‖₁ + λ₂ · ‖θ‖²

Useful when you want sparsity (L1) but also stable handling of correlated features (which L2 provides-pure L1 picks one of a correlated group arbitrarily).

3.4 Dropout as an ensemble approximation

Dropout randomly zeros each activation with probability p during training. The standard interpretation: at each training step you are training a different sub-network, and the deployed network at test time (with weights scaled by 1-p) approximates the geometric mean prediction over the exponential number of sub-networks.

The Bayesian interpretation (Gal & Ghahramani): dropout at inference time, run many times, gives Monte Carlo samples from an approximate posterior over weights. This is one source of uncertainty estimates for neural networks-and it is the conceptual cousin of bagging in random forests.

3.5 Early stopping as implicit regularization

Stop training when validation loss stops improving. Equivalent to constraining the parameter trajectory: you never get far from the initialization, so you never overfit. For linear models trained by gradient descent on MSE, early stopping corresponds closely to L2 regularization, with an effective λ set by the number of steps and the learning rate. For nonlinear models the correspondence is looser still, but the intuition holds: stopping early = staying simple.

3.6 Why AdamW's weight decay is not L2 in SGD (the AdamW paper insight)

In SGD with L2 regularization, the gradient step is

θ ← θ - η · (∇L(θ) + λ · θ)

The λ·θ term is part of the gradient and gets the same scaling as the data gradient. Now consider Adam: gradients are normalized by their running second moment v̂, so the effective step on the L2 term is η · λ·θ / √v̂, which means parameters with small v̂ (those that have not been updated much) get decayed more than parameters with large v̂. The decay strength becomes parameter-dependent in a way you did not ask for.

AdamW decouples the decay from the gradient:

θ ← θ - η · (Adam_update(∇L(θ))) - η · λ · θ

Now decay is applied directly, uniformly, after the adaptive update. This recovers the original "shrink toward zero" intent. The practical result reported by Loshchilov & Hutter is consistently better generalization-and this is why every modern transformer training script uses AdamW, not Adam plus L2.
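
A stripped-down sketch of the two updates for a single parameter-no momentum term or bias correction, which the full Adam/AdamW recipes include-just to show where the decay enters:

```python
import numpy as np

def adam_l2_step(theta, grad, v, lr=1e-3, lam=0.01, beta2=0.999, eps=1e-8):
    g = grad + lam * theta                   # decay folded into the gradient...
    v = beta2 * v + (1 - beta2) * g ** 2
    theta -= lr * g / (np.sqrt(v) + eps)     # ...so it gets divided by √v̂ too
    return theta, v

def adamw_step(theta, grad, v, lr=1e-3, lam=0.01, beta2=0.999, eps=1e-8):
    v = beta2 * v + (1 - beta2) * grad ** 2
    theta -= lr * grad / (np.sqrt(v) + eps)  # adaptive step on the data gradient
    theta -= lr * lam * theta                # decay applied separately, uniformly
    return theta, v

theta, v = 1.0, 0.0
theta, v = adamw_step(theta, grad=0.1, v=v)
```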

The lesson generalizes: optimizer choice and regularization interact in non-obvious ways. When in doubt, decouple.


4. Bias / variance

The bias-variance decomposition is a clean identity that explains why every model class has a sweet spot of capacity.

For a fixed test point x, with target y = f(x) + ε and prediction f̂(x; D) learned from a random dataset D, expected squared error decomposes as:

E_D,ε[ (y - f̂(x; D))² ]
   = ( E_D[f̂(x; D)] - f(x) )²        ← Bias²
   + E_D[ (f̂(x; D) - E_D[f̂(x; D)])² ] ← Variance
   + Var(ε)                              ← Irreducible error

  • Bias² measures how far the average model (across draws of training data) is from the truth. Increases when the model is too simple to represent f.
  • Variance measures how much the model fluctuates with the training data. Increases when the model is so flexible it fits noise.
  • Irreducible error is the noise floor: even the optimal model cannot do better than Var(ε).

Worked capacity example

Imagine fitting polynomials of degree d to 30 points sampled from y = sin(x) + ε with ε ~ N(0, 0.1²):

degree d    bias²    variance    total error
1           0.45     0.01        0.47
3           0.03     0.05        0.09
9           0.005    0.18        0.20
15          0.001    0.55        0.56

(These are stylized numbers, not from a specific paper, but the shape is robust.) The U-shape is the point: too little capacity → high bias; too much capacity → high variance; optimum somewhere in between. Cross-validation and learning curves are the diagnostic tools.
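
You can reproduce the shape of the table with a short simulation; a sketch under the same setup (30 points, σ = 0.1, 200 dataset draws), with exact numbers varying by seed:

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 2 * np.pi, 50)
f_true = np.sin(x_test)

def fit_and_predict(degree):
    x = rng.uniform(0, 2 * np.pi, 30)
    y = np.sin(x) + rng.normal(0, 0.1, 30)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for d in [1, 3, 9, 15]:
    preds = np.array([fit_and_predict(d) for _ in range(200)])
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # (E[f̂] - f)²
    var = np.mean(preds.var(axis=0))                      # E[(f̂ - E[f̂])²]
    print(d, round(bias2, 3), round(var, 3))
```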

Learning curves

Plot training loss and validation loss vs training set size (or vs training steps).

  • High bias (underfitting): both curves plateau at a high value, close to each other. More data does not help. Solution: more capacity, better features.
  • High variance (overfitting): training loss is low, validation loss is much higher. The gap is the variance. More data helps. Solution: more data, regularization, less capacity.

For LLMs the same picture holds, but the curves are usually drawn against compute or tokens rather than examples. The diagnostic question-"is the gap closing?"-is unchanged.

Modern caveat: double descent

For very over-parameterized models (the regime LLMs live in), the classical U-curve gets a second descent: error first rises as you cross the interpolation threshold, then falls again as capacity grows further. This does not invalidate bias-variance-it just says that in the over-parameterized regime, the variance term behaves non-monotonically because of the geometry of the loss landscape. For day-to-day classical-ML work, the U-curve picture is still the right mental model.


5. Cross-validation

The basic idea: when data is scarce, a single train/val split is too noisy. Use multiple splits and average.

k-fold CV

Partition data into k disjoint folds. For each fold i: train on the other k-1 folds, evaluate on fold i. Average the k metrics. Typical k is 5 or 10.

  • Variance reduction: the metric estimate has roughly 1/k the variance of a single split, at the cost of k times the training compute.
  • Bias: each model is trained on (k-1)/k of the data, so the metric slightly underestimates the performance of a model trained on all the data. Bigger k → less bias, more compute.

Stratified k-fold

Same as k-fold, but each fold preserves the class proportions. Mandatory for imbalanced classification. Use this whenever your label distribution is skewed.

Leave-one-out (LOOCV)

k = n. Each fold has one example. Maximally low bias. Maximally high variance and high compute. Useful only for very small datasets (n < 100) or when the model has a closed-form leave-one-out estimator (e.g., ridge regression has O(1) LOOCV via the hat matrix).

Group / time-series CV

When rows are not independent, vanilla k-fold leaks. Use:

  • GroupKFold: ensures all rows from the same group land in the same fold.
  • TimeSeriesSplit: each fold's training set is a prefix in time, validation set is the next chunk. Never includes future data in training.
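
A minimal sketch of both iterators; the group labels are illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])  # e.g., user IDs

for tr, va in GroupKFold(n_splits=3).split(X, groups=groups):
    assert set(groups[tr]).isdisjoint(groups[va])  # no group straddles folds

for tr, va in TimeSeriesSplit(n_splits=3).split(X):
    assert tr.max() < va.min()                     # training is always a prefix
```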

When CV beats a single val split

  • Small data (under ~10k examples).
  • High metric variance per split.
  • You need confidence in model selection, not just a point estimate.

When not to use CV: large datasets, long-training models (don't k-fold a foundation-model fine-tune; you cannot afford it), and any time you have a natural temporal split that you should respect anyway.


6. Calibration

This is the section that most directly carries to LLM-eval work. Read it twice.

What "calibrated" means

A classifier outputs a probability p for each prediction. The classifier is calibrated if, among all predictions with confidence ≈ p, the empirical accuracy is also ≈ p. Concretely: of all predictions with p ∈ [0.7, 0.8], about 75% should be correct.

A model can be highly accurate but miscalibrated, and a poorly accurate model can still be perfectly calibrated. They are orthogonal properties.

Reliability diagrams

Bin predictions by predicted probability (e.g., 10 equal-width bins on [0, 1]). For each bin, plot:

  • x-axis: average predicted probability in the bin.
  • y-axis: empirical accuracy in the bin (fraction correct).

A perfectly calibrated model has all bins on the y = x diagonal. Bins above the diagonal mean under-confidence (model says 60%, is right 80% of the time). Below means over-confidence (model says 90%, is right 70% of the time). Modern deep networks and LLMs are typically over-confident.

Expected Calibration Error (ECE)

The standard scalar summary. With M bins, n total predictions, B_m predictions in bin m, average confidence conf(B_m), and empirical accuracy acc(B_m):

ECE = Σ_{m=1..M} (|B_m| / n) · | acc(B_m) - conf(B_m) |

Lower is better. ECE = 0 means perfectly calibrated. Caveats: ECE depends on bin choice (equal-width vs equal-frequency), is biased downward for small samples, and is not a proper scoring rule. People still use it because it is intuitive. If you need a single number that combines calibration and accuracy, use Brier score or log loss (Section 8).
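
A minimal ECE implementation matching the formula above, with equal-width bins; conf holds predicted probabilities, correct the 0/1 outcomes:

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for i in range(n_bins):
        # Last bin is closed on the right so that conf = 1.0 is counted.
        hi_ok = conf <= edges[i + 1] if i == n_bins - 1 else conf < edges[i + 1]
        in_bin = (conf >= edges[i]) & hi_ok
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            total += in_bin.mean() * gap      # (|B_m| / n) · |acc - conf|
    return total
```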

Why LLM probabilities are typically miscalibrated

Three forces stack:

  1. Cross-entropy training over-confidently fits the training distribution. A network minimizing cross-entropy is rewarded for pushing probability to 0 or 1; the limiting solutions are over-confident on shifted distributions.
  2. RLHF and instruction tuning collapse uncertainty. A model trained to "give a confident, helpful answer" learns to express high subjective certainty even when it should not.
  3. The token-level probabilities are not the right calibration target. When you ask an LLM "rate this output 1-10," the produced number is a token sample from a heavily post-trained distribution; it is not a probability estimate of correctness in the classical sense.

The practical consequence: an LLM-as-judge that says "9/10" might be right anywhere from 60% to 95% of the time, and the mapping varies by domain. You must measure and recalibrate.

Calibration techniques

Temperature scaling. Train the base model normally. On a held-out calibration set, find a single scalar T > 0 that minimizes negative log-likelihood when logits are scaled by 1/T:

p_calibrated = softmax(z / T)

T > 1 spreads the distribution (corrects over-confidence). T < 1 sharpens (corrects under-confidence). One parameter, no change to accuracy (since argmax is preserved), and remarkably effective. This is the default for deep classifiers and the right default for LLM-as-judge confidence.

Platt scaling. Fit a logistic regression on top of the raw model score:

p_calibrated = σ(a · score + b)

Two parameters (a, b). Designed for SVMs. Works well when the miscalibration is approximately a sigmoid distortion. Less flexible than isotonic regression but more data-efficient.

Isotonic regression. Fit a non-decreasing piecewise-constant function from raw score to calibrated probability. Non-parametric, so it can correct any monotonic miscalibration, but needs more data-typically a few thousand calibration examples. Risk of overfitting when the calibration set is small.

When to pick which:

  • Small calibration set (~hundreds): temperature scaling if multi-class, Platt if binary.
  • Medium (~thousands): Platt for binary.
  • Large (~10k+): isotonic if you suspect non-sigmoidal miscalibration.
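
Both Platt ("sigmoid") and isotonic are available in scikit-learn's CalibratedClassifierCV; a sketch on synthetic data, with the forest and dataset as stand-ins:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for method in ["sigmoid", "isotonic"]:       # Platt scaling vs isotonic
    clf = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                 method=method, cv=5).fit(X_tr, y_tr)
    p = clf.predict_proba(X_te)[:, 1]
    print(method, round(brier_score_loss(y_te, p), 4))
```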

Why this matters for LLM-as-judge

Suppose you have two systems A and B and you score 1000 outputs from each with an LLM judge. The judge produces scores in {1..10}. You want to claim "B is better." Two failure modes:

  1. The judge is biased: it gives higher scores to longer outputs regardless of quality. You have measured length, not quality.
  2. The judge is miscalibrated: a score of 9 means "right 70% of the time," a score of 7 means "right 65% of the time," and the gap is well within sampling noise.

Without calibration, you cannot tell whether a 0.3-point average score lift is a real win or a recalibration of the judge's emotional tone. With calibration, you can convert each judge score to an actual probability of correctness, then aggregate, and report a defensible number.

This is also why a good evaluation setup includes a gold subset-a few hundred examples scored by humans-used purely to calibrate the judge. Without that gold subset, your "judge says A scores 8.4 and B scores 8.7" is a vibes-based number.


7. Evaluation metrics, derived

7.1 Confusion matrix

For binary classification with labels {0, 1}:

                  predicted=1   predicted=0
actual=1            TP            FN
actual=0            FP            TN

Almost every metric is a ratio of cells in this table.

7.2 Accuracy, precision, recall

accuracy  = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)        ← of the things I called positive, how many actually were?
recall    = TP / (TP + FN)        ← of the actual positives, how many did I catch?

The asymmetry: precision penalizes false positives; recall penalizes false negatives. Which matters depends on the cost structure. Spam filter: false positive (real mail in spam) is costly; you want high precision. Cancer screening: false negative (missed disease) is costly; you want high recall.

7.3 F1 and Fβ

F1 is the harmonic mean of precision and recall:

F1 = 2 · precision · recall / (precision + recall)

The harmonic mean punishes the weaker of the two, so F1 is high only when both are high. Fβ generalizes to weight recall β times more heavily:

F_β = (1 + β²) · precision · recall / (β² · precision + recall)

β = 1 → F1. β = 2 → recall weighted 4× as much as precision. β = 0.5 → precision weighted 4× as much as recall.

7.4 ROC curve, AUC

For each threshold τ on the score, compute:

TPR(τ) = TP / (TP + FN)        ← recall
FPR(τ) = FP / (FP + TN)        ← false positive rate

The ROC curve plots TPR vs FPR as τ sweeps from -∞ to +∞. The diagonal is the random baseline. The upper-left corner is perfect.

AUC (area under ROC curve) has a beautiful interpretation: it is the probability that a randomly chosen positive scores higher than a randomly chosen negative.

AUC = Pr(score(x_pos) > score(x_neg))

Equivalently, using the Mann-Whitney U statistic, AUC equals the average rank of positives in the combined sorted list, normalized appropriately. AUC is threshold-free and scale-invariant-it depends only on ranking, not raw scores.

Computation in O(n log n): sort all examples by score; AUC is the count of (positive, negative) pairs in correct order divided by the total such pairs.
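
A sketch of that rank-based computation (ties ignored for brevity):

```python
import numpy as np

def auc(scores, labels):
    scores, labels = np.asarray(scores), np.asarray(labels)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)  # ranks 1..n
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Mann-Whitney identity: correctly ordered (pos, neg) pairs / all pairs.
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))   # 0.75
```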

7.5 PR curve, AUPRC

Plot precision vs recall as the threshold sweeps. AUPRC is the area under this curve.

When PR beats ROC: imbalanced classes. With 0.1% positives, the negative class is so large that a model can emit ten false positives for every true positive and still show an FPR near 1%, so the ROC curve looks deceptively good while precision sits under 10%. The PR curve, by focusing on precision and recall-both tied to the rare positive class-does not have this problem. Default rule: under ~10% positive rate, prefer PR over ROC for headline numbers.

7.6 Log loss as an eval metric

Log loss is binary cross-entropy on the held-out set:

log_loss = -(1/n) Σ [ yᵢ log pᵢ + (1 - yᵢ) log(1 - pᵢ) ]

Properties:

  • Proper scoring rule: uniquely minimized at p = true probability of y = 1 | x.
  • Penalizes confident wrong predictions extremely heavily.
  • Sensitive to calibration: a perfectly accurate but miscalibrated model has higher log loss than the same model after temperature scaling.

When you want a single number that combines calibration and accuracy, log loss is the right pick.

7.7 Brier score

Brier = (1/n) Σ (pᵢ - yᵢ)²

Where pᵢ is the predicted probability of class 1 and yᵢ ∈ {0, 1}. Properties:

  • Proper scoring rule. Minimized at the true conditional distribution.
  • Bounded in [0, 1], unlike log loss.
  • More forgiving of confident wrong predictions than log loss (quadratic vs unbounded log).
  • Famously decomposable (the Murphy decomposition) into reliability − resolution + uncertainty-the reliability term is a direct measure of calibration.

For LLM-judge calibration work, Brier is often the better default than log loss because a single badly-calibrated example does not blow up the metric.


8. Class imbalance

When positives are rare (fraud, cancer, rare-event detection, "is this answer hallucinated"), naive training and naive metrics both mislead.

Why "97% accuracy" can be a trap

If 3% of examples are positive, predicting "negative" for every example yields 97% accuracy. The model has learned nothing. This is the canonical reason to never report accuracy as your only metric on imbalanced data.

Resampling

  • Random oversampling: duplicate positive examples until the class ratio is balanced. Risk: overfitting to those duplicates.
  • Random undersampling: drop negatives until balanced. Risk: throwing away signal.
  • SMOTE (Synthetic Minority Over-sampling Technique): for each minority example, pick a random nearest minority neighbor, generate a new synthetic example on the line segment between them. Reduces the duplicate-overfitting problem of plain oversampling. Works in feature space, so the synthetic examples need to be in a space where linear interpolation is meaningful (raw images: no; embeddings: usually yes).

Class-weighted loss

Re-weight the loss so each class contributes equally regardless of count:

L = -(1/n) Σ w_yᵢ · [ yᵢ log pᵢ + (1 - yᵢ) log(1 - pᵢ) ]

with w_1 = n / (2·n_1) and w_0 = n / (2·n_0), for example. Equivalent to oversampling in expectation, but with no duplicate-overfitting risk.

Threshold tuning vs threshold-free metrics

You can also leave training alone and pick a non-default threshold at inference. Train normally, then pick the threshold τ that maximizes F1 (or your preferred metric) on the validation set. This is often the simplest fix.
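
A sketch of the validation-set sweep; p_val stands in for your model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_val = (rng.random(2000) < 0.05).astype(int)                    # 5% positives
p_val = np.clip(0.4 * y_val + rng.normal(0.1, 0.1, 2000), 0, 1)  # toy scores

thresholds = np.linspace(0.01, 0.99, 99)
f1s = [f1_score(y_val, (p_val >= t).astype(int), zero_division=0)
       for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(best, round(max(f1s), 3))   # often far from the default 0.5
```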

Threshold-free metrics-AUC, AUPRC, log loss, Brier-sidestep the threshold issue entirely and are the right things to report on imbalanced data.

LLM-era version

The LLM-era class-imbalance problem is "rare, expensive failures": hallucination, refusal-leak, jailbreak success. Positives are rare. Random sampling of evals will undercount. The fix is the same: stratified sampling by failure type, active sampling of likely-positive examples for the eval set (e.g., scan production logs for outputs that an auxiliary classifier flags as suspicious), and PR-style metrics rather than accuracy.


9. The classifier zoo (operational depth)

9.1 Logistic regression-the baseline

The decision rule is p(y = 1 | x) = σ(wᵀx + b). Training minimizes binary cross-entropy. There is no closed-form solution, but the objective is convex in (w, b), so any reasonable solver finds the global optimum.

Why it is the baseline:

  • Linear in features, so interpretable (coefficient signs tell you what the model uses).
  • Fast to train, fast to score.
  • Calibrated by construction (when fit by NLL on a representative sample).
  • A surprising number of "AI features" are within a few percent of a logistic regression on good features-including, often, embeddings.

If you cannot beat a logistic regression on embeddings of your input, your fancy model is not earning its keep.

9.2 Random forests-bagging trees

A random forest is an ensemble of decision trees, each trained on:

  1. A bootstrap sample of the training data.
  2. A random subset of features at each split.

Predictions are averaged (regression) or voted (classification). The bagging averages out the high variance of individual deep trees; the random feature subsets decorrelate the trees so that the average actually helps.

Feature importance comes for free: average reduction in impurity per split, or permutation importance (shuffle a feature, measure accuracy drop). Permutation importance is more honest because impurity-based importance is biased toward high-cardinality features.

When to reach for RF:

  • Tabular data with mixed feature types.
  • You want a robust baseline with minimal tuning.
  • You need feature importance for explanation.

9.3 Gradient boosting-XGBoost, LightGBM, CatBoost

Unlike RF (parallel ensemble of full-depth trees), gradient boosting builds trees sequentially, each one fitting the residuals of the current ensemble. The objective is a Taylor expansion of the loss, with regularization on tree complexity.

Why it is still SOTA on tabular:

  • Sequential fitting captures interactions that single trees miss.
  • Strong regularization (depth limits, leaf weights, learning rate) controls overfitting tightly.
  • Engineered for speed: histogram-based splits (LightGBM), GPU support, sparse-aware splits.
  • Native handling of missing values and (CatBoost) categorical features.

Tuning matrix in rough order of importance: learning rate × num_estimators (joint), max_depth or num_leaves, subsample / colsample_bytree, regularization (lambda, alpha), min_child_weight.

9.4 When tree models beat neural nets

On most tabular datasets, gradient boosting outperforms tabular MLPs. The reasons (well-discussed in tabular-DL literature):

  • Trees handle heterogeneous, irregular feature distributions natively. NNs need careful normalization.
  • Trees are insensitive to monotonic transforms of features. NNs can be sensitive.
  • Trees handle categorical features without forcing them into a continuous space.
  • Tabular data is usually small (thousands to millions of rows). NNs need more data to beat the inductive biases of trees.

Where neural nets win: very large tabular datasets with sequential or relational structure (e.g., user-event sequences), and any data where representation learning matters (text, image, audio).

The implication for LLM engineers: when your problem reduces to "classify this structured context," think hard before reaching for an LLM. A LightGBM on engineered features is often cheaper, faster, more accurate, and easier to debug.


10. Feature engineering, briefly

The cliché says "deep learning made feature engineering obsolete." For text and images, that is largely true: a frozen embedding model captures what hand-crafted features used to. For tabular data, it remains crucial.

Where you still need it

  • Categorical encoding. One-hot, target encoding, hash encoding, embedding lookups. Target encoding (replace category with the mean target on training data) is powerful but leaks: do it inside cross-validation folds, never on the full dataset. The "out-of-fold target encoding" pattern is the leakage-free version.
  • Time features. Day of week, hour of day, time since last event, rolling means. Trees do not derive these on their own.
  • Interaction features. When you know two features matter jointly, multiply or concatenate them. Trees can learn this but more slowly.
  • Domain ratios. "transactions per day," "click rate this week vs all-time," "doc length normalized by query length."

Where LLMs absorb it

For text and increasingly for images, an embedding model encodes the input into a vector that subsumes most hand-crafted text features (length, n-grams, sentiment). You feed the embedding to a downstream classifier and the result usually beats hand-crafted features.

Where the line is

  • Pure text, pure image, pure audio: embeddings dominate. Skip feature engineering.
  • Tabular: feature engineering still wins.
  • Mixed (text fields plus categorical and numeric columns): hybrid wins. Embed the text, hand-engineer the rest, concatenate, gradient-boost.

The judgment call: how much signal lives in unstructured fields vs structured ones?


11. The classical → LLM bridge

Now we cash in.

LLM-as-judge IS a classifier

When you prompt an LLM to score "is this answer correct, 1-10," you have built a classifier. It has:

  • Inputs: the (prompt, answer, reference) tuple.
  • Outputs: a label or score.
  • A confusion matrix against gold labels.
  • Calibration, drift, class imbalance, threshold-tuning concerns.

Every section of this chapter applies. If you have not measured the judge's accuracy, precision, recall, calibration, and inter-rater agreement against humans, your eval pipeline is unverified.

Embeddings are features

An embedding e(x) ∈ ℝᵈ is a feature vector. Cosine similarity is a feature transform. Classical-ML rules apply:

  • Normalize before distance computation if you want cosine semantics.
  • Reduce dimensionality (UMAP, PCA) for visualization, never for distance-embeddings are designed for the original space.
  • Cluster (HDBSCAN, k-means) to find structure. Same caveats as classical clustering: pick k honestly, validate with held-out data.
  • Train downstream classifiers on embeddings as features. A logistic regression on top of an embedding is often the fastest, cheapest baseline classifier you can build.

RAG-as-classification

Retrieval-Augmented Generation reduces, at every step, to classification:

  • "Is this query in my knowledge base?" → classifier.
  • "Is this retrieved doc relevant to the query?" → reranker, which is a regression / classification.
  • "Did the answer ground in the retrieved evidence?" → classifier (groundedness judge).

Each of these is independently measurable, with precision/recall, calibration, and threshold tuning. A RAG system that has not measured retrieval recall@k, reranker AUC, and groundedness ECE is a black box you cannot debug.

Drift detection on embeddings

The classical drift methods-Kolmogorov-Smirnov on each feature, Population Stability Index, Maximum Mean Discrepancy-apply to embeddings. Practical recipe:

  1. Snapshot a baseline distribution of input embeddings during model launch.
  2. Daily, compute MMD or KS on each embedding dimension (or on principal components) between baseline and recent traffic.
  3. Alert when the metric exceeds a threshold tuned on historical baselines.

This is classical drift detection. The features are now learned, not engineered. The math has not changed since 2010.
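
A sketch of step 2 using per-dimension KS tests with a Bonferroni correction across dimensions; the embedding arrays are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.normal(size=(5000, 64))           # launch-time snapshot
recent = rng.normal(loc=0.15, size=(5000, 64))   # today's traffic, drifted

alpha = 0.01 / baseline.shape[1]                 # Bonferroni over 64 dims
drifted = [d for d in range(baseline.shape[1])
           if ks_2samp(baseline[:, d], recent[:, d]).pvalue < alpha]
print(f"{len(drifted)}/64 dimensions flag drift")
```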

LLM features are classical-ML features

Pull these threads together: everything you ship around an LLM is classical ML. The LLM is a complicated featurizer and a complicated decoder. The wrapper is classifiers, regressors, A/B tests, calibration, drift detection-the whole 1990s and 2000s curriculum, applied to richer features.


12. Statistical hypothesis testing for ML evaluation

When two models differ by 1% on a 1000-example test set, is it real?

The naive question

You compare model A (76% accuracy) and model B (77% accuracy) on n = 1000. Is B genuinely better?

The standard error of a proportion from n samples is √(p̂(1-p̂)/n). For p̂ = 0.77, n = 1000:

SE ≈ √(0.77 · 0.23 / 1000) ≈ √(0.000177) ≈ 0.0133

A 1% gap is well inside one standard error. It is not significant. You would need n on the order of 16·p(1-p) / Δ² ≈ 28,000 examples per model to detect a 1% absolute lift with 80% power (the rule of thumb derived in Section 13), and that is for independent samples. For paired comparisons (same examples scored by both models), see McNemar below-you can do better.

Bootstrap confidence intervals

The modern default. Procedure for CI of a metric M:

  1. Sample (with replacement) n examples from the test set.
  2. Compute M on the resample.
  3. Repeat B times (B = 1000 to 10,000).
  4. The 2.5th and 97.5th percentiles of the resampled metric are the 95% CI.

Pros:

  • Works for any metric-accuracy, F1, AUC, ECE, custom-without distributional assumptions.
  • Handles paired comparisons: bootstrap (A_score - B_score) directly to get a CI on the difference.

Cons:

  • O(B · n) compute. Trivial for tabular metrics; expensive for full LLM rollouts (so you bootstrap the scores, not the rollouts).
  • The bootstrap CI is asymptotically correct; for small samples and skewed metrics, BCa (bias-corrected accelerated) bootstrap is more accurate.

For day-to-day ML eval, the percentile bootstrap is the right default.
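
A sketch of the paired percentile bootstrap, taking one 0/1 correctness vector per model:

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_boot=10_000, seed=0):
    d = np.asarray(correct_b, float) - np.asarray(correct_a, float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(d), size=(n_boot, len(d)))   # resampled indices
    boot_means = d[idx].mean(axis=1)                       # one lift per resample
    return np.percentile(boot_means, [2.5, 97.5])          # 95% CI on the lift
```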

McNemar's test for paired comparisons

When the same examples are scored by both A and B, the right test is McNemar's. Build a 2x2 table:

                     B correct   B wrong
A correct              n_11        n_10
A wrong                n_01        n_00

The off-diagonal cells n_10 (A correct, B wrong) and n_01 (A wrong, B correct) are the disagreements. Under the null hypothesis that A and B have the same accuracy, those disagreements should split 50/50.

The test statistic (with continuity correction):

χ² = (|n_10 - n_01| - 1)² / (n_10 + n_01)

This is χ² with 1 degree of freedom; reject H₀ if χ² > 3.84 (p < 0.05).

McNemar is far more powerful than the unpaired comparison because most examples are scored the same way by both models, so the variance comes only from disagreements. For LLM A/B comparisons on a fixed eval set, this is the right test.

Multiple comparisons (p-hacking in ML)

You ran 50 prompt variants, picked the best, and reported its accuracy as "p < 0.01 vs baseline." This is wrong: the per-test α of 0.01 over 50 tests gives a family-wise probability of false discovery near 1 - 0.99⁵⁰ ≈ 39%.

Corrections:

  • Bonferroni: divide α by the number of tests. Conservative but bulletproof.
  • Holm: stepwise version of Bonferroni; less conservative.
  • Benjamini-Hochberg: controls False Discovery Rate (expected proportion of false positives among rejections), not family-wise error. Most useful when running many tests and willing to tolerate some false positives.

In LLM evaluation, the multiple-comparisons problem is endemic: every "let's try N prompts and report the best" is implicit p-hacking. The honest version: pick the prompt on a separate validation set, report only the test-set number of the chosen prompt. This is exactly the train/val/test discipline of Section 1, in statistical clothing.


13. A/B testing for LLM features

You shipped a feature behind a flag. 50% of users get treatment (LLM-powered), 50% control. After a week, treatment converts at 12%, control at 10%. Real or noise?

Sample size formula

For a binary metric with baseline rate p, detecting an absolute lift Δ with significance level α and power 1-β, the required sample size per arm is roughly:

n ≈ (z_{α/2} + z_β)² · 2 · p · (1-p) / Δ²

With α = 0.05 (z = 1.96) and power 0.8 (z_β = 0.84):

n ≈ (1.96 + 0.84)² · 2 · p · (1-p) / Δ²
  ≈ 7.85 · 2 · p · (1-p) / Δ²
  ≈ 15.7 · p · (1-p) / Δ²

Hence the rule of thumb n ≈ 16 · p(1-p) / Δ² per arm.

For p = 0.10 and Δ = 0.02 (relative lift of 20%, absolute lift of 2 percentage points):

n ≈ 16 · 0.10 · 0.90 / 0.0004 = 16 · 0.09 / 0.0004 = 3,600

So 3,600 per arm-7,200 total-to detect a 2-point lift with 80% power and 95% confidence.

If you push to detect a 1-point lift, n quadruples to ~14,400 per arm. This is why detecting small lifts requires big traffic, and why most "I tried it on 200 users and it looked great" claims are statistically invisible.
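
The exact formula behind the rule of thumb, as a small function (scipy supplies the normal quantiles):

```python
from scipy.stats import norm

def n_per_arm(p, delta, alpha=0.05, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)    # 1.96 for alpha = 0.05
    z_b = norm.ppf(power)            # 0.84 for power = 0.80
    return (z_a + z_b) ** 2 * 2 * p * (1 - p) / delta ** 2

print(round(n_per_arm(0.10, 0.02)))              # 3532; the 16· rule says 3,600
print(round(n_per_arm(0.10, 0.02, power=0.90)))  # ≈ 4,728
```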

Bayesian A/B testing

An alternative framing: model the conversion rate of each arm with a Beta distribution. Under a uniform Beta(1, 1) prior, the posterior is Beta(α, β) with α = 1 + conversions, β = 1 + non-conversions. Then:

  • P(treatment > control) is computable by Monte Carlo from the posteriors.
  • Stop when this probability exceeds, say, 95%.

Pros: continuous monitoring without inflating false-positive rate (if you have honest priors); intuitive output ("90% chance treatment is better"); decision-theoretic clarity (combine with cost/benefit to decide).

Cons: choice of prior matters (a flat Beta(1,1) is often fine but not always); the "stop when >95%" rule is not the same as a fixed-horizon test; you need to be clear about whether you are doing a Bayesian decision or smuggled-in early stopping.
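
A minimal Monte Carlo sketch, reusing the 12% vs 10% example above at an assumed 1,000 users per arm:

```python
import numpy as np

rng = np.random.default_rng(0)
post_control = rng.beta(1 + 100, 1 + 900, size=100_000)    # 100/1000 convert
post_treatment = rng.beta(1 + 120, 1 + 880, size=100_000)  # 120/1000 convert
print((post_treatment > post_control).mean())  # ≈ 0.92: suggestive, below 95%
```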

Sequential testing pitfalls

Stopping early when significance is reached is the classic sin. If you peek at the test every day for 14 days and stop at the first significant day, the family-wise error rate is far higher than the nominal 5%.

Two safe options:

  1. Pre-commit a fixed sample size based on the formula above and run to completion. Boring, but correct.
  2. Sequential probability ratio tests (SPRT) or always-valid p-values (Howard et al.): designed specifically to allow continuous monitoring with controlled error rates. Cost: you need more total samples than a fixed-horizon test if the truth is truly null.

For LLM features specifically: be aware that the metric you A/B test on (engagement, conversion, retention) may not be the metric you care about (quality, helpfulness, hallucination rate). You almost always need both: an offline eval against gold labels for quality, and an online A/B test for behavior. Either alone is misleading.


14. The honest baseline anti-pattern

This is the section your engineering manager wishes everyone read.

Every claim of the form "our LLM feature improves X" should be tested against, at minimum:

  • Random. The trivial baseline. Astonishingly often, "AI feature" beats random by less than people expect.
  • A heuristic. Hand-coded rules from a domain expert. Frequently within a few percent of the LLM.
  • A keyword/BM25 baseline (for retrieval). BM25 dates to the mid-1990s, with roots in 1970s probabilistic retrieval, and beats many "semantic search" launches.
  • A linear classifier on embeddings. Logistic regression on top of a frozen embedding model. Cheap, fast, calibrated, often within a percent of a fine-tuned LLM.
  • A small fine-tuned encoder. A DistilBERT or similar fine-tuned on your task. The right baseline for "we used an LLM for classification."

Common pattern that disappears under proper baselines:

"We replaced our regex with GPT-class extraction; F1 improved from 0.78 to 0.84."

Then you run BM25 + a small reranker and get 0.83. The "AI win" was 1 point of F1 at 100x the cost. Sometimes the LLM is genuinely worth it; sometimes the regex was just due for a tune-up. You only know which by running the baselines.

The honest engineer's checklist before shipping an LLM feature:

  1. Does the LLM beat random by enough to matter?
  2. Does it beat a hand-coded heuristic written in an afternoon?
  3. Does it beat BM25 (for search) or logistic-regression-on-embeddings (for classification)?
  4. Does it beat a small fine-tuned encoder?
  5. Does the lift survive bootstrap CI on a held-out test set?
  6. Does the lift survive a real A/B test with sufficient sample size?
  7. Is the inference cost defensible at deployed scale?

If the answer to any of (1)-(6) is "I haven't measured," the feature is not ready to ship. If (7) is "no," the feature is not ready to scale.

This is what classical ML rigor produces: the discipline to ask these questions before shipping, not after the next quarterly review.


15. Practical exercises (worked)

Exercise 1-F1 and F2 from precision/recall

Given precision = 0.8, recall = 0.5.

F1 = 2 · 0.8 · 0.5 / (0.8 + 0.5) = 0.8 / 1.3 = 0.6154
F2 = (1 + 4) · 0.8 · 0.5 / (4 · 0.8 + 0.5) = 5 · 0.4 / 3.7 = 2.0 / 3.7 = 0.5405
F0.5 = (1 + 0.25) · 0.8 · 0.5 / (0.25 · 0.8 + 0.5) = 1.25 · 0.4 / 0.7 = 0.5 / 0.7 = 0.7143

Reading: F1 = 0.615 (balanced view); F2 = 0.541 (recall-weighted, lower because recall is weak); F0.5 = 0.714 (precision-weighted, higher because precision is strong). The metric you pick communicates a value judgment.

Exercise 2-ECE on a small example

100 predictions, 5 equal-width bins on [0, 1].

bin   range        count   avg confidence   empirical accuracy   gap    weighted gap
1     [0.0, 0.2)   10      0.10             0.20                 0.10   (10/100)·0.10 = 0.010
2     [0.2, 0.4)   20      0.30             0.35                 0.05   (20/100)·0.05 = 0.010
3     [0.4, 0.6)   30      0.50             0.40                 0.10   (30/100)·0.10 = 0.030
4     [0.6, 0.8)   25      0.70             0.60                 0.10   (25/100)·0.10 = 0.025
5     [0.8, 1.0]   15      0.90             0.73                 0.17   (15/100)·0.17 = 0.0255

ECE = 0.010 + 0.010 + 0.030 + 0.025 + 0.0255 = 0.1005

Reading: ECE ≈ 0.10. The model is materially miscalibrated, especially in the upper bins where it claims 0.7-0.9 confidence but is right only 60-73% of the time. This is the over-confident pattern typical of deep classifiers and LLM judges. Temperature scaling would compress the logits and likely cut ECE roughly in half.

Exercise 3-MLE for logistic regression on a 2-point dataset

Data: (x₁ = -1, y₁ = 0), (x₂ = +1, y₂ = 1). Model: p(y=1 | x) = σ(w·x + b).

Likelihood:

L(w, b) = (1 - σ(-w + b)) · σ(w + b)
       = σ(w - b) · σ(w + b)              [using 1 - σ(z) = σ(-z)]

Negative log-likelihood:

ℓ(w, b) = -log σ(w - b) - log σ(w + b)
       = log(1 + e^{-(w-b)}) + log(1 + e^{-(w+b)})

Gradients:

∂ℓ/∂w = -σ(-(w - b)) - σ(-(w + b)) = -(1 - σ(w-b)) - (1 - σ(w+b))
       = σ(w-b) + σ(w+b) - 2
∂ℓ/∂b = (1 - σ(w-b)) - (1 - σ(w+b)) = σ(w+b) - σ(w-b)

Setting ∂ℓ/∂b = 0 gives σ(w + b) = σ(w - b), hence b = 0.

Setting ∂ℓ/∂w = 0 with b = 0 gives 2σ(w) = 2, so σ(w) = 1, which requires w → ∞.

The MLE diverges. The data is linearly separable, so the likelihood has no finite maximum: with b = 0 the likelihood is σ(w)², and pushing w → ∞ drives it toward 1 without ever attaining it.

Lesson: linearly separable data + logistic regression + no regularization = unbounded weights. This is exactly why L2 regularization is non-optional in practice. With penalty λw², the regularized objective has a finite minimum at some finite w that depends on λ.

Exercise 4-Bootstrap CI for accuracy difference

Two models, each evaluated on the same 200 examples. Model A correct on 158, model B correct on 165.

Naive: A = 0.79, B = 0.825, lift = 0.035. Significant?

Paired bootstrap procedure:

  1. Build a length-200 vector d where dᵢ = 1[B correct on i] - 1[A correct on i]. So dᵢ ∈ {-1, 0, +1}. The mean of d is 0.035.
  2. Resample (with replacement) 200 indices, compute the mean of d on the resample, store. Repeat B = 10,000 times.
  3. The 2.5th and 97.5th percentiles of the resampled means are the 95% CI on the lift.

Approximate analytic answer (for sanity): Var(d) = E[d²] - (E[d])². With around 30 disagreements (rough estimate from the marginals), E[d²] ≈ 30/200 = 0.15, and (E[d])² ≈ 0.001, so Var(d) ≈ 0.149, and SE(mean d) ≈ √(0.149/200) ≈ 0.0273.

So 95% CI ≈ 0.035 ± 1.96 · 0.0273 = 0.035 ± 0.0535 = [-0.018, +0.089].

The CI includes zero. The lift is not significant at the 95% level given this sample size. To call it real, you would need either more data or McNemar on the disagreement pattern (which uses the same information more efficiently). McNemar with n_10 = 11, n_01 = 18 (consistent with the ~30 disagreements assumed above) gives χ² = (|18 - 11| - 1)² / 29 = 36/29 ≈ 1.24-still not significant.

This is the discipline: a 3.5-point lift on n = 200 is noise.

Exercise 5-Temperature scaling

You have a deep classifier on 4 classes. On a held-out calibration set, you observe over-confidence. Logits z ∈ ℝ⁴. Calibrate by finding a single scalar T > 0 that minimizes negative log-likelihood:

T* = argmin_T  -(1/n) Σᵢ log p̂_{i, yᵢ}(T)
       where  p̂_{i, k}(T) = exp(z_{i,k} / T) / Σⱼ exp(z_{i,j} / T)

The gradient with respect to T (chain rule on softmax):

d/dT log p̂_{i, yᵢ} = (1/T²) · ( Σ_k p̂_{i, k}(T) · z_{i, k} - z_{i, yᵢ} )
                  = (1/T²) · ( E_p̂[z_{i, ·}] - z_{i, yᵢ} )

So the loss gradient is

∂L/∂T = (1/n) Σᵢ (1/T²) · (z_{i, yᵢ} - E_p̂[z_i])

This is a one-dimensional problem, convex in the inverse temperature 1/T. Solve it with bisection on the interval [0.5, 5] (a wide search range that brackets typical answers). 20 bisection steps gets you T* to four decimal places. No retraining of the base model needed. Argmax is preserved, so accuracy is unchanged. Calibration improves dramatically when the only miscalibration is over-confidence-which it usually is.

Concrete numerical example: a single example with logits z = (4, 2, 1, 0) and true class 0.

  • T = 1: p̂ = softmax(4,2,1,0) ≈ (0.831, 0.112, 0.041, 0.015). Confidence on class 0: 0.831.
  • T = 2: scaled logits (2, 1, 0.5, 0); softmax ≈ (0.579, 0.213, 0.129, 0.078)-still correct, less over-confident.
  • T = 4: scaled logits (1, 0.5, 0.25, 0); softmax ≈ (0.409, 0.248, 0.193, 0.150)-much flatter.

You pick the T that on the calibration set as a whole minimizes NLL. Anywhere from 1.3 to 2.5 is typical for deep classifiers.
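
A sketch of the fit, minimizing NLL(T) with a bounded scalar optimizer rather than hand-rolled bisection; the logits and labels are synthetic stand-ins for a real calibration set:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 4)) * 3        # deliberately over-sharp logits
labels = np.where(rng.random(1000) < 0.7,      # model is right ~70% of the time
                  logits.argmax(axis=1),
                  rng.integers(0, 4, size=1000))

def nll(T):
    z = logits / T
    z -= z.max(axis=1, keepdims=True)          # for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

res = minimize_scalar(nll, bounds=(0.5, 5.0), method="bounded")
print(round(res.x, 3))   # fitted T* > 1 when the model is over-confident
```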

Exercise 6-Sample size for a 2% lift

Baseline conversion rate p = 0.10. We want to detect Δ = 0.02 (so treatment rate p + Δ = 0.12) at α = 0.05 (two-sided), power = 0.80.

n_per_arm ≈ 16 · p(1-p) / Δ²
         = 16 · 0.10 · 0.90 / (0.02)²
         = 16 · 0.09 / 0.0004
         = 1.44 / 0.0004
         = 3,600

So 3,600 per arm, 7,200 total. At a more conservative power = 0.90 (z_β = 1.28):

n_per_arm ≈ (1.96 + 1.28)² · 2 · p(1-p) / Δ²
         = 10.5 · 0.18 / 0.0004
         ≈ 4,725

So roughly 4,700 per arm for 90% power.

Key sanity check: if traffic to the feature is 500 users/day, you need 7,200 / 500 = ~15 days at minimum to read the test, and you must not peek-and-stop earlier. If traffic is 50/day, you need 144 days, and the right move is probably to either (a) increase traffic to the test, (b) use a more sensitive metric, or (c) reduce the question to an offline eval that needs less data.

This is the most-skipped calculation in product-LLM work. Do it before you launch the experiment, not after.


16. Closing-what you take away

If you have absorbed this chapter, you should now be able to:

  • Diagnose splits. Spot the leakage modes, defend the choice between random and temporal splitting, and explain why three splits beat two.
  • Choose loss functions deliberately, knowing each is an MLE under a particular noise model.
  • Pick regularization consistent with your prior beliefs, and explain why AdamW exists.
  • Read learning curves and tell a high-bias problem from a high-variance one.
  • Compute calibration by hand: build a reliability diagram, calculate ECE, recommend temperature scaling vs Platt vs isotonic.
  • Choose evaluation metrics that match the cost structure, the class balance, and whether you care about ranking, threshold-tuned decisions, or calibrated probabilities.
  • Handle imbalance by combining stratified splits, weighted loss, threshold tuning, and threshold-free metrics.
  • Reach for the right baseline-logistic regression, random forest, gradient boosting-when the LLM is overkill or when you need a defensible reference point.
  • Connect classical and LLM work: judges as classifiers, embeddings as features, RAG as classification, drift as feature-distribution monitoring.
  • Run statistical tests: bootstrap CIs as the default, McNemar for paired comparisons, multiple-comparisons corrections when running many variants.
  • Design A/B tests with honest sample sizes and refuse to peek-and-stop.
  • Demand baselines before believing any "LLM win."

The thread running through every section: LLMs do not replace classical ML rigor. They demand more of it. The new models are richer, the wrappers around them are larger, and the ways they can fail silently are more numerous. The discipline that produced trustworthy spam filters in 2005 is the discipline that produces trustworthy LLM features in 2026-applied to richer features, evaluated against richer baselines, calibrated on richer judge signals.

If you skipped classical ML to start with LLMs, this is the chapter to come back to before each major launch. The math is not new; it is non-negotiable.
