# 03: Probability & Statistics
## Why this matters in the journey
Machine learning is applied probability. A model is a probability distribution p(y | x) you fit to data. Cross-entropy is a likelihood. Sampling from an LLM is sampling from a distribution over tokens. Every eval metric (precision, recall, AUC, perplexity, accuracy with confidence intervals) is statistics. You need probabilistic intuition, not measure theory.
## The rungs
### Rung 01: Sample space, events, probability axioms
- What: A probability is a number in `[0, 1]` assigned to events in a sample space. P(A or B) = P(A) + P(B) − P(A and B).
- Why it earns its place: You can't reason about anything below without this floor. Stats jargon assumes it.
- Resource: Khan Academy "Statistics and probability" intro; or Introduction to Probability (Blitzstein's Stat 110; lectures free on YouTube, search "Stat 110").
- Done when: You can compute the probability of "at least one head in 3 flips" without confusion.
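The done-when exercise can be sanity-checked by brute force; a quick Python sketch, enumerating the sample space and comparing against the complement rule:

```python
from itertools import product

# Enumerate the sample space of 3 fair coin flips, count outcomes
# with at least one head, and compare with 1 - P(no heads).
outcomes = list(product("HT", repeat=3))          # 8 equally likely outcomes
at_least_one_head = sum(1 for o in outcomes if "H" in o) / len(outcomes)

print(at_least_one_head)   # 7/8 = 0.875
print(1 - 0.5 ** 3)        # same answer via the complement rule
```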
### Rung 02: Conditional probability and Bayes' rule
- What: `P(A|B) = P(A∩B)/P(B)`, and Bayes: `P(A|B) = P(B|A)P(A)/P(B)`.
- Why it earns its place: Naive Bayes, language modeling (`P(word | context)`), and most generative-model thinking is Bayesian. Posterior reasoning is fundamental.
- Resource: 3Blue1Brown's "Bayes' theorem" video. Plus Stat 110 lectures 3–5.
- Done when: You can solve the "disease test with 1% prior, 99% sensitivity, 95% specificity" problem and explain why the result is counterintuitive.
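The disease-test problem above works out in a few lines; a minimal sketch using the total-probability rule and Bayes' rule:

```python
# Posterior for the classic screening problem: 1% prior prevalence,
# 99% sensitivity P(+ | disease), 95% specificity P(- | no disease).
prior = 0.01
sensitivity = 0.99
specificity = 0.95

p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)  # P(+), total probability
posterior = sensitivity * prior / p_pos                        # Bayes: P(disease | +)

print(round(posterior, 4))   # ~0.1667: most positives are false positives
```

The counterintuitive part is the prior: true positives (0.99 × 0.01) are swamped by false positives from the healthy 99% (0.05 × 0.99).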
### Rung 03: Random variables, expectation, variance
- What: A random variable maps outcomes to numbers. `E[X]` is the average. `Var(X) = E[(X − E[X])²]`.
- Why it earns its place: Loss is `E[loss(x, y)]`. Generalization is about expectation under the data distribution. Variance shows up in regularization and exploration.
- Resource: Stat 110 lectures 6–10. Or Mathematics for ML chapter 6.
- Done when: You can compute mean and variance for a binomial and a discrete distribution by hand.
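A way to check your hand computation for the binomial case: sum over the PMF directly and compare with the closed forms `np` and `np(1−p)` (parameters here are illustrative):

```python
from math import comb

# Binomial(n=10, p=0.3): check E[X] = np and Var(X) = np(1-p)
# against a brute-force sum over the PMF.
n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(mean, n * p)           # both ≈ 3.0
print(var, n * p * (1 - p))  # both ≈ 2.1
```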
### Rung 04: Common distributions
- What: Bernoulli, Binomial, Categorical, Gaussian, Uniform. PDFs, PMFs, parameters.
- Why it earns its place: A token distribution is Categorical. Weights are often initialized from a Gaussian. Sampling temperature controls a Categorical's sharpness. You need these names to be automatic.
- Resource: Khan Academy + Stat 110 distribution lectures. Plus the `torch.distributions` PyTorch docs.
- Done when: You can sample from a Categorical in PyTorch and explain what temperature does to it.
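What temperature does is easiest to see numerically. A minimal NumPy sketch (the PyTorch version would pass `logits / T` to `torch.distributions.Categorical`; the logits here are made up):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Dividing logits by temperature T reshapes the Categorical before sampling:
# T < 1 sharpens it toward the argmax, T > 1 flattens it toward uniform.
logits = np.array([2.0, 1.0, 0.0])
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax(logits / T), 3))

rng = np.random.default_rng(0)
sample = rng.choice(len(logits), p=softmax(logits))   # draw one token index
```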
### Rung 05: Joint, marginal, conditional distributions
- What: For two random variables: joint `P(X,Y)`, marginal `P(X) = Σ_Y P(X,Y)`, conditional `P(Y|X) = P(X,Y)/P(X)`.
- Why it earns its place: Generative models model joint distributions. Discriminative models model conditionals. Knowing the difference is foundational.
- Resource: Stat 110 joint distribution lectures.
- Done when: You can explain in one sentence the difference between a generative and a discriminative classifier.
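The marginal and conditional definitions above can be checked on a tiny joint table (the numbers are illustrative):

```python
import numpy as np

# A small joint distribution P(X, Y) as a table: rows index X, columns index Y.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)            # marginal P(X): sum out Y, ≈ [0.4, 0.6]
p_y_given_x0 = joint[0] / p_x[0]   # conditional P(Y | X=0): renormalize a row

print(p_x)
print(p_y_given_x0)                # ≈ [0.25, 0.75]
```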
### Rung 06: Maximum likelihood estimation
- What: Pick parameters `θ` that maximize the probability of the observed data: `θ̂ = argmax_θ Πᵢ p(xᵢ; θ)`. Equivalently, minimize the negative log-likelihood.
- Why it earns its place: Cross-entropy loss is exactly the negative log-likelihood of a categorical distribution. Every LLM is trained by maximum likelihood.
- Resource: Pattern Recognition and Machine Learning (Bishop) section 1.2, or Deep Learning (Goodfellow) section 5.5. Plus Karpathy's `makemore` lecture 1, which derives MLE for a bigram model from scratch.
- Done when: You can derive cross-entropy as the negative log-likelihood of a Categorical and explain why this is the natural training objective.
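The identity the done-when asks you to derive can be verified numerically: the average negative log-likelihood of categorical data equals the cross-entropy between the empirical distribution and the model (labels and model probabilities below are made up):

```python
import numpy as np

data = np.array([0, 0, 1, 2, 0, 1])   # observed class labels
q = np.array([0.5, 0.3, 0.2])         # some model distribution over 3 classes

# Average NLL of the data under q ...
nll = -np.mean(np.log(q[data]))

# ... equals H(p_data, q), the cross-entropy with the empirical distribution.
p_data = np.bincount(data, minlength=3) / len(data)
cross_entropy = -np.sum(p_data * np.log(q))

print(nll, cross_entropy)   # identical
```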
### Rung 07: KL divergence, entropy, cross-entropy
- What: Entropy `H(p) = −Σ p log p` measures uncertainty. KL divergence `D(p||q) = Σ p log(p/q)` measures how far distribution p is from q (it is not a true distance). Cross-entropy `H(p, q) = H(p) + D(p||q)`.
- Why it earns its place: KL appears in DPO, knowledge distillation, RL with a KL penalty (RLHF), and mutual information. Cross-entropy is the LLM training loss. These are not optional.
- Resource: Cover & Thomas, Elements of Information Theory, chapter 2 (selected sections). Or search "KL divergence intuition": Will Kurt and Jay Alammar both have good posts.
- Done when: You can sketch KL divergence in a picture (two overlapping distributions) and explain why it's not symmetric.
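The asymmetry is easy to demonstrate with two made-up distributions:

```python
import numpy as np

def kl(p, q):
    # D(p || q) = Σ p log(p / q); assumes both have full support.
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl(p, q), kl(q, p))   # different numbers: KL is not symmetric
```

Intuitively, `D(p||q)` penalizes regions where p has mass but q doesn't, while `D(q||p)` penalizes the reverse, so swapping the arguments changes the answer.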
### Rung 08: Sampling: how to draw from a distribution
- What: Inverse CDF, rejection sampling, and ancestral sampling for discrete distributions; the reparameterization trick for continuous ones.
- Why it earns its place: LLM decoding is sampling. Top-k, top-p (nucleus), and temperature are sampling strategies; beam search is their deterministic cousin. Reparameterization shows up in VAEs.
- Resource: Deep Learning chapter 17, selectively. Plus the Hugging Face blog post "How to generate text" (search "huggingface how to generate").
- Done when: You can implement top-k and top-p sampling in NumPy.
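One possible NumPy sketch of the filtering step for both strategies (sampling from the filtered distribution is then a `rng.choice` call); the example probabilities are made up:

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep the k most probable tokens, zero out the rest, renormalize.
    out = np.zeros_like(probs)
    idx = np.argsort(probs)[-k:]
    out[idx] = probs[idx]
    return out / out.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability >= p.
    order = np.argsort(probs)[::-1]            # indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # number of tokens kept
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.5, 0.25, 0.15, 0.1])
print(top_k_filter(probs, 2))     # keeps the top 2, renormalized to 2/3 and 1/3
print(top_p_filter(probs, 0.9))   # keeps 0.5 + 0.25 + 0.15, drops the tail
```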
### Rung 09: Hypothesis testing and confidence intervals
- What: Null hypothesis, p-value, t-test, bootstrap intervals.
- Why it earns its place: When you say "model A is better than model B" with eval numbers, you need to know whether the difference is significant. Otherwise you ship noise.
- Resource: Khan Academy "Inferential statistics." Plus Allen Downey's Think Stats (free PDF).
- Done when: You can compute a bootstrap confidence interval on an eval metric and report it correctly.
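A percentile-bootstrap sketch of that done-when (the eval results here are simulated, and the resample count is a typical choice, not a rule):

```python
import numpy as np

# Percentile bootstrap CI on accuracy: resample the per-example
# correctness vector with replacement and take the 2.5 / 97.5 percentiles.
rng = np.random.default_rng(0)
correct = rng.random(200) < 0.8   # simulated eval results: ~80% accuracy

boot = [rng.choice(correct, size=len(correct), replace=True).mean()
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"accuracy {correct.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting "accuracy 0.80 [0.74, 0.85]" instead of a bare point estimate is what makes a model-A-vs-model-B comparison honest.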
### Rung 10: Bayesian thinking (light touch)
- What: Prior × Likelihood ∝ Posterior. Belief updating with evidence.
- Why it earns its place: Bayesian reasoning is how good engineers reason about model uncertainty in production. Useful framing for evals and red-teaming.
- Resource: 3Blue1Brown Bayes' theorem video; Stat 110 Bayes lectures. Optional: Bayesian Methods for Hackers (Cam Davidson-Pilon, free online).
- Done when: You can explain why a 99%-accurate test for a rare disease still produces mostly false positives.
## Minimum required to leave this sequence
- Solve a Bayes' rule word problem without help.
- Compute expectation and variance of common distributions.
- Explain MLE and why it gives cross-entropy.
- Define KL divergence and explain its asymmetry.
- Implement top-k and top-p sampling.
- Compute a bootstrap CI on a model accuracy and report it.
## Going further (only after the minimum)
- Joe Blitzstein, Stat 110 (Harvard): full lectures, free; the canonical undergrad probability course.
- Cover & Thomas, Elements of Information Theory: chapters 1–2 are foundational for anyone serious about LLMs.
- Bishop, Pattern Recognition and Machine Learning: older but still the best probabilistic-ML book.
## How this sequence connects to the year
- Month 2: rungs 03–06 are needed to understand what you're optimizing when you train a classifier.
- Month 3: rungs 06–08 are essential for understanding LLM training (cross-entropy as MLE) and decoding (top-k, top-p).
- Month 6: rung 09 is required to report eval numbers honestly with confidence intervals.
- Month 8: rung 07 (KL divergence) is required to read DPO / GRPO papers.