# 03: Probability & Statistics
## Why this matters in the journey
Machine learning is applied probability. A model is a probability distribution p(y | x) you fit to data. Cross-entropy is a likelihood. Sampling from an LLM is sampling from a distribution over tokens. Every eval metric (precision, recall, AUC, perplexity, accuracy with confidence intervals) is statistics. You need probabilistic intuition, not measure theory.
## The rungs
### Rung 01: Sample space, events, probability axioms
- What: A probability is a number in `[0, 1]` assigned to events in a sample space. P(A or B) = P(A) + P(B) − P(A and B).
- Why it earns its place: You can't reason about anything below without this floor. Stats jargon assumes it.
- Resource: Khan Academy "Statistics and probability" intro; or Introduction to Probability (Blitzstein's Stat 110; lectures free on YouTube, search "Stat 110").
- Done when: You can compute the probability of "at least one head in 3 flips" without confusion.
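The done-when exercise can be sanity-checked by brute force; a quick Python sketch, enumerating the sample space and comparing against the complement rule:

```python
from itertools import product

# Enumerate the sample space of 3 fair coin flips, count outcomes
# with at least one head, and compare with 1 - P(no heads).
outcomes = list(product("HT", repeat=3))          # 8 equally likely outcomes
at_least_one_head = sum(1 for o in outcomes if "H" in o) / len(outcomes)

print(at_least_one_head)   # 7/8 = 0.875
print(1 - 0.5 ** 3)        # same answer via the complement rule
```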
### Rung 02: Conditional probability and Bayes' rule
- What: `P(A|B) = P(A∩B)/P(B)`, and Bayes: `P(A|B) = P(B|A)P(A)/P(B)`.
- Why it earns its place: Naive Bayes, language modeling (`P(word | context)`), and most generative-model thinking is Bayesian. Posterior reasoning is fundamental.
- Resource: 3Blue1Brown's "Bayes' theorem" video. Plus Stat 110 lectures 3–5.
- Done when: You can solve the "disease test with 1% prior, 99% sensitivity, 95% specificity" problem and explain why the result is counterintuitive.
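The disease-test problem above works out in a few lines; a minimal sketch using the total-probability rule and Bayes' rule:

```python
# Posterior for the classic screening problem: 1% prior prevalence,
# 99% sensitivity P(+ | disease), 95% specificity P(- | no disease).
prior = 0.01
sensitivity = 0.99
specificity = 0.95

p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)  # P(+), total probability
posterior = sensitivity * prior / p_pos                        # Bayes: P(disease | +)

print(round(posterior, 4))   # ~0.1667: most positives are false positives
```

The counterintuitive part is the prior: true positives (0.99 × 0.01) are swamped by false positives from the healthy 99% (0.05 × 0.99).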
### Rung 03: Random variables, expectation, variance
- What: A random variable maps outcomes to numbers. `E[X]` is the average. `Var(X) = E[(X − E[X])²]`.
- Why it earns its place: Loss is `E[loss(x, y)]`. Generalization is about expectation under the data distribution. Variance shows up in regularization and exploration.
- Resource: Stat 110 lectures 6–10. Or Mathematics for ML chapter 6.
- Done when: You can compute mean and variance for a binomial and a discrete distribution by hand.
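A way to check your hand computation for the binomial case: sum over the PMF directly and compare with the closed forms `np` and `np(1−p)` (parameters here are illustrative):

```python
from math import comb

# Binomial(n=10, p=0.3): check E[X] = np and Var(X) = np(1-p)
# against a brute-force sum over the PMF.
n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(mean, n * p)           # both ≈ 3.0
print(var, n * p * (1 - p))  # both ≈ 2.1
```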
### Rung 04: Common distributions
- What: Bernoulli, Binomial, Categorical, Gaussian, Uniform. PDFs, PMFs, parameters.
- Why it earns its place: A token distribution is Categorical. Weights are often initialized from a Gaussian. Sampling temperature controls a Categorical's sharpness. You need these names to be automatic.
- Resource: Khan Academy + Stat 110 distribution lectures. Plus the `torch.distributions` PyTorch docs.
- Done when: You can sample from a Categorical in PyTorch and explain what temperature does to it.
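What temperature does is easiest to see numerically. A minimal NumPy sketch (the PyTorch version would pass `logits / T` to `torch.distributions.Categorical`; the logits here are made up):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Dividing logits by temperature T reshapes the Categorical before sampling:
# T < 1 sharpens it toward the argmax, T > 1 flattens it toward uniform.
logits = np.array([2.0, 1.0, 0.0])
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax(logits / T), 3))

rng = np.random.default_rng(0)
sample = rng.choice(len(logits), p=softmax(logits))   # draw one token index
```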
### Rung 05: Joint, marginal, conditional distributions
- What: For two random variables: joint `P(X,Y)`, marginal `P(X) = Σ_Y P(X,Y)`, conditional `P(Y|X) = P(X,Y)/P(X)`.
- Why it earns its place: Generative models model joint distributions. Discriminative models model conditionals. Knowing the difference is foundational.
- Resource: Stat 110 joint distribution lectures.
- Done when: You can explain in one sentence the difference between a generative and a discriminative classifier.
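The marginal and conditional definitions above can be checked on a tiny joint table (the numbers are illustrative):

```python
import numpy as np

# A small joint distribution P(X, Y) as a table: rows index X, columns index Y.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)            # marginal P(X): sum out Y, ≈ [0.4, 0.6]
p_y_given_x0 = joint[0] / p_x[0]   # conditional P(Y | X=0): renormalize a row

print(p_x)
print(p_y_given_x0)                # ≈ [0.25, 0.75]
```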
### Rung 06: Maximum likelihood estimation
- What: Pick parameters `θ` that maximize the probability of the observed data: `θ̂ = argmax_θ Πᵢ p(xᵢ; θ)`. Equivalently, minimize the negative log-likelihood.
- Why it earns its place: Cross-entropy loss is exactly the negative log-likelihood of a categorical distribution. Every LLM is trained by maximum likelihood.
- Resource: Pattern Recognition and Machine Learning (Bishop) section 1.2, or Deep Learning (Goodfellow) section 5.5. Plus Karpathy's `makemore` lecture 1, which derives MLE for a bigram model from scratch.
- Done when: You can derive cross-entropy as the negative log-likelihood of a Categorical and explain why this is the natural training objective.
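The identity the done-when asks you to derive can be verified numerically: the average negative log-likelihood of categorical data equals the cross-entropy between the empirical distribution and the model (labels and model probabilities below are made up):

```python
import numpy as np

data = np.array([0, 0, 1, 2, 0, 1])   # observed class labels
q = np.array([0.5, 0.3, 0.2])         # some model distribution over 3 classes

# Average NLL of the data under q ...
nll = -np.mean(np.log(q[data]))

# ... equals H(p_data, q), the cross-entropy with the empirical distribution.
p_data = np.bincount(data, minlength=3) / len(data)
cross_entropy = -np.sum(p_data * np.log(q))

print(nll, cross_entropy)   # identical
```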
### Rung 07: KL divergence, entropy, cross-entropy
- What: Entropy `H(p) = −Σ p log p` measures uncertainty. KL divergence `D(p||q) = Σ p log(p/q)` measures how far distribution p is from q (it is not a true distance). Cross-entropy `H(p, q) = H(p) + D(p||q)`.
- Why it earns its place: KL appears in DPO, knowledge distillation, RL with a KL penalty (RLHF), and mutual information. Cross-entropy is the LLM training loss. These are not optional.
- Resource: Cover & Thomas, Elements of Information Theory, chapter 2 (selected sections). Or search "KL divergence intuition": Will Kurt and Jay Alammar both have good posts.
- Done when: You can sketch KL divergence in a picture (two overlapping distributions) and explain why it's not symmetric.
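The asymmetry is easy to demonstrate with two made-up distributions:

```python
import numpy as np

def kl(p, q):
    # D(p || q) = Σ p log(p / q); assumes both have full support.
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl(p, q), kl(q, p))   # different numbers: KL is not symmetric
```

Intuitively, `D(p||q)` penalizes regions where p has mass but q doesn't, while `D(q||p)` penalizes the reverse, so swapping the arguments changes the answer.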
### Rung 08: Sampling: how to draw from a distribution
- What: Inverse CDF, rejection sampling, and ancestral sampling for discrete distributions; the reparameterization trick for continuous ones.
- Why it earns its place: LLM decoding is sampling. Top-k, top-p (nucleus), and temperature are sampling strategies; beam search is their deterministic cousin. Reparameterization shows up in VAEs.
- Resource: Deep Learning chapter 17, selectively. Plus the Hugging Face blog post "How to generate text" (search "huggingface how to generate").
- Done when: You can implement top-k and top-p sampling in NumPy.
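One possible NumPy sketch of the filtering step for both strategies (sampling from the filtered distribution is then a `rng.choice` call); the example probabilities are made up:

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep the k most probable tokens, zero out the rest, renormalize.
    out = np.zeros_like(probs)
    idx = np.argsort(probs)[-k:]
    out[idx] = probs[idx]
    return out / out.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability >= p.
    order = np.argsort(probs)[::-1]            # indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # number of tokens kept
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.5, 0.25, 0.15, 0.1])
print(top_k_filter(probs, 2))     # keeps the top 2, renormalized to 2/3 and 1/3
print(top_p_filter(probs, 0.9))   # keeps 0.5 + 0.25 + 0.15, drops the tail
```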
### Rung 09: Hypothesis testing and confidence intervals
- What: Null hypothesis, p-value, t-test, bootstrap intervals.
- Why it earns its place: When you say "model A is better than model B" with eval numbers, you need to know whether the difference is significant. Otherwise you ship noise.
- Resource: Khan Academy "Inferential statistics." Plus Allen Downey's Think Stats (free PDF).
- Done when: You can compute a bootstrap confidence interval on an eval metric and report it correctly.
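A percentile-bootstrap sketch of that done-when (the eval results here are simulated, and the resample count is a typical choice, not a rule):

```python
import numpy as np

# Percentile bootstrap CI on accuracy: resample the per-example
# correctness vector with replacement and take the 2.5 / 97.5 percentiles.
rng = np.random.default_rng(0)
correct = rng.random(200) < 0.8   # simulated eval results: ~80% accuracy

boot = [rng.choice(correct, size=len(correct), replace=True).mean()
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"accuracy {correct.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting "accuracy 0.80 [0.74, 0.85]" instead of a bare point estimate is what makes a model-A-vs-model-B comparison honest.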
### Rung 10: Bayesian thinking (light touch)
- What: Prior × Likelihood ∝ Posterior. Belief updating with evidence.
- Why it earns its place: Bayesian reasoning is how good engineers reason about model uncertainty in production. Useful framing for evals and red-teaming.
- Resource: 3Blue1Brown Bayes' theorem video; Stat 110 Bayes lectures. Optional: Bayesian Methods for Hackers (Cam Davidson-Pilon, free online).
- Done when: You can explain why a 99%-accurate test for a rare disease still produces mostly false positives.
## Minimum required to leave this sequence
- Solve a Bayes' rule word problem without help.
- Compute expectation and variance of common distributions.
- Explain MLE and why it gives cross-entropy.
- Define KL divergence and explain its asymmetry.
- Implement top-k and top-p sampling.
- Compute a bootstrap CI on a model accuracy and report it.
## Going further (only after the minimum)
- Joe Blitzstein, Stat 110 (Harvard): full lectures, free; the canonical undergrad probability course.
- Cover & Thomas, Elements of Information Theory: chapters 1–2 are foundational for anyone serious about LLMs.
- Bishop, Pattern Recognition and Machine Learning: older but still the best probabilistic-ML book.
## How this sequence connects to the year
- Month 2: rungs 03–06 are needed to understand what you're optimizing when you train a classifier.
- Month 3: rungs 06–08 are essential for understanding LLM training (cross-entropy as MLE) and decoding (top-k, top-p).
- Month 6: rung 09 is required to report eval numbers honestly with confidence intervals.
- Month 8: rung 07 (KL divergence) is required to read DPO / GRPO papers.