04 - ML mental model in one page

What this session is

The whole of "what is machine learning" in one page. The mental model you'll keep coming back to. Not a course; a frame.

The one-sentence definition

Machine learning is fitting a function to data, where the function has lots of knobs ("parameters"), and the fitting is done by an optimizer that adjusts the knobs to make the function's outputs closer to known answers on training examples - and you hope it generalizes to new examples.

That's it. Everything else is variations.

The four pieces

Every ML system has these four:

1. Data

Examples. Inputs paired (usually) with desired outputs.

  • Supervised: labeled (input → output known).
  • Unsupervised: no labels (find structure in the inputs).
  • Self-supervised: the input is the label, in a clever way (e.g., predict next token of text).
  • Reinforcement: no labels; reward signal from the environment.
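
The self-supervised case is the least intuitive, so here is a minimal sketch of how "the input is the label": slide a window over a token sequence and collect (context → next token) pairs. The toy text and whitespace "tokenizer" are stand-ins for the real thing.

```python
# Minimal sketch: turning raw text into self-supervised training pairs.
# str.split() is a toy stand-in for a real tokenizer.
text = "the cat sat on the mat"
tokens = text.split()

context_size = 3
pairs = []
for i in range(len(tokens) - context_size):
    context = tokens[i : i + context_size]   # input: a window of tokens
    target = tokens[i + context_size]        # label: the very next token
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# ['the', 'cat', 'sat'] -> 'on'
# ['cat', 'sat', 'on'] -> 'the'
# ...
```

No human labeled anything; the labels come for free from the text itself.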

2. Model

A function with parameters. For neural nets, "parameters" means "the weights." A model with 7 billion parameters is a function with 7 billion knobs.

Bigger models can fit more complex functions. They also need more data and compute.

3. Loss function

A number that measures "how wrong is the model right now." Lower is better.

  • For regression (predict a number): mean squared error.
  • For classification (predict a category): cross-entropy.
  • For generation (predict a sequence): next-token prediction loss.
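
A minimal sketch of the first two losses on toy numbers (numpy, made-up values). Next-token prediction loss is just cross-entropy applied at every position of a sequence.

```python
import numpy as np

# Regression: mean squared error. Lower = predictions closer to targets.
predictions = np.array([2.5, 0.0, 2.1])
targets     = np.array([3.0, -0.5, 2.0])
mse = np.mean((predictions - targets) ** 2)
print("MSE:", mse)   # ~0.17

# Classification: cross-entropy. The model outputs a probability for each
# category; the loss is -log(probability assigned to the correct one).
probs = np.array([0.7, 0.2, 0.1])   # model's probabilities over 3 classes
true_class = 0                      # index of the correct category
cross_entropy = -np.log(probs[true_class])
print("cross-entropy:", cross_entropy)   # ~0.36; explodes as probs[true_class] -> 0
```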

4. Optimizer

The algorithm that adjusts the knobs to reduce the loss. Usually a variant of gradient descent (SGD, Adam, AdamW). It looks at the gradient of the loss with respect to each knob and nudges that knob in the loss-reducing direction.

That's training: data → model predicts → loss measures wrongness → optimizer adjusts knobs → repeat.
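
Here is that loop end to end, as a runnable sketch with toy data and two knobs: the model is a straight line y = w*x + b, the loss is mean squared error, and the optimizer is plain gradient descent with hand-written gradients.

```python
import numpy as np

# Data: toy examples drawn from the line y = 2x + 1, plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)

# Model: two knobs (parameters), randomly initialized.
w, b = rng.normal(), rng.normal()
learning_rate = 0.1

for step in range(500):
    pred = w * x + b                      # model predicts
    loss = np.mean((pred - y) ** 2)       # loss measures wrongness
    grad_w = np.mean(2 * (pred - y) * x)  # gradient of loss w.r.t. each knob
    grad_b = np.mean(2 * (pred - y))
    w -= learning_rate * grad_w           # optimizer nudges each knob
    b -= learning_rate * grad_b           # in the loss-reducing direction

print(w, b)  # close to 2 and 1
```

Real training is this same loop with billions of knobs, and a library (PyTorch, JAX) computing the gradients for you.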

What "learning" really is

Imagine you're trying to fit a curve through dots on a graph. The curve has 1000 wiggles you can adjust. Each step:

  1. The curve makes a guess at each dot.
  2. You measure how far off it is.
  3. You nudge each wiggle a tiny bit to reduce the error.
  4. Repeat 10,000 times.

That's it. Neural network training is this, with billions of wiggles instead of 1000, and the "dots" being images, text tokens, or actions.

Why it works (mostly)

Two reasons:

  1. Universal approximation: big enough neural nets can fit (approximate) any reasonable function.
  2. Gradient descent finds good-enough minima: in high dimensions, the loss landscape has many decent solutions and the optimizer usually finds one.

It works far better than most researchers predicted in the 2010s. Nobody fully understands why it generalizes as well as it does. The empirical recipe, though, is clear: "scale + transformers + lots of compute."

The three common shapes

Classification

Inputs → one of N categories. "Is this email spam?" Output: probabilities over categories.

Regression

Inputs → a number. "What price will this house sell for?" Output: a continuous value.

Sequence-to-sequence (generation)

Inputs → another sequence. "Translate this French to English." "Continue this paragraph." Output: tokens one at a time.

Most modern AI (LLMs, image generation, speech, video) is some flavor of sequence generation.
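
A rough sketch of what the three output shapes look like in code (numpy; the "models" here are fake placeholders, only the shape of the outputs matters):

```python
import numpy as np

# Classification: scores over N categories, squashed into probabilities (softmax).
scores = np.array([2.0, 0.5, -1.0])           # e.g. spam / not spam / unsure
probs = np.exp(scores) / np.exp(scores).sum()
print(probs)                                  # sums to 1; pick the argmax

# Regression: a single continuous number.
predicted_price = 312_500.0

# Sequence generation: emit one token at a time, feeding each back in.
sequence = ["Translate:", "Bonjour"]
for _ in range(3):
    next_token = f"<token-{len(sequence)}>"   # placeholder for a model call
    sequence.append(next_token)
print(sequence)
```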

Generalization, overfitting, underfitting

Three states:

  • Underfitting: model too small or undertrained. Wrong on training and test data.
  • Good fit: correct on training data, mostly correct on new data.
  • Overfitting: memorized training data but fails on new data.

The whole game is finding the sweet spot. Techniques: more data, regularization, dropout, early stopping, smaller models.
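
A quick way to see all three states is to fit polynomials of different sizes to the same noisy data and compare training error against error on held-out points (numpy sketch, made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1, 1, 10))
y_train = np.sin(3 * x_train) + 0.2 * rng.normal(size=10)
x_test = np.sort(rng.uniform(-1, 1, 50))
y_test = np.sin(3 * x_test) + 0.2 * rng.normal(size=50)

for degree in (1, 4, 9):   # too few knobs, about right, too many knobs
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train {train_err:.3f}  test {test_err:.3f}")

# Typical pattern: degree 1 is wrong on both sets (underfit); degree 9 drives
# training error toward zero but does worse on the test points (overfit).
```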

What "deep learning" adds

Same four pieces. The model is a neural network - a function built of many simple layers stacked. "Deep" means lots of layers. Deep nets:

  • Can fit much more complex functions than classical ML.
  • Need more data and compute.
  • Discover useful intermediate representations automatically (the layers' outputs).

The last point is huge. In classical ML, you'd hand-craft features. In deep learning, the model figures them out.
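
A minimal sketch of "many simple layers stacked" (numpy, random weights, made-up sizes): each layer is a matrix multiply plus a nonlinearity, and the outputs of the middle layers are the intermediate representations. Training would adjust all of these weights with the same loop as above; here they're random just to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    return np.maximum(0, x @ W + b)   # linear map, then ReLU nonlinearity

# A tiny 2-hidden-layer net: 784 inputs (e.g. pixel values) -> 10 categories.
W1, b1 = rng.normal(size=(784, 128)) * 0.01, np.zeros(128)
W2, b2 = rng.normal(size=(128, 64)) * 0.01, np.zeros(64)
W3, b3 = rng.normal(size=(64, 10)) * 0.01, np.zeros(10)

x = rng.normal(size=(1, 784))     # one fake input example
h1 = layer(x, W1, b1)             # intermediate representation #1
h2 = layer(h1, W2, b2)            # intermediate representation #2
logits = h2 @ W3 + b3             # raw scores over 10 categories

n_params = sum(p.size for p in (W1, b1, W2, b2, W3, b3))
print(logits.shape, n_params)     # (1, 10) and ~109k knobs
```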

What "LLMs" add

LLMs (Large Language Models) are deep nets trained self-supervised on enormous text corpora, with a specific architecture (the transformer - see page 5). They're "trained to predict the next token" at planet scale, and out of that simple task comes the ability to translate, summarize, code, reason (at least somewhat), and converse.

Nobody planned all those abilities. They emerged from scale. This is both the most exciting and most uncomfortable fact about modern AI.

The map of the field

  • Classical ML: decision trees, SVMs, random forests, linear regression. Still used. Still useful. Often the right answer for tabular data.
  • Deep learning: neural nets for perception (images, audio, video).
  • NLP / LLMs: transformers for language. Currently the loudest part of the field.
  • RL: agents learning from reward. Used heavily in LLM post-training (RLHF, DPO).
  • Generative models: GANs (older), diffusion (current). For images, video, audio.

You don't need to specialize in all of these. Pick one on page 7.

What you might wonder

"Is ML just statistics?" Sort of. Heavy overlap. ML emphasizes prediction over inference; statistics emphasizes inference over prediction. The math is largely shared.

"Why does GPT 'understand' me if it's just predicting tokens?" Open question. Empirically, next-token prediction at scale produces models that pass many tests of understanding. Whether they "really" understand is philosophy. Engineers treat them as useful tools and measure outputs.

"How can a model with no understanding write working code?" Patterns in training data + emergent generalization. The model has seen billions of examples of code-and-explanation pairs. It learns the conditional distribution. Often-but-not-always, this produces correct code.

Done

  • One mental model for all of ML.
  • Four pieces, three shapes, three fit-states.
  • Know where LLMs sit in the landscape.

Next: Transformers in one page →