05 - Transformers in one page¶
What this session is¶
Transformers explained at the level an AI engineer needs. Not a paper reading. Not a derivation. Just: what they are, what they do, the four ideas that make them work, and why they took over.
The one-sentence definition¶
A transformer is a neural network architecture where every position in the input can directly attend to (look at, weight, and pull information from) every other position, in parallel, using a mechanism called attention.
That's the whole thing. "Attention is all you need" was the 2017 paper's title; it remains the elevator pitch.
The four ideas¶
1. Tokens¶
Text gets split into tokens - pieces of words. "Tokenization" turns a string into a list of integers. The transformer never sees text; it sees token IDs.
"Hello, world" → [15496, 11, 995] (or similar).
The model's vocabulary is ~30K-200K tokens. Each token has a learned vector (an "embedding") of some hundreds or thousands of numbers.
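To see this concretely, here's a minimal sketch using the tiktoken library (any tokenizer will do; the IDs shown come from GPT-2's BPE vocabulary, and other tokenizers give different IDs):

```python
# A quick check with the tiktoken library (pip install tiktoken).
# Token IDs are tokenizer-specific; these come from GPT-2's BPE vocabulary.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Hello, world")
print(ids)                               # [15496, 11, 995]
print([enc.decode([i]) for i in ids])    # ['Hello', ',', ' world']
print(enc.n_vocab)                       # 50257 -- GPT-2's vocabulary size
```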
2. Attention¶
For each token in the sequence, the model asks: "Which other tokens in this sequence should I pay attention to, and how much?"
Concretely, every token computes three vectors: a query, a key, and a value. Each token's attention score for every other token is query · key (dot product), scaled and passed through a softmax so the weights sum to 1. High dot product = high attention. The output for each position is a weighted sum of all values, weighted by those attention weights.
This is the "every position attends to every position" part. It happens in parallel - that's why transformers train fast on GPUs.
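Here's a minimal sketch of a single attention head in NumPy: no batching, no causal mask, no multi-head splitting, just the core math. The weight matrices and dimensions are made-up toy values:

```python
# Single-head scaled dot-product attention in NumPy (no batching, no causal mask).
import numpy as np

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # queries, keys, values: (seq_len, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every position scores every other position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # weighted sum of values per position

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)                # (4, 8)
```

Real implementations add a causal mask (so a token can't attend to the future) and run many such heads in parallel.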
3. Layers¶
A transformer has many layers. Each layer does: attention, then a small per-position feed-forward network. Stack 12-100+ of these. Each layer can route information differently. Lower layers tend to handle local structure (syntax); higher layers handle semantics; the very highest handle task-specific behavior.
Nobody hard-codes this. The optimizer discovers it.
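As a sketch, one pre-norm decoder block in PyTorch looks roughly like this. The hyperparameters are illustrative; real LLMs add a causal mask, RoPE, and other details:

```python
# One pre-norm decoder block, schematically. Hyperparameters are illustrative;
# real LLMs add a causal mask, RoPE, dropout, and other details.
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                    # per-position feed-forward network
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                                    # residual around attention
        x = x + self.mlp(self.norm2(x))              # residual around the MLP
        return x

# A full model: embeddings -> a stack of these blocks -> projection to vocab logits.
# layers = nn.Sequential(*[Block() for _ in range(12)])
```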
4. Positional encoding¶
Attention is order-agnostic by default - it sees a set, not a sequence. So you add a positional encoding to each token embedding to inject "this is position 0, this is position 1, ..." Different schemes exist (sinusoidal, learned, RoPE, ALiBi). Modern LLMs use RoPE.
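For intuition, here's the original sinusoidal scheme as a sketch (modern LLMs mostly use RoPE, which rotates queries and keys instead of adding vectors to the embeddings):

```python
# Sinusoidal positional encoding from the 2017 paper. Added to token embeddings
# before the first layer; modern LLMs mostly use RoPE instead.
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions get cos
    return pe

# x = token_embeddings + sinusoidal_positions(seq_len, d_model)
print(sinusoidal_positions(4, 8)[0])             # position 0: [0, 1, 0, 1, ...]
```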
That's it. Tokens → embeddings + position → many attention layers → output logits over vocab → softmax → predicted next token.
Why transformers won¶
Three reasons:
- Parallelism. Unlike RNNs, transformers can process all positions in parallel during training. GPUs love that. Training is 10-100x faster at a comparable parameter count.
- Scaling. Transformer performance scales smoothly with parameters, data, and compute. Doubling each gives predictable improvements. This made the AI investment thesis possible.
- Generality. Same architecture works for text, code, images (Vision Transformer), audio, video. One architecture, many domains.
How LLMs use them¶
LLMs are decoder-only transformers trained to predict the next token. Show them billions of token sequences with the task "predict token N+1 given tokens 0..N." The optimizer tunes hundreds of billions of parameters until the model gets very good at this.
To generate text:
1. Tokenize the prompt.
2. Forward pass: get logits over the vocabulary at the last position.
3. Sample one token (greedy, top-k, top-p, temperature).
4. Append it to the sequence.
5. Repeat from step 2.
That's inference. Slow because it's sequential - one token at a time. The whole serving optimization industry exists to make this fast.
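Written out, the loop looks roughly like this. It's a greedy-decoding sketch using the Hugging Face transformers library, with GPT-2 as a small stand-in model; real serving uses model.generate() and a KV cache instead of this naive re-computation:

```python
# Greedy decoding with Hugging Face transformers; GPT-2 is just a small stand-in model.
# Production serving uses model.generate() and a KV cache rather than this naive loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids      # step 1: tokenize
with torch.no_grad():
    for _ in range(10):
        logits = model(ids).logits[:, -1, :]                  # step 2: logits at last position
        next_id = torch.argmax(logits, dim=-1, keepdim=True)  # step 3: greedy; swap in sampling here
        ids = torch.cat([ids, next_id], dim=-1)               # step 4: append, then repeat
print(tok.decode(ids[0]))
```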
What transformers don't do (intuition)¶
- They don't reason in the human sense. They compute conditional probabilities over tokens.
- They don't have memory across conversations unless you give it to them (context window or external memory).
- They don't know what's true. They know what's statistically likely given the training data.
- They don't refuse to make things up unless trained to.
Everything an LLM appears to do - reasoning, planning, refusing - is a learned behavior. Some emerge from scale (basic reasoning). Some are added in post-training (refusals, helpfulness).
What to know to do AI engineering¶
- Context window. How many tokens the model can attend to at once. 8K, 128K, 1M depending on model.
- Tokenization. Different tokenizers = different costs and behaviors. BPE, SentencePiece, tiktoken.
- Attention is O(n²) in sequence length. Long context is expensive. This drives most serving optimizations.
- KV cache. During generation, you cache the keys and values from previous tokens so you don't recompute. Huge memory consumer.
- Quantization. Reducing the precision of weights (16-bit → 8-bit → 4-bit) to fit bigger models in memory. Some accuracy lost; often worth it.
These five items come up in every serving conversation. Know them.
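To make the KV cache and O(n²) points concrete, here's a rough back-of-envelope calculation. The numbers are illustrative (roughly a 7B dense model with full multi-head attention), not a spec for any particular model:

```python
# Back-of-envelope KV-cache size. The numbers are illustrative (roughly a 7B dense
# model with full multi-head attention), not a spec for any particular model.
n_layers   = 32
n_kv_heads = 32
head_dim   = 128
bytes_per  = 2       # fp16
seq_len    = 8192
batch      = 1

# 2x because both keys and values are cached for every layer and head.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per * batch
print(f"{kv_bytes / 1e9:.1f} GB")   # ~4.3 GB for a single 8K-token sequence
```

The cache grows linearly with sequence length and batch size, which is why it dominates serving memory budgets.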
Architectural variants¶
You'll hear these in interviews:
- Encoder-only (BERT): good at understanding tasks, embeddings.
- Decoder-only (GPT, Llama): good at generation. The current default for LLMs.
- Encoder-decoder (T5): good at translation, summarization.
- Mixture-of-Experts (MoE): only some parameters activate per token. Bigger total capacity, cheaper inference. DeepSeek, Mixtral.
- Diffusion transformers: transformers for image/video diffusion. Stable Diffusion 3, Sora.
What you might wonder¶
"Do I need to read the original 'Attention Is All You Need' paper?" Once, yes. Skim it. It's clear. The Annotated Transformer (Harvard NLP) walks through it with code.
"Will transformers be replaced soon?" Maybe. Active research: state-space models (Mamba), linear attention, hybrid architectures. Whatever replaces them will be a variation on "every position attends to every other position, cheaply." Skills transfer.
"Should I implement a transformer from scratch?" Optional. Karpathy's "Let's build GPT" video does it in ~2 hours. Worth it once for intuition. Don't make it your project.
Done¶
- Know what tokens, attention, layers, positional encoding are.
- Know why transformers won.
- Know the engineer's checklist: context window, tokenization, O(n²), KV cache, quantization.
Next: The Hugging Face ecosystem map →