
07 - Transformers and Tokenization

What this session is

About an hour. What an LLM actually does - at the level needed to use, fine-tune, and serve them. The architecture, the tokenization step that confuses everyone, the autoregressive generation loop.

This page is dense. Plan to re-read.

The big picture

A language model:

  1. Takes a sequence of tokens (integer IDs representing chunks of text).
  2. Predicts a probability distribution over the next token.
  3. You sample one. Append it. Repeat.

That's it. The clever part - what makes LLMs work - is the architecture that produces the next-token prediction. That architecture is the transformer.

Tokenization

Text → token IDs. The model never sees characters or words; it sees integers.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, world!"
ids = tok.encode(text)
print(ids)
# [15496, 11, 995, 0]
print([tok.decode([i]) for i in ids])
# ['Hello', ',', ' world', '!']

Each token is roughly a "subword piece." Common words → single token. Rare words → multiple. The tokenizer's vocabulary was fixed before the model was trained; you can't change it.

Why subwords: vocabulary size matters. Word-level vocabulary needs hundreds of thousands of entries (with new ones constantly appearing). Character-level produces very long sequences. Subword (BPE, WordPiece, SentencePiece) is the compromise: 32k-256k tokens covering nearly any text.

Practical implications:

  • Token counts are not word counts. "I am happy" = 3 tokens, "antidisestablishmentarianism" = many.
  • Prices are per-token; latency is per-token.
  • Different models have different tokenizers; same text → different token counts.
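A quick check of the first point, reusing the GPT-2 tokenizer loaded above (other tokenizers will give different counts):

for text in ["I am happy", "antidisestablishmentarianism", "The quick brown fox jumps over the lazy dog."]:
    print(f"{text!r}: {len(text.split())} words, {len(tok.encode(text))} tokens")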

What a transformer does

Inside the model, each token ID becomes an embedding - a vector of ~768 to ~12000 dimensions (depending on the model).

A transformer layer processes a sequence of these vectors and produces an updated sequence of the same shape. The crucial operation is attention - every output position is a weighted sum of all input positions (and itself), where the weights are computed from the inputs themselves.

This lets the model "look at" other parts of the sequence when generating each output. Long-range dependencies (across thousands of tokens) become tractable.

Stack many such layers (~12-100), add positional encodings so the model knows token order, end with a linear projection back to the vocabulary, apply softmax - you have the next-token distribution.

The full math is in the AI Systems senior path, Deep Dive 07. For now, the operational view: a transformer transforms a sequence of token embeddings into a probability distribution over the next token.
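To make that operational view concrete, here is a minimal sketch with GPT-2 (the same model used in the generation example below): run the input through the model, take the logits at the last position, and softmax them into a next-token distribution.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tok.encode("The quick brown", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids).logits          # (1, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the ~50k-token vocabulary
top = torch.topk(probs, 5)
for p, i in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tok.decode([i])!r}: {p:.3f}")    # the five most likely next tokens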

Causal (autoregressive) vs masked

Two families:

  • Causal / autoregressive models (GPT-family, Llama, Mistral, Gemma) - each position attends only to positions before it. Generates left-to-right. Used for language generation.

  • Masked models (BERT-family) - every position attends to every other. Used for understanding (classification, NER, embeddings).

If you're working with chatbots, code completion, RAG - causal. Embedding for retrieval - masked. Many modern open-source LLMs are causal (the GPT-style architecture won).
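The "attends only to positions before it" rule is implemented with a causal mask - a lower-triangular matrix saying which positions each token may attend to. Purely illustrative; you normally never build this yourself:

import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask.int())
# Row i has ones at columns 0..i: token i may attend to itself and everything before it,
# nothing after it. A masked (BERT-style) model would use an all-ones matrix instead.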

The generation loop

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The quick brown fox"
input_ids = tok.encode(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=30,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
    )

print(tok.decode(output[0]))

What's happening:

  1. Tokenize the prompt into IDs.
  2. Call model.generate(...) - a convenience wrapper around the basic predict-and-append loop.
  3. The model generates 30 more tokens, sampling from each next-token distribution.
  4. Decode the resulting IDs back to text.
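Stripped of the options, generate() is roughly this loop. A minimal greedy sketch (reusing tok and model from above), not the library's actual implementation:

ids = tok.encode("The quick brown fox", return_tensors="pt")

with torch.no_grad():
    for _ in range(30):                                     # 30 new tokens
        logits = model(ids).logits                          # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()                    # greedy: most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)   # append and predict again

print(tok.decode(ids[0]))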

Sampling parameters:

  • temperature - how peaked the distribution is. 0 = pick the most likely (greedy). Higher = more diverse, less predictable. 0.7-1.0 typical.
  • top_p (nucleus) - only consider tokens whose cumulative probability is up to p. Avoids low-probability "weird" tokens.
  • top_k - only consider the k most likely. Cruder but works.
  • max_new_tokens - when to stop.
  • stop - explicit stop strings.

Different sampling strategies → different output styles. Greedy is deterministic but often repetitive. Top-p sampling is the modern default.
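To see what temperature and top_p actually do to a distribution, here is a toy sketch on a made-up five-token vocabulary; the library's real sampling code is more involved:

import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])  # made-up scores for a 5-token vocabulary

def sample(logits, temperature=0.8, top_p=0.9):
    probs = torch.softmax(logits / temperature, dim=-1)  # temperature rescales logits before softmax
    sorted_probs, sorted_ids = probs.sort(descending=True)
    keep = sorted_probs.cumsum(dim=-1) <= top_p          # nucleus: smallest set covering prob mass p
    keep[0] = True                                       # always keep at least the top token
    kept = sorted_probs * keep
    kept = kept / kept.sum()                             # renormalize over the nucleus
    return sorted_ids[torch.multinomial(kept, 1)].item()

print([sample(logits) for _ in range(10)])  # mostly tokens 0 and 1; raise temperature for more variety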

What "small" and "big" mean

Some calibration:

  • GPT-2 small: 124M params. Fits in 500MB. Runs on a laptop.
  • Llama 3 8B: 8 billion. 16GB at FP16. Single high-end GPU.
  • Llama 3 70B: 70 billion. 140GB at FP16. Multiple GPUs (or quantized down to 4-bit for a single ~40GB GPU).
  • GPT-4-class frontier models: ~hundreds of billions to trillions (rumors; not public). Many GPUs.
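The FP16 numbers above are just parameter count × bytes per parameter (weights only; activations and the KV cache come on top). A quick back-of-the-envelope check:

def weight_gb(n_params, bytes_per_param=2):
    # Weight-only memory estimate; activations and KV cache are extra.
    return n_params * bytes_per_param / 1e9

print(weight_gb(124e6, bytes_per_param=4))   # GPT-2 small at FP32: ~0.5 GB
print(weight_gb(8e9))                        # Llama 3 8B at FP16:  16 GB
print(weight_gb(70e9))                       # Llama 3 70B at FP16: 140 GB
print(weight_gb(70e9, bytes_per_param=0.5))  # 70B quantized to 4-bit: 35 GB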

The pattern: 10x bigger → noticeably smarter on hard tasks. Quality scales with parameters + training data + compute (the "scaling laws").

For learning, GPT-2 / Llama 3 8B (or smaller) suffice.

Context length

The maximum sequence length the model can attend over. GPT-2 was 1024. Llama 3 ranges from 8192 (3.0) to 128k (3.1+). Frontier models claim 200k+.

Two implications:

  • Attention compute is quadratic in context length (O(n²)), and memory grows with it too: 100k context is 10000x more attention compute than 1k.
  • Practical context isn't the same as advertised context. A model trained on long contexts doesn't necessarily use the middle well - research ("Lost in the Middle") shows degradation. RAG (page 10) mitigates this.

Embeddings (preview)

A transformer also gives you embeddings - vector representations of pieces of text. Internally, token IDs are mapped to token embeddings; dedicated embedding models pool those into a single vector per text. These are useful beyond generation: semantic search, classification, clustering.

from sentence_transformers import SentenceTransformer
m = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["a cat sat on a mat", "the dog ran"]
embeddings = m.encode(texts)            # shape (2, 384) for this model

Cosine similarity between two embeddings ≈ semantic similarity. Page 10 builds RAG on this.
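Continuing the snippet above, cosine similarity is just a normalized dot product:

import numpy as np

a, b = embeddings  # the two vectors produced by m.encode(texts) above
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))  # closer to 1.0 = more similar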

Exercise

  1. Run the GPT-2 generation example above. Try different prompts. Vary temperature from 0.1 to 1.5. Note how it changes.

  2. Inspect tokenization:

    for text in ["hello", "antidisestablishmentarianism", "I'm fine.", "🦀"]:
        ids = tok.encode(text)
        print(f"{text!r} -> {ids} -> {[tok.decode([i]) for i in ids]}")
    
    Notice how rare words and emoji become multiple tokens.

  3. Greedy decoding:

    output = model.generate(input_ids, max_new_tokens=30, do_sample=False)
    
    Run twice with the same prompt. Output is identical (deterministic). Then with do_sample=True, different each time.

  4. (Stretch - GPU helpful): try a small open-source LLM. Hugging Face Hub: search gpt2-medium, microsoft/phi-2, meta-llama/Llama-3.2-1B. (Llama gates require accepting a license on HF.) Load and generate. Note the quality difference.

What you might wonder

"Why is the same word sometimes one token and sometimes two?" Subword tokenizers split based on frequency in their training data. " happy" (with leading space) and "happy" (without) are distinct tokens. Case matters too. Don't fight it; understand it.

"What's a 'chat model' vs a 'base model'?" Base models are trained on raw text. Chat models are fine-tuned (page 09) with conversation data + safety training. For "ask a question, get an answer" use chat models. For raw text completion or further fine-tuning, base models.

"What's the actual difference between GPT-2 and modern LLMs architecturally?" Mostly: more parameters, more training data, more compute. Architectural tweaks (rotary positional encoding, grouped-query attention, SwiGLU activations, RMSNorm) are real but modest. The scaling matters more than the architecture changes.

"Should I implement attention from scratch?" Once, for understanding. Andrej Karpathy's "Let's build GPT" video walks you through it. Then use library implementations for production.

Done

  • Understand the token → embedding → transformer → next-token-prob pipeline.
  • Use a tokenizer; understand subword units.
  • Generate text with sampling parameters.
  • Know causal vs masked models.
  • Have a calibration of model sizes.

Next: Hugging Face Transformers →
