
Deep Dive 11-Multimodal Foundations

A self-contained reference chapter patching the gap between a text-only 2026 curriculum and the natively-multimodal frontier of 2027. Targeted at applied AI engineers-backend/SRE/platform people pivoting into ML-who need the math, the architectural reasoning, and the production patterns at the same level of rigor as the LLM chapters.

Conventions: vectors are lower-case (x), matrices upper-case (W), batch dims explicit. Math is plain Unicode. Where a result is from a specific paper it is cited by author and year. Where a number is a frontier-model capability or price, it is hedged ("as of late 2024/2025") because those move quarterly-verify before quoting in production.


0. Why this chapter exists

The 2026 curriculum that this deep-dive lives next to was scoped around text-only LLMs: tokenization, transformer attention, RAG, evals, fine-tuning, agents. That scoping was correct in 2023 and defensible through 2024. By the end of 2024 it was already starting to bend, and by mid-2026 it is structurally incomplete. The reasons are concrete:

  1. Frontier models are natively multimodal. GPT-4o (OpenAI, May 2024), Claude 3.5 Sonnet with vision (Anthropic, June 2024), Gemini 1.5/2.0 (Google), and Llama 3.2 Vision (Meta, September 2024) all accept image input as a first-class modality. Most accept audio too. By 2027 there will be no production-grade frontier model that is text-only-the same way there is no production-grade web framework that does not support HTTPS.

  2. User input is no longer text. Real users paste screenshots, photograph receipts, drag in PDFs, and dictate voice messages. The text box is one of several inputs, not the primary one. A chat product launched in 2026 that does not accept image upload is missing table stakes.

  3. Production pipelines are collapsing. The 2022-pattern was OCR → parser → LLM (three systems, three failure modes, three sets of evals). The 2026-pattern is image → vision-LLM (one system, one set of evals). The economics flipped when per-image inference dropped under a cent and quality crossed the OCR-pipeline baseline on most document types.

  4. Every applied AI engineer ships at least one multimodal feature. Document understanding, screenshot-driven debugging, voice transcription for call centers, image moderation, slide-deck search-these are no longer specialized CV/speech roles. They are line items on the backend roadmap.

This chapter therefore covers, with rigor: vision encoders (the foundation), CLIP (the bridge), multimodal LLM architectures (the four families), audio (Whisper-style ASR), image generation (diffusion), video models (brief), evaluation, production patterns, cost economics, the open-weights landscape, multimodal RAG, and ends with worked exercises.

It is dense. It is meant to be re-read. It assumes the LLM deep-dives in this repo (transformers, attention, RAG, evals, fine-tuning) have been internalized.


1. The shape of the problem

1.1 What "multimodal" means precisely

A modality is a data type with its own native structure: text (1D sequence of discrete tokens), image (2D grid of continuous RGB triples), audio (1D continuous waveform, sampled), video (3D-height × width × time), point clouds, sensor traces. A multimodal model is one whose forward pass accepts at least two of these as input or emits at least two as output.

Three kinds of multimodality matter in practice:

  • Multimodal input, text output-the dominant 2024–2026 pattern. "Here is an image; describe it / answer questions about it / extract this field." GPT-4o vision, Claude 3.5 with vision, Gemini, LLaVA, Pixtral, Qwen2-VL.
  • Text input, image/audio/video output-generative. Stable Diffusion, DALL-E, Sora, ElevenLabs TTS.
  • Any-to-any-the 2026+ frontier. GPT-4o speech-to-speech with image grounding. Gemini 2.0 with native audio I/O. Less mature open-weights story.

This chapter prioritizes the first (because it is what most applied engineers ship) and the second (because the economics and self-hosting story are very different from text), and treats any-to-any as a near-future trajectory rather than a deployment target.

1.2 The fundamental representational question

Every multimodal architecture answers one question: how do we get image/audio/video data into the same representational space as text tokens, so that the same transformer machinery can attend over the union? There are essentially four answers, and Section 4 will lay them out. The rest of this chapter is mostly the consequences of those four choices.


2. Vision encoders-the foundation

The job of a vision encoder is to map an H × W × 3 image tensor to a sequence of D-dimensional embedding vectors that downstream layers (an LLM, a contrastive head, a classifier) can consume. Two architectural eras matter: CNNs (briefly) and ViT.

2.1 The CNN era-what to remember, what to forget

From roughly 2012 (AlexNet) to 2020 (ViT), convolutional networks dominated computer vision. The core inductive biases:

  • Locality: a convolution kernel of size k × k slid over the input only mixes pixels within k pixels of each other. Rationale: edges, textures, and small motifs are local.
  • Translation equivariance: the same kernel is applied at every spatial position. A cat in the top-left and a cat in the bottom-right activate the same feature detectors. This is hard-wired by weight sharing across spatial positions.
  • Hierarchy via pooling: stride-2 convolutions or max-pool layers downsample, doubling the effective receptive field per layer. Early layers see edges; deep layers see object parts.

Canonical architectures: VGG (Simonyan & Zisserman, 2014, very deep, very simple), Inception (Szegedy et al., 2014, multi-scale parallel branches), ResNet (He et al., 2015, residual connections that enabled training networks 50–152 layers deep). EfficientNet (Tan & Le, 2019) systematized the depth/width/resolution tradeoff.

What to remember from the CNN era:

  • Residual connections (ResNet) are universal-every modern architecture uses them, including transformers.
  • The locality + hierarchy combination is data-efficient. CNNs trained on ImageNet (1.3M images) reach respectable accuracy. ViT does not, without augmentation tricks.
  • Convolutions remain the right tool for very small data regimes, real-time edge inference, and as the patchifier at the front of a ViT.

What to forget: CNNs as the dominant feature extractor for general-purpose vision. Since ViT and its successors (Swin, ConvNeXt, EVA), the field has converged on transformer-style backbones for anything that touches a foundation model.

2.2 Vision Transformer (ViT)-Dosovitskiy et al., 2020

The ViT paper ("An Image is Worth 16×16 Words") collapsed vision into the same architectural template as language. The pipeline:

  1. Patchify the image. Take an input of shape H × W × 3 (commonly 224 × 224 × 3). Divide into a grid of non-overlapping P × P patches (commonly 16 × 16, sometimes 14 × 14). For 224 × 224 with P=16, the grid is 14 × 14 = 196 patches. Each patch is a flattened vector of length P × P × 3 = 768.

  2. Linearly project each patch. A single learned matrix W_patch ∈ R^(P²·3 × D) maps each patch vector to a D-dimensional token embedding. D is typically 768 (ViT-Base), 1024 (ViT-Large), or 1280 (ViT-Huge). After this step the image is a sequence of 196 D-dim tokens-structurally identical to a text token sequence.

  3. Prepend a [CLS] token. A learned D-dim vector is prepended, mirroring BERT. Its final-layer state can be used as a global image representation.

  4. Add positional embeddings. Learned, one per position. Without these the transformer is permutation-invariant and cannot recover the 2D structure. Note: 1D positional embeddings work fine for ViT despite the input being 2D-the model learns the 2D layout from the embeddings themselves.

  5. Apply a standard transformer encoder. L layers, each with multi-head self-attention + MLP, with LayerNorm and residuals. Identical to BERT.

  6. Pool. For classification, take the [CLS] token's final state and apply a linear head. For embedding (CLIP-style) or feeding into an LLM, you may keep the full sequence of patch tokens.

Concretely, ViT-Base/16 has 12 layers, 12 heads, D=768, MLP-dim=3072, ~86M parameters. Compute-wise it is dominated by attention (quadratic in sequence length = 197) and the MLP (linear in sequence length, but expensive per token).
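A minimal patchify-and-project sketch in PyTorch, with ViT-Base/16 shapes. The reshape/permute trick is one standard way to implement steps 1–2; weight initialization and naming here are illustrative, not from a specific codebase:

import torch

H = W = 224; P = 16; D = 768
img = torch.randn(3, H, W)                       # one RGB image, channels-first

# Step 1: patchify -> (196, 768), where 768 = P * P * 3
patches = (img.reshape(3, H // P, P, W // P, P)  # split each spatial dim into (grid, P)
              .permute(1, 3, 0, 2, 4)            # (14, 14, 3, 16, 16)
              .reshape(-1, 3 * P * P))           # (196, 768) flattened patches

# Step 2: linear projection to D-dim tokens
W_patch = torch.randn(3 * P * P, D) * 0.02
tokens = patches @ W_patch                       # (196, D) -- a "sentence" of image tokens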

Why ViT won

  • Scaling. ViT scales like text transformers. ViT-Huge with JFT-300M pretraining beats CNN baselines on ImageNet. Bigger model + more data keeps helping, well past where CNNs plateaued.
  • Uniform architecture. A single transformer codebase serves text, vision, audio (Whisper), and protein sequences (AlphaFold-style attention). This compounds engineering velocity across modalities.
  • Pretraining transfer. The same self-supervised pretraining ideas that work for text (masked modeling, contrastive) transfer to ViT-MAE (He et al., 2021), DINOv2 (Oquab et al., 2023), SigLIP. CNNs had no comparable self-supervised story.
  • It is the same architecture as the LLM. This is not aesthetic. It means image tokens and text tokens can share a transformer stack with no architectural mismatch-which is the whole basis of native-multimodal models in Section 4.

Patch arithmetic-be precise

This calculation comes up constantly. For an image of size H × W and patch size P × P (assume H, W divisible by P):

n_patches = (H / P) × (W / P)

Examples:

  • 224 × 224, P=16 → 14 × 14 = 196 patches.
  • 224 × 224, P=14 → 16 × 16 = 256 patches.
  • 384 × 384, P=14 → 27 × 27 = 729 patches (after resizing to a multiple of 14; see Exercise 1).
  • 512 × 512, P=16 → 32 × 32 = 1024 patches.
  • 1024 × 1024, P=14 → ~73 × 73 = 5329 patches (this is where attention's O(n²) bites).
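The same arithmetic as a helper (pure Python, mirroring the examples above):

def n_patches(h: int, w: int, p: int) -> int:
    """Patch count for an h x w image with p x p patches (h, w divisible by p)."""
    assert h % p == 0 and w % p == 0, "resize to a multiple of p first"
    return (h // p) * (w // p)

assert n_patches(224, 224, 16) == 196
assert n_patches(512, 512, 16) == 1024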

The token count is what determines compute and what determines per-image API pricing. A vision-LLM that charges "85 to 1100 tokens per image" is doing a resolution-dependent patch count plus some overhead. Section 12 returns to this.

2.3 ViT successors worth knowing

  • Swin Transformer (Liu et al., 2021): hierarchical ViT with shifted windows. Restores some of the CNN-style locality bias for dense-prediction tasks (segmentation, detection). Important for tasks beyond classification.
  • DINOv2 (Oquab et al., 2023, Meta): self-supervised ViT trained on ~142M curated images. Produces general-purpose features that work zero-shot for retrieval, segmentation, depth estimation. Free open weights.
  • SigLIP / SigLIP 2 (Zhai et al., 2023; 2024): sigmoid-loss CLIP variant; trains better at smaller batch sizes, often used as the vision encoder in modern open VLMs.
  • EVA / EVA-02 (Fang et al., 2022/2023): scaled MIM-pretrained ViT, strong feature extractor, used by some Qwen-VL variants.

For applied work in 2026: pick a SigLIP-2 or DINOv2 encoder for embedding/retrieval, and rely on whichever ViT the open VLM you self-host already uses for its vision tower (you don't usually swap that out).


3. CLIP-the bridge between text and vision

CLIP (Contrastive Language-Image Pretraining, Radford et al., 2021, OpenAI) is the single most consequential pretraining recipe in multimodal ML. Almost every open VLM in 2025 uses a CLIP-or-CLIP-descendant as its vision encoder, and CLIP's embedding space underpins multimodal retrieval, zero-shot classification, image search, and Stable Diffusion's text conditioning.

3.1 Setup

You have a dataset of (image, caption) pairs at scale-CLIP's was 400M pairs scraped from the web. You want to learn:

  • An image encoder f_img : Image → R^D
  • A text encoder f_txt : Text → R^D

…such that for matched (image_i, text_i), f_img(image_i) and f_txt(text_i) point in the same direction in R^D, and for mismatched pairs they don't.

Architecturally: f_img is a ViT (or CNN in CLIP's original paper); f_txt is a transformer; both end with a linear projection to a shared D-dim space. The output vectors are L2-normalized, so similarity is just a dot product (= cosine similarity).

3.2 The contrastive loss-derive it

For a batch of N (image, text) pairs, encode both:

I_i = f_img(image_i), normalized        for i = 1..N
T_j = f_txt(text_j),  normalized        for j = 1..N

Compute the N × N similarity matrix:

S_{ij} = (I_i · T_j) / τ

where τ is a learned temperature scalar (CLIP initializes log τ such that τ ≈ 0.07).

For each image i, treat the N candidate texts as a classification problem where the correct label is text i. The image-to-text loss for image i:

L_i2t(i) = -log( exp(S_{ii}) / Σ_j exp(S_{ij}) )

Symmetrically, the text-to-image loss for text j:

L_t2i(j) = -log( exp(S_{jj}) / Σ_i exp(S_{ij}) )

The total CLIP loss:

L = (1 / 2N) · Σ_i [ L_i2t(i) + L_t2i(i) ]

That is: standard cross-entropy over rows of S (image-anchored) plus over columns (text-anchored), averaged. Both directions matter-without the symmetric term the text encoder would not be regularized to map to the same space.

Why this works

  • The denominator forces every matched pair (i, i) to outscore every mismatched pair (i, j ≠ i). The shared embedding space is implicit: it is whatever space makes that classification problem easiest.
  • N matters. Larger batches give harder negatives (more wrong texts to outscore for each image). CLIP used batch sizes of 32,768. Open replications (OpenCLIP, LAION) confirmed: scale of batch and scale of data both compound.
  • The temperature τ is learned. If τ is too high, all pairs look similar; too low, gradients vanish. Letting it learn is one of CLIP's sneakily important details.

3.3 What you get from a trained CLIP

Zero-shot classification

You don't need to fine-tune. To classify an image into K classes:

  1. Encode the image: I = f_img(image), normalized.
  2. For each class k, write a prompt template: t_k = "a photo of a {class_k}". Encode: T_k = f_txt(t_k), normalized.
  3. Predict: argmax_k (I · T_k).

CLIP's ImageNet zero-shot accuracy was ~76% top-1 (CLIP ViT-L/14 @ 336px, as reported in Radford 2021)-competitive with a fully-supervised ResNet-50, with no ImageNet labels ever seen.

The "a photo of a {x}" template matters. Prompt ensembling-averaging text embeddings from many templates ("a photo of a {x}", "a picture of a {x}", "a {x}")-gives a few points of accuracy, exactly as with LLM prompting.

Open-vocabulary retrieval and detection

Because the embedding space is shared, you can build text-to-image search over a corpus of unlabeled images:

  • Index: pre-compute I_j for all images.
  • Query at runtime: encode the text query → T. Return top-k by I · T.

This is a complete image search engine in roughly 50 lines of code on top of an HNSW index. Pre-CLIP, this required either labeled tags or a captioning pipeline plus text search.
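The retrieval loop, sketched with a brute-force NumPy index for clarity (swap in HNSW/FAISS at scale). embed_image and embed_text stand in for CLIP encoders and are assumptions:

import numpy as np

class ImageSearch:
    def __init__(self, image_embs: np.ndarray):   # [n_images, D], L2-normalized
        self.index = image_embs

    def search(self, text_emb: np.ndarray, k: int = 10):
        scores = self.index @ text_emb            # cosine similarity (normalized vectors)
        top = np.argsort(-scores)[:k]
        return top, scores[top]

# Offline: index = ImageSearch(np.stack([embed_image(p) for p in image_paths]))
# Online:  ids, scores = index.search(embed_text("a red bicycle"))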

For object detection, OWL-ViT, GLIP, and Grounding DINO extend the idea: CLIP-style alignment between text prompts and image regions, enabling "detect any object I describe in words" without retraining per class.

Conditioning Stable Diffusion

CLIP's text encoder is what Stable Diffusion 1.x and 2.x use to condition the diffusion U-Net (Section 7). The "prompt" you type into Stable Diffusion is encoded by a frozen CLIP text encoder; that vector cross-attends into the denoising network. SDXL added a second text encoder (CLIP-L + OpenCLIP-G) for richer prompts.

3.4 What CLIP does not solve

  • Fine-grained reasoning. CLIP knows "a photo of a cat" vs "a photo of a dog" cleanly. It does not reliably know "a photo of a cat sitting on top of a dog"-compositional spatial relations are weak. This is partly why VLMs add an LLM on top.
  • OCR. Vanilla CLIP is poor at reading text in images. SigLIP and follow-ups improved this. For document understanding, you want a VLM with a stronger OCR-trained vision backbone (Qwen2-VL, InternVL, GPT-4o, Claude 3.5).
  • Counting. "Three apples" vs "four apples" embeds nearly identically in CLIP space. Known limitation; LLM-based VLMs partially mitigate.

These limits motivate Section 4: don't stop at CLIP; bolt an LLM onto its image encoder.


4. Multimodal LLM architectures-the four families

Once you have a vision encoder (ViT or CLIP-ViT) and an LLM, the architectural question is how to fuse them. There are four common answers, with different cost/quality tradeoffs.

4.1 Late fusion (a.k.a. adapter / projector style)-LLaVA pattern

The cheapest, most modular, most replicated approach. Architecture:

image → ViT (frozen) → image features (n_patches × D_v)
                            ↓
                        MLP projector (learned)
                            ↓
                image-as-tokens (n_patches × D_lm)
                            ↓
[image tokens] + [text tokens] → LLM (mostly frozen) → output

The MLP projector is small-two linear layers with a GELU between them is the LLaVA-1.5 default. Its only job is to translate image-feature vectors into the LLM's token-embedding distribution so the LLM can consume them as if they were extra tokens.
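As an nn.Module, the projector is a few lines (a minimal sketch with LLaVA-1.5-style default dimensions; ViT features are assumed to arrive as [batch, n_patches, 1024]):

import torch.nn as nn

class Projector(nn.Module):
    """2-layer MLP mapping ViT features into the LLM's token-embedding space."""
    def __init__(self, d_vision: int = 1024, d_lm: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_vision, d_lm),
            nn.GELU(),
            nn.Linear(d_lm, d_lm),
        )

    def forward(self, image_features):            # [B, n_patches, d_vision]
        return self.net(image_features)           # [B, n_patches, d_lm]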

Training is two-stage:

  1. Feature alignment: freeze ViT and LLM; train only the projector on millions of caption pairs ("describe this image"). Cheap-the projector is ~tens of millions of parameters.
  2. Instruction tuning: unfreeze the LLM (and optionally the ViT); train end-to-end on (image, instruction, response) triples. Datasets: LLaVA-Instruct-150K, ShareGPT4V, etc.

Pros: cheap, modular (swap LLMs without re-training the image stack), open-weights friendly, the entire LLM's capabilities transfer for free.

Cons: capacity bottleneck is the projector. The LLM never sees raw pixels-only what the ViT and the projector chose to surface. Fine-grained tasks (small text, diagrams) suffer.

This is the architecture you should assume by default when someone says "open vision-language model."

4.2 Cross-attention fusion-Flamingo pattern, Llama 3.2 Vision

Keep two separate streams. The text stream is a (mostly frozen) LLM. The image stream is a ViT producing patch tokens. Insert cross-attention layers into the LLM at intervals-so at layer k, text tokens additionally attend over image tokens via a new learned cross-attention block:

text_h_{k+1} = LM_layer_k(text_h_k) + GatedCrossAttn(text_h_k, image_tokens)

The "gated" part (Flamingo, Alayrac et al., 2022) means the cross-attention is multiplied by a learned scalar that is initialized at zero-so at training start the model behaves exactly like the original text-only LLM, and the cross-attention contribution learns in gradually. This stabilizes training enormously when adapting a strong text LLM to vision.

Llama 3.2 Vision (Meta, September 2024) uses this pattern: take Llama 3.1 text weights, add cross-attention blocks every 4 layers, freeze most of the text weights, train the cross-attention + adapter on image-text data. The 11B and 90B vision variants share the text-side weights with their text-only siblings.

Pros: the original LLM is preserved (so text-only quality does not regress), more capacity than a thin projector, clean separation of streams, easy to scale image resolution independently.

Cons: more parameters to train than late fusion; engineering complexity (custom attention patterns); image and text streams are still separate-no early mixing.

4.3 Early fusion-Chameleon, partly Gemini

Tokenize the image into discrete tokens (typically with a VQ-VAE or VQ-GAN style image tokenizer that maps patches to codebook indices). Then interleave image tokens and text tokens into a single sequence, with a single transformer trained from scratch on the union.

Chameleon (Meta, 2024) does this end-to-end: shared vocabulary across text and image tokens, single autoregressive objective over the interleaved sequence. The model emits both text and image tokens (the latter decoded back to pixels by the same VQ-VAE).

Pros: cleanest information flow-every layer sees both modalities. Highest quality ceiling, especially for tasks requiring tight image-text reasoning. Same model can generate images.

Cons: must train from scratch on a curated mixed corpus-you don't get to bolt vision onto an existing strong text LLM. Image tokenization introduces information loss (VQ-VAE bottleneck). Engineering difficulty is high.

4.4 Native multimodal-GPT-4o (rumored), Gemini, future-default

A continuum of "early fusion" where the model is trained from scratch with all modalities in scope from day one-text, image, audio, possibly video-with whatever per-modality encoders/decoders are needed and a shared transformer backbone over the union of token streams.

GPT-4o's audio capabilities, in particular, are believed to be native (audio in, audio out, end-to-end through a single model) rather than the older pipeline of Whisper → LLM → TTS. The end-to-end nature is what enables sub-300ms latency for voice conversation.

Gemini was designed from the ground up as multimodal (per Google's published descriptions) with text, image, audio, and video in the training mix.

Pros: no impedance mismatch between modalities; lowest latency; highest quality on cross-modal reasoning; the ability to generate in multiple modalities.

Cons: only feasible at frontier-lab scale. Open-weights replications are catching up but lag.

4.5 The decision matrix

For an applied engineer choosing what to consume or fine-tune:

| Architecture | When to use | Examples |
| --- | --- | --- |
| Late fusion (LLaVA) | You want to fine-tune cheaply on a custom domain; you have an existing LLM you like. | LLaVA, BakLLaVA, MiniGPT-4, ShareGPT4V |
| Cross-attention | Open-weights model with strong text quality preserved. | Llama 3.2 Vision, Flamingo (research) |
| Early fusion | You need image generation + understanding in one model, and you have research-team scale. | Chameleon |
| Native multimodal | You are an API consumer; pick the strongest model and pay. | GPT-4o, Claude 3.5 Sonnet vision, Gemini |

For 95% of production work in 2026 the answer is: consume an API for hard tasks, self-host a late-fusion or cross-attention open model for high-volume narrow tasks. Section 13 gives the specific model menu.


5. LLaVA-style architecture in detail-the most common open pattern

Because LLaVA (Liu et al., 2023, "Visual Instruction Tuning") is the de facto open-weights template and the one you are most likely to reproduce or fine-tune, this section walks through it end to end.

5.1 The forward pass

Inputs: an image x_img and a tokenized text prompt x_txt (a list of token IDs).

  1. Vision encoder. Run x_img through a CLIP ViT (LLaVA-1.5 used CLIP ViT-L/14 @ 336px). Take the penultimate layer's patch tokens, not the final layer (the final layer is too "classification-y" and discards spatial detail). For 336×336 with patch=14, you get 24×24 = 576 image features, each of dim 1024.

  2. Projector. Pass each image feature through a 2-layer MLP:

    z_i = W_2 · GELU(W_1 · I_i + b_1) + b_2

…with output dim D_lm = the LLM's hidden size (e.g., 4096 for Llama-2-7B). Now you have 576 "visual tokens," each shaped like an LLM input embedding.

  3. Sequence assembly. Construct the LLM input sequence as:

    [BOS] [system_text_embeds] [image_token_embeds × 576] [user_text_embeds] [assistant_text_embeds...]

The image tokens are inserted at the position marked by a sentinel like <image> in the prompt template.

  4. LLM forward. Standard autoregressive decoder. The image tokens are part of the context and are attended over normally. Generation proceeds token by token over the assistant's text response.

That is the full architecture. There is no architectural novelty in the LLM-it is Llama or Vicuna with a pre-pended visual context of 576 extra tokens.
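Sequence assembly, sketched: splice the visual tokens into the text embedding sequence at the <image> sentinel. embed_tokens is the LLM's embedding table; IDs and shapes are illustrative:

import torch

def assemble(input_ids, image_sentinel_id, visual_tokens, embed_tokens):
    # input_ids: [T] token IDs containing one sentinel; visual_tokens: [576, D]
    pos = (input_ids == image_sentinel_id).nonzero()[0].item()
    before = embed_tokens(input_ids[:pos])        # [pos, D]
    after  = embed_tokens(input_ids[pos + 1:])    # [T - pos - 1, D]
    return torch.cat([before, visual_tokens, after], dim=0)  # [T - 1 + 576, D]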

5.2 The two-stage training recipe

Stage 1-Feature alignment (a.k.a. projector pretraining).

  • Data: ~558K image-caption pairs (LLaVA used a filtered subset of CC-3M).
  • Frozen: ViT, LLM. Trainable: projector only.
  • Objective: standard next-token prediction on the caption, conditioned on the image.
  • Cost: hours on a single 8×A100 node.
  • Why: align the visual feature distribution with the LLM's expected input embedding distribution. Without this, the LLM sees pseudo-tokens that are out-of-distribution and treats them as noise.

Stage 2-Instruction tuning.

  • Data: ~150K-1.2M (image, instruction, response) triples. LLaVA-Instruct-150K is GPT-4-generated by feeding it COCO captions and asking for instruction/response pairs about the image.
  • Trainable: projector + LLM (full fine-tune or LoRA). ViT optionally.
  • Objective: next-token prediction on the response.
  • Cost: a day or so on 8×A100; cheaper with LoRA.

The result is a model that follows visual instructions: "What is on the menu?", "Describe the chart," "Read the text in the screenshot."

5.3 Resolution handling-the dirty secret

Vanilla LLaVA at 336×336 is fine for natural images and useless for documents (text in a screenshot is ~10 pixels tall and unreadable). The fixes:

  • Higher resolution. LLaVA-1.5-HD, LLaVA-NeXT (1.6) increased to 672×672, 1344×336, etc. More patches = more compute, but readable text.
  • AnyRes / dynamic tiling. LLaVA-NeXT, Qwen2-VL, InternVL2: split the input into multiple tiles at the model's native resolution, run each through the ViT, concatenate the resulting visual tokens. A 1344×1344 image becomes 4×4 tiles of 336×336 → 16 × 576 = 9216 image tokens. Expensive but accurate for documents.
  • Native dynamic resolution. Qwen2-VL takes images at any aspect ratio and resolution natively, computes the patch grid dynamically, and feeds the resulting variable-length sequence to the LLM.

For applied work: if your inputs are documents/screenshots, use a model that supports either tiling or native dynamic resolution. Vanilla 336×336 LLaVA is a research artifact, not a production system.


6. Audio-the speech recognition foundation

Audio is the second modality every applied AI engineer touches. The dominant recipe is Whisper-style, and the dominant open model is Whisper itself.

6.1 From waveform to spectrogram

Audio enters as a waveform: a 1D sequence of amplitudes sampled at 16,000 Hz (the standard for speech). One second = 16,000 samples. A 30-second utterance is 480,000 samples-too long to feed directly to a transformer.

The pre-processing pipeline:

  1. Resample to 16 kHz if not already.
  2. Compute the short-time Fourier transform (STFT). Window the signal (e.g., 25 ms windows hopping every 10 ms), FFT each window. This produces a complex-valued spectrogram of shape (time × frequency).
  3. Take magnitudes, then mel-filter. The mel scale is a perceptual frequency scale that is roughly linear below 1 kHz and roughly logarithmic above-matching how humans hear pitch. A mel-filterbank is a set of (typically 80) overlapping triangular filters spaced on the mel scale. Apply them to the magnitude spectrogram → an 80 × T mel-spectrogram.
  4. Log compression. Take log(mel + ε). This compresses dynamic range, again matching human perception (loudness is logarithmic).

The result is an 80 × T tensor for T ≈ 3000 (for 30 s of audio, 10 ms hop). This is the input Whisper consumes.
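The pipeline with torchaudio (a sketch; parameter values match the Whisper-style front end: 25 ms windows = 400 samples at 16 kHz, 10 ms hop = 160 samples, 80 mel bins; the filename is illustrative):

import torch
import torchaudio

waveform, sr = torchaudio.load("utterance.wav")            # [channels, n_samples]
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=400, hop_length=160, n_mels=80
)(waveform.mean(dim=0))                                    # mono -> [80, T]

log_mel = torch.log(mel + 1e-6)                            # log compression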

Why mel and not raw FFT?

  • Human hearing's frequency resolution is logarithmic. A mel-scale concentrates representational capacity where humans (and speech) live (roughly 100 Hz – 4 kHz).
  • Empirically: every successful ASR system from DeepSpeech to Whisper to Conformer uses log-mel features. Models trained on raw waveforms (wav2vec 2.0) work but are more compute-intensive per second of audio.

6.2 Whisper architecture-Radford et al., 2022

Whisper (OpenAI) is an encoder-decoder transformer trained on ~680,000 hours of multilingual, multitask web audio.

Encoder.

  1. Input: 80-channel log-mel spectrogram, 30-second chunk → 80 × 3000.
  2. Two 1D convolution layers, kernel 3, the second with stride 2; these also project the 80 channels up to the model width. After this, the time dim is downsampled by 2 → 1500 frames. Each "audio token" now represents ~20 ms (10 ms × 2).
  3. Add sinusoidal positional embeddings.
  4. Standard transformer encoder, L layers (4 to 32 depending on model size).

Output: 1500 audio-token embeddings.

Decoder.

  1. Standard autoregressive transformer decoder.
  2. Cross-attends over the 1500 audio tokens.
  3. Vocabulary is BPE-tokenized text + special tokens for the multitask interface (Section 6.3).

Sizes (as in the Whisper paper): tiny (39M), base (74M), small (244M), medium (769M), large (1.5B), large-v2/v3 (1.5B with more data).

6.3 The multitask interface

Whisper's clever trick: encode the task into the decoder prompt via special tokens, so a single model handles transcription, translation, language ID, and voice activity:

<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|> ... text tokens ... <|endoftext|>
  • <|en|> / <|fr|> / etc.-language tag (plus a no-speech token, <|nospeech|>, as the VAD signal).
  • <|transcribe|> vs <|translate|>-output in the source language, or translated to English.
  • <|notimestamps|> vs timestamp tokens-emit timestamps at segment boundaries or not.

At inference, the user sets these prefix tokens to choose the task. Same weights, four tasks. This is the same idea as instruction tuning for LLMs, predating it slightly in the audio domain.

6.4 Production gotchas

  • Chunking. Whisper takes 30 s chunks. Longer audio: chunk + concatenate; handle word boundaries with VAD or sliding overlap. Libraries like whisperX and faster-whisper do this.
  • Language ID first. If your input language is unknown, run language ID on the first chunk before transcribing-wrong language tag tanks accuracy.
  • Hallucinations on silence. Whisper-large is famous for hallucinating "thanks for watching" or "subtitles by …" on silence, because YouTube training data contains those phrases over silent endings. Mitigate with VAD pre-filtering and condition_on_previous_text=False.
  • faster-whisper. A CTranslate2 reimplementation; ~4× faster than the reference at the same accuracy. Use it.
  • Streaming. Vanilla Whisper is not streaming. For real-time, use streaming variants (whisper_streaming, NVIDIA Parakeet, AssemblyAI's streaming API) or accept 1–5 s latency from chunked processing.

6.5 Beyond Whisper

  • Conformer-style (Gulati et al., 2020): conv + transformer hybrid; lower latency. NVIDIA Parakeet is a 2024 strong open variant.
  • wav2vec 2.0 / HuBERT (Meta): self-supervised pretraining on raw waveforms; basis for some VLMs that consume audio directly.
  • TTS (the reverse direction). ElevenLabs, OpenAI TTS, F5-TTS, Bark. Diffusion-based and autoregressive variants. Cheap relative to LLM tokens.
  • Speech-to-speech. GPT-4o's voice mode is end-to-end (no Whisper-LLM-TTS pipeline). Open replications: Moshi (Kyutai, 2024), Mini-Omni. End-to-end avoids cumulative latency and preserves prosody/emotion.

7. Image generation-diffusion foundations

The text-to-image world runs on diffusion models. Understanding them is non-optional for an applied engineer who will, at some point, ship an image-generation feature.

7.1 The big idea

Train a model to denoise images. To generate, start from pure noise and iteratively denoise. The ingenious part is how you train denoising: by adding known noise to real images and asking the model to predict the noise back.

Two processes: forward (noising, fixed, no parameters) and reverse (denoising, learned).

7.2 The forward process-DDPM (Ho et al., 2020)

Define a sequence of noise levels β_1, …, β_T (a schedule, typically linear or cosine, with β_t small and growing slowly; T = 1000 is canonical).

Define α_t = 1 − β_t and the cumulative product ᾱ_t = Π_{s=1..t} α_s. ᾱ_t shrinks from ~1 (almost no noise) to ~0 (almost all noise) as t grows from 1 to T.

The forward process adds Gaussian noise to a clean image x_0 to produce x_t:

x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε,    ε ~ N(0, I)

This is a closed form-you can sample x_t at any t directly from x_0 in one step (no iteration needed). Crucially, you know the noise ε you added.

At t = T, x_T is essentially pure Gaussian noise, indistinguishable from N(0, I).

7.3 The reverse process-what the model learns

The reverse process tries to undo this:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t))

DDPM parameterizes this by predicting the noise ε_θ(x_t, t) that was added, rather than predicting x_{t−1} or x_0 directly. Empirically, ε-prediction is the most stable parameterization.

Training loss

For each training example, sample x_0 from the dataset, sample t uniformly from {1, …, T}, sample ε ~ N(0, I), compute x_t in closed form, and minimize:

L = E_{x_0, t, ε} [ ‖ε − ε_θ(x_t, t)‖² ]

That is the entire training objective. A simple mean-squared error between the actual noise and the predicted noise. The model learns to look at any noisy image at any noise level and predict what noise was added.
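The objective in code (a minimal sketch; model(x_t, t) stands in for ε_θ, and alpha_bar is the precomputed cumulative product ᾱ):

import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, alpha_bar):
    # x0: [B, ...] clean images; alpha_bar: [T] cumulative products
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,))                          # sample t uniformly
    eps = torch.randn_like(x0)                             # the known noise
    ab = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))     # broadcast to x0's shape
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps           # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)                  # predict the noise back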

Sampling-DDPM ancestral sampler

Start from x_T ~ N(0, I). For t = T, T−1, …, 1, compute:

x_{t−1} = (1 / √α_t) · ( x_t − (β_t / √(1 − ᾱ_t)) · ε_θ(x_t, t) ) + σ_t · z

where z ~ N(0, I) for t > 1, z = 0 for t = 1, and σ_t is a noise term (DDPM uses σ_t² = β_t).

Each step nudges x toward something less noisy, biased by the model's noise prediction. After T steps, x_0 is a generated image.
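And the ancestral sampling loop, directly transcribing the update rule (a sketch; model is ε_θ, beta a [T] tensor holding the schedule):

import torch

@torch.no_grad()
def ddpm_sample(model, shape, beta):
    alpha = 1.0 - beta
    alpha_bar = torch.cumprod(alpha, dim=0)
    x = torch.randn(shape)                           # x_T ~ N(0, I)
    for t in reversed(range(len(beta))):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = model(x, torch.full((shape[0],), t))
        x = (x - beta[t] / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt() \
            + beta[t].sqrt() * z                     # sigma_t^2 = beta_t
    return x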

T = 1000 steps is slow. DDIM (Song et al., 2020) showed you can take a deterministic, non-Markovian path through the same ε_θ network with as few as 20–50 steps and get comparable quality. Modern samplers (DPM-Solver++, Euler-A, Heun) push this further.

7.4 Latent diffusion-the Stable Diffusion innovation (Rombach et al., 2022)

Doing diffusion in pixel space is expensive: a 512×512 image is 786,432 dimensions per step, and you do 50+ steps. Latent Diffusion Models (LDMs) instead:

  1. Train a VAE (or VQ-VAE) that compresses 512×512×3 images to a 64×64×4 latent z. That is 8× spatial downsampling per side and a ~48× reduction in total dimensionality (786,432 → 16,384 values).
  2. Run the diffusion process in latent space-the U-Net denoises 64×64×4 latents instead of 512×512×3 pixels.
  3. After denoising, decode the final z back to an image with the VAE decoder.

Compute drops by ~50× with minor quality loss. Stable Diffusion 1.x, 2.x, SDXL, and SD3 all follow this template.

7.5 Conditioning-text-to-image

You don't want to generate a random plausible image; you want one matching a prompt. The standard mechanism:

  1. Encode the prompt with a frozen text encoder (CLIP text encoder for SD 1/2; CLIP-L + OpenCLIP-G for SDXL; T5-XXL for SD3 / FLUX). Output: a sequence of N text-token embeddings.
  2. The U-Net's blocks include cross-attention layers where image-feature tokens (queries) attend over text-token embeddings (keys, values). This lets the prompt guide which features to denoise toward.

So ε_θ becomes ε_θ(x_t, t, c) where c is the text embedding sequence.

7.6 Classifier-free guidance-derive it

Naively conditioning on text gives weak adherence to the prompt. Classifier-free guidance (CFG; Ho & Salimans, 2022) is the trick that makes text-to-image actually follow prompts.

Train the same network on both conditional and unconditional inputs (drop the prompt with probability ~10% during training, replacing with a null token). At inference, run the network twice:

ε_uncond = ε_θ(x_t, t, ∅)
ε_cond   = ε_θ(x_t, t, c)

…and combine:

ε_guided = ε_uncond + w · (ε_cond − ε_uncond)

w is the guidance scale, typically 5 to 12. w = 1 is standard conditional; w = 0 is unconditional; w > 1 amplifies the conditioning direction. Higher w → tighter prompt adherence and lower diversity / saturated colors / "fried" look.

Geometrically: ε_cond − ε_uncond is the direction in noise space that "points toward the prompt." Scaling that direction up amplifies the prompt's influence on the trajectory.
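CFG in code is two forward passes and an extrapolation past the unconditional prediction (a sketch; model(x_t, t, c) is the conditional network, null_cond the learned null embedding):

def cfg_epsilon(model, x_t, t, cond, null_cond, w: float = 7.5):
    eps_uncond = model(x_t, t, null_cond)   # prompt dropped
    eps_cond   = model(x_t, t, cond)        # prompt attended via cross-attention
    return eps_uncond + w * (eps_cond - eps_uncond)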

CFG doubles inference cost (two forward passes per step). Distillation tricks (LCM, Hyper-SD, SDXL Turbo) reduce step count and amortize CFG cost; some skip CFG by training a single guided network.

7.7 Modern variants

  • DiT-Diffusion Transformer (Peebles & Xie, 2022): replace the U-Net with a transformer over patchified latents. Scales better. Used by SD3, FLUX, Sora-class video models.
  • Flow Matching / Rectified Flow (Lipman et al., 2022; Liu et al., 2022): a different mathematical framing where the network predicts a velocity field mapping noise to data along straight paths. Same intuition, simpler training, often fewer sampling steps. SD3 and FLUX use this.
  • Consistency models (Song et al., 2023) and Latent Consistency Models (Luo et al., 2023): train a model that predicts x_0 directly in a single (or few) steps, by distilling a multi-step diffusion teacher. 1-4 step generation; quality cost.
  • Adversarial distillation (SDXL Turbo, SD3-Turbo): combine consistency-style distillation with a GAN discriminator. Single-step, near-multi-step quality.

For an applied engineer in 2026:

  • For prototyping image generation: use an API (DALL-E 3, Midjourney, FLUX-pro, Imagen)-fastest to ship.
  • For self-hosted: FLUX.1-dev or SDXL with LoRA fine-tuning. The quality gap to APIs is small for most domains.
  • For real-time / edge: SDXL-Turbo or LCM-distilled SDXL, or Stable Diffusion 1.5 with LCM-LoRA. 1–4 steps on consumer GPUs.


8. Video models-brief, the 2024–2026 frontier

Video is image generation with a time axis. The architectural templates:

  • Spatiotemporal patches. Tokenize video into 3D patches (H × W × T); run a transformer (DiT-style) with diffusion over them. Sora (OpenAI, February 2024) introduced this at frontier scale; the technical report described "spacetime patches" but did not release weights.
  • Open-weights as of 2025: CogVideoX (Tsinghua / Zhipu, 2024), HunyuanVideo (Tencent, December 2024), Mochi-1 (Genmo, 2024), Open-Sora (HPC-AI, 2024), Wan (Alibaba, 2025). Quality is real but not Sora-tier; the gap is closing each quarter.
  • Conditioning. Text-to-video, image-to-video (animate a still), video-to-video (edit a clip). Same CFG mechanics as images.

Compute cost: a video of T frames is roughly T× the cost per frame, modulated by temporal compression in the latent space (typical: 4× temporal compression in the VAE). 5-second 720p clips cost meaningfully more than single images-both in inference dollars and training compute. As of late 2024/2025, expect API pricing on the order of dollars per generated video clip (verify), and self-hosting video on consumer GPUs is feasible but slow.

For an applied engineer in 2026: video generation is rarely the right tool unless your product is video itself. For most apps, generated images, animations of static images, or simple parametric motion suffice. Watch the space; adopt when Sora-class quality is open-weights and consumer-GPU inferenceable.


9. Multimodal evaluation-what to measure and how

Evals for multimodal systems split along two axes: what kind of output (text answer vs generated image vs both) and what aspect of capability (perception, reasoning, generation faithfulness, hallucination).

9.1 Vision-LLM perception evals

Does the model see the image correctly? Standard benchmarks (as of late 2024/2025; verify current state of the art):

  • MMMU (Yue et al., 2023)-11.5K college-level problems across 30 subjects, image + question + multiple choice. Tests both perception and domain knowledge. Frontier models in 2024–2025 are in the 60–80% range; humans high 80s.
  • MMBench-multiple-choice across perception and reasoning sub-skills (object localization, counting, spatial relations, OCR).
  • MathVista-math problems requiring chart/diagram understanding. OCR + reasoning. Discriminative for chart-heavy applications.
  • DocVQA, InfographicVQA-document QA. Discriminative for document understanding.
  • ChartQA-chart QA specifically.
  • AI2D-diagram understanding (textbook-style scientific diagrams).
  • OCRBench-OCR-specific.

Pick the benchmark closest to your domain. A model that scores 85% on MMMU but 50% on DocVQA is the wrong choice for invoice processing.

9.2 Reasoning over image + text-LLM-as-judge

For open-ended VQA ("describe the chart and what conclusion it supports"), there is no clean ground-truth answer. Use the same LLM-as-judge framework from the text RAG eval chapter: rubric-based scoring on a held-out set, with periodic human spot-checks to recalibrate.

Specific multimodal-aware judging tips:

  • The judge LLM should see both the image and the candidate response. Use a strong vision-LLM judge (GPT-4o, Claude 3.5 Sonnet vision)-text-only judges miss visual hallucinations.
  • Score along axes: factual correctness about the image, completeness, specificity, hallucination penalty.
  • Pairwise comparisons are more reliable than absolute scores at the high end.

9.3 Hallucination-the unique multimodal failure mode

Vision LLMs hallucinate visual content. Common failures:

  • Object hallucination: claims an object is in the image that isn't.
  • Attribute hallucination: gets color, count, or position wrong despite the object being correctly identified.
  • OCR hallucination: invents plausible-looking text that the image does not contain. Especially bad with low-resolution or blurry text.
  • Anchoring on text prompts: "Is there a cat in this picture?" → "Yes" even when there isn't (sycophancy + visual prior).

Eval techniques:

  • POPE (Polling-based Object Probing Evaluation; Li et al., 2023)-yes/no questions about the presence of objects, with adversarial "is there a chair?" when there isn't.
  • HallusionBench-adversarial images and edited variants.
  • Negative-image probes: in your own eval set, include images where the prompted entity is absent. Measure refusal rate.

Production posture: for high-stakes uses (medical imaging, legal docs, content moderation), assume the VLM hallucinates 1–10% of the time and design accordingly-confidence thresholds, secondary checks, human review on disagreement.

9.4 Image generation eval

Three layers:

  • Distributional / aesthetic. FID (Fréchet Inception Distance, Heusel et al., 2017) compares the distribution of generated and real images via Inception-v3 features. Lower is better. Crude-improving FID does not always mean prettier pictures.
  • Prompt-image alignment. CLIP-Score: average cosine similarity between CLIP image embedding of the generation and CLIP text embedding of the prompt. Easy to compute; saturates at high quality. T2I-CompBench, GenAI-Bench, GenEval probe specific compositional capabilities (counting, color binding, spatial relations).
  • Human eval. The gold standard. Pairwise A/B; report Elo (e.g., Artificial Analysis, Imagen Arena). Discriminative for fine-grained quality.

For a production text-to-image app, build a domain-specific golden set of prompts (your real user queries), generate with candidate models, and have your team vote pairwise.

9.5 Audio eval

  • WER (Word Error Rate) for transcription: word-level Levenshtein distance / reference word count (sketch after this list).
  • Diarization Error Rate for speaker attribution.
  • MOS (Mean Opinion Score) for TTS quality-human ratings 1-5. Crowd-sourced.
  • Latency p50/p95 for streaming systems. The user-perceived metric.
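WER, sketched (word-level edit distance over reference length; a minimal dynamic-programming implementation):

def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i-1][j-1] + (r[i-1] != h[j-1])          # substitution (or match)
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)  # deletion, insertion
    return d[len(r)][len(h)] / max(len(r), 1)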

9.6 The systems eval-end-to-end task success

The above are component evals. For shipping product, build a task-level eval: given a multimodal user request, did the system produce the right end output? E.g., "Given a screenshot of an invoice, did we correctly extract the totals, vendor, and due date as JSON?" This subsumes all component failures and is the metric that correlates with user retention.


10. Production patterns for multimodal

The recipes that show up in real codebases.

10.1 Document understanding (the killer app)

Replace the OCR → parser → LLM pipeline with image → VLM. Pattern:

  1. Convert PDF pages to images (pdf2image, PyMuPDF). 200 DPI is usually enough; 300 DPI for tiny text.
  2. Send each page image to a VLM with a structured prompt: "Extract the following fields as JSON: {schema}. If a field is absent, use null."
  3. Validate JSON with Pydantic / JSONSchema. Re-prompt on failure.
  4. For multi-page docs, either stitch into a single VLM call (if context allows) or process per page and merge in a second LLM call.
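Steps 1–3 as a sketch (pdf2image + pydantic v2; call_vlm is a hypothetical callable wrapping your provider's vision API, and the Invoice schema is illustrative):

from pdf2image import convert_from_path
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str | None
    total: float | None
    due_date: str | None

PROMPT = "Extract the following fields as JSON: vendor, total, due_date. If a field is absent, use null."

def extract(pdf_path: str, call_vlm):                     # call_vlm(image, prompt) -> str, assumed
    results = []
    for page in convert_from_path(pdf_path, dpi=200):     # PIL images, one per page
        raw = call_vlm(page, PROMPT)
        try:
            results.append(Invoice.model_validate_json(raw))
        except ValidationError:
            raw = call_vlm(page, PROMPT + " Return ONLY valid JSON.")  # one re-prompt
            results.append(Invoice.model_validate_json(raw))
    return results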

Gotchas:

  • Tables with rotated headers, merged cells, multi-line cells: VLMs handle these much better than OCR + heuristics, but still imperfectly. Have a human-review escape hatch.
  • Hand-written content: variable. GPT-4o and Claude 3.5 Sonnet handle clean handwriting; messy handwriting is still hard.
  • Privacy / on-prem: self-host Qwen2-VL or InternVL for sensitive docs.

This pattern collapses what used to be a multi-vendor stack (Tesseract + AWS Textract + custom regex + an LLM) into one LLM call. Cost can go either way (Section 11)-verify per use case.

10.2 Visual question answering / screenshot debugging

User pastes a screenshot of an error, a UI bug, a chart they don't understand. Vision LLM answers. This is the most common consumer-facing multimodal pattern in 2024–2026 chat products.

Engineering details:

  • Auto-detect when a user message contains an image and route to a vision-capable model. Don't always use vision (more expensive); only when needed.
  • For error-screenshot debugging, prompt the model to transcribe the visible text first, then answer. This forces the OCR step explicitly and reduces hallucinated diagnoses.

10.3 Visual classification at scale

If you have, say, 10M product images to classify into a fixed taxonomy:

  • Few classes (≤1000), fixed: fine-tune a CLIP-style image classifier or a small ViT. Pennies per million inferences on a single GPU. Cheaper and faster than VLM-per-image.
  • Many classes, evolving, complex semantics: VLM in a structured-output prompt. More expensive per image but no fine-tuning loop.
  • Hybrid: VLM labels a few thousand examples → train a cheap CLIP classifier on those labels → run CLIP at production scale → spot-check with VLM.

The crossover point is roughly: above ~1M inferences/month with a stable taxonomy, fine-tuning a classifier wins. Below that, VLM wins on engineering simplicity.

10.4 Audio transcription pipelines

Standard recipe (2026):

  • Self-host faster-whisper (large-v3) on a single GPU; ~80× real-time on an A100, ~30× on consumer GPUs.
  • Or API: OpenAI Whisper API, AssemblyAI, Deepgram. They compete on price (~$0.006/minute as of 2025; verify) and features (diarization, language coverage, streaming).
  • For streaming UX: use a streaming-capable provider or a streaming wrapper around Whisper.
  • For highest quality: send the transcript through an LLM for cleanup, punctuation, speaker labeling.

10.5 Speech-to-speech

Emerging pattern; hard to get right. Latency is the dominant constraint-humans expect ~300 ms response time in voice. This rules out long pipelines.

Options:

  • Native end-to-end (GPT-4o voice mode, Gemini Live, Moshi): low latency, expressive prosody, but vendor-locked.
  • Pipeline with aggressive optimization: streaming Whisper → fast LLM (Groq, Cerebras, or a distilled local model) → streaming TTS (Cartesia, ElevenLabs Turbo). 500 ms–1 s round-trip is achievable.

For most products in 2026, voice mode is a "nice to have"-adopt when the product genuinely benefits from voice and you have the latency budget.

10.6 Image generation in product

Ship-quality patterns:

  • Asset generation (marketing images, illustrations): API-DALL-E 3, FLUX-pro, Midjourney. Latency-tolerant.
  • User-generated content (avatars, generated stickers): self-hosted SDXL or FLUX.1-dev with LoRA fine-tuning per persona.
  • Real-time interactive (Krea-style live drawing): SDXL-Turbo or LCM-distilled SDXL on a beefy GPU; sub-second per image.
  • Safety: every image-generation product needs a content moderation layer (NSFW classifier on output, prompt-input filtering). This is non-optional.


11. Cost economics

All numbers below are as of late 2024 / 2025 and move quarterly. Verify before quoting in any contract.

11.1 Vision input pricing

Major APIs charge per image as a function of resolution. Rough rules of thumb:

  • OpenAI (GPT-4o, GPT-4-Vision): images are tokenized at ~85 tokens for "low detail" and up to ~1100+ tokens for high-detail or large images, billed at the same per-token rate as text.
  • Anthropic (Claude 3.5 Sonnet vision): images are converted to a token count roughly proportional to (W × H) / 750.
  • Google (Gemini): per-image flat-ish pricing; very cheap for 1.5-Flash.

Practical effect: a single 1024×1024 image to GPT-4o costs roughly the same as ~1000 input tokens of text. At ~$2.50/M input tokens (GPT-4o, late 2024), that is ~$0.0025 per image. At ~$3/M (Claude 3.5 Sonnet), similar.

For 10,000 images: ~$25 to ~$30. Trivial. For 10M images: ~$25k–$30k. Now self-hosting starts to make sense.

11.2 Image generation pricing

Wildly variable. Late 2024/2025 ballparks:

  • DALL-E 3 standard: ~$0.04/image (1024×1024). HD: ~$0.08.
  • Stable Image / FLUX.1-pro APIs: ~$0.03–$0.05/image.
  • Midjourney: ~$10–$60/month subscription, fixed quota.
  • Self-hosted SDXL on a rented A100 (~$1–$2/hr): ~1 second per 1024×1024 image at 30 steps → ~3000 images/hr → ~$0.0003–$0.0007/image. Two orders of magnitude cheaper, with engineering overhead.

Crossover: roughly 100k–1M images/month makes self-hosting worth the engineering investment.

11.3 Audio pricing

Cheap. Whisper API at OpenAI: ~$0.006/minute. AssemblyAI/Deepgram comparable, with extras (diarization, sentiment). Self-hosted faster-whisper on consumer GPU: nearly free at scale. TTS: ~$0.015–$0.030/1k characters at ElevenLabs (cheaper tiers exist), down to near-free for self-hosted Piper / Coqui.

11.4 The economic decision rule

Same as text:

  • Low traffic, broad capability needed → API.
  • High traffic, narrow task, latency-sensitive, or privacy-sensitive → self-host.
  • Very high traffic with a stable taxonomy → fine-tune a small specialized model (CLIP-classifier, distilled VLM) and self-host.

The crossover volume for vision in 2026 is similar to text: roughly 1M+ inferences/month before the engineering cost of self-hosting amortizes.


12. Engineering integration-the small details that bite

12.1 SDK ergonomics

Most SDKs accept either base64-encoded images or URLs in the message:

# OpenAI / Anthropic style (sketched, verify against current SDK):
{
  "role": "user",
  "content": [
    {"type": "text", "text": "What is in this image?"},
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "<base64>"}}
  ]
}

URL inputs avoid the bandwidth cost of sending the bytes but require the URL to be reachable from the provider-a non-starter for private images. Base64 works always; expect 33% size overhead.

12.2 Image preprocessing-do it on the client

Before sending:

  • Resize to the model's expected resolution. Most APIs cap at ~2048 px or downscale silently; you pay for the upload bandwidth either way. Resize client-side to the model's known native resolution (e.g., 1024 px longest side for Claude vision).
  • Preserve aspect ratio. Stretching distorts shapes; pad with a neutral color if the model wants square input.
  • Strip EXIF. Privacy and unnecessary bandwidth.
  • Convert format. PNG for screenshots/text, JPEG for photos (smaller, with no visible quality loss at reasonable settings).
  • Re-encode: a freshly-resized JPEG at quality 85 is typically ~80% smaller than the original. A sketch follows below.
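A client-side preprocessing sketch with Pillow (the 1024-px longest-side target is illustrative; match your provider's documented native size):

from io import BytesIO
from PIL import Image, ImageOps

def prepare(path: str, longest_side: int = 1024) -> bytes:
    img = ImageOps.exif_transpose(Image.open(path))   # bake in EXIF rotation first
    img.thumbnail((longest_side, longest_side))       # in-place resize, preserves aspect ratio
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)  # re-encode; EXIF not carried over
    return buf.getvalue()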

For pdf-to-image: 200 DPI for documents with normal text; 300 DPI for tiny text or low-resolution scans; do not exceed 300-diminishing returns and growing token costs.

12.3 The tokens-per-image quirk

Token counts vary with resolution. As of late 2024, GPT-4o tiles a high-detail image into 512×512 squares and charges ~170 tokens per tile + 85 tokens base (the API may also downscale very large images before tiling-verify current docs). A 1024×1024 image: 4 tiles + base = ~765 tokens. A 2048×2048 image, tiled at full size: 16 tiles + base = ~2805 tokens.
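The simplified tiling rule above as code (a sketch of this formula only, not the provider's exact billing logic-verify against current docs):

import math

def gpt4o_image_tokens(w: int, h: int, base: int = 85, per_tile: int = 170) -> int:
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return base + per_tile * tiles

assert gpt4o_image_tokens(1024, 1024) == 765     # 4 tiles + base
assert gpt4o_image_tokens(2048, 2048) == 2805    # 16 tiles + base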

Plan budgets accordingly. A naïve "send the original 4K screenshot" can blow up your monthly bill by 4–10× versus a properly resized 1024×1024 input.

Anthropic publishes a similar formula (verify current docs). Google's Gemini is cheaper per image but also has its own quirks.

12.4 Streaming

  • Output streaming: works for text output regardless of multimodal input. Stream as usual.
  • Input streaming: image and audio are fully consumed before generation starts. There is no notion of "stream-in an image." For audio, however, end-to-end native models (GPT-4o voice) genuinely stream input-but that is a different API surface.
  • Real-time UX with images: optimistically render "Reading the image…" while the first output token comes back. This is purely a UX decision; the latency is real.

12.5 Retries and idempotency

Multimodal requests are large (hundreds of KB to MB). Retries must be careful:

  • Use idempotency keys where the SDK supports them.
  • On 5xx, use exponential backoff. Be aware that the server may have charged you for a partially-processed request.
  • Cache image preprocessing results-recomputing a base64-encoded resized image on every retry is wasteful.

12.6 Observability

Log per-request: input image dimensions, output token counts, latency, model version. Multimodal latency has long tails-the p99 can be 10× the p50 for large images. Without observability you will misdiagnose this as "the model is slow."


13. Open-weights multimodal landscape-the 2025–2026 menu

A snapshot. All weights on Hugging Face. All numbers are characteristic of the model family as of 2024–2025; check leaderboards before committing.

13.1 Vision-Language (image in, text out)

  • Llama 3.2 Vision (Meta, September 2024). 11B and 90B parameters. Cross-attention fusion onto Llama 3.1 text. Good general VLM, strong text-only behavior preserved. License: Llama 3 community license (commercial-friendly with a use-case carve-out).
  • Pixtral 12B (Mistral, September 2024). Native-multimodal, 400M-param vision encoder + 12B language model. Strong document/chart performance for its size. Apache 2.0.
  • Qwen2-VL / Qwen2.5-VL (Alibaba, 2024–2025). 2B / 7B / 72B sizes. Dynamic resolution, native multilingual, strong OCR. Often the open-weights choice for document understanding. Apache 2.0 for some sizes.
  • InternVL2 / InternVL2.5 (Shanghai AI Lab + others, 2024). 1B–78B sizes. Tiling-based high-resolution. Competitive on academic benchmarks.
  • MiniCPM-V (OpenBMB, 2024). Small (~8B), efficient, on-device viable.
  • DeepSeek-VL2 (DeepSeek, late 2024). MoE multimodal-sparse experts, large total parameter count, fewer activated. Hints at the next architectural wave.
  • Molmo (Allen AI, 2024). Strong open VLM with annotation transparency (PixMo dataset, point/region-level labels).

13.2 Image generation

  • Stable Diffusion XL (Stability, 2023) and Stable Diffusion 3 / 3.5 (2024). Latent diffusion → DiT (in SD3+). Quality plateau is real but ecosystem is unmatched (LoRAs, ControlNets, fine-tunes).
  • FLUX.1 (Black Forest Labs, 2024). DiT + flow matching. FLUX.1-dev (open weights, non-commercial) and FLUX.1-schnell (open, Apache 2.0). Strongest open-weights image generator as of late 2024.
  • Sana (NVIDIA, 2024). Linear-attention-based DiT; very fast. Quality cost.
  • Distilled variants (SDXL-Turbo, Hyper-SDXL, FLUX-schnell, LCM-LoRAs): single- to few-step generation for real-time use.

13.3 Speech

  • Whisper-large-v3 (OpenAI, 2023). Still SOTA-ish for open ASR. faster-whisper for production.
  • Parakeet (NVIDIA, 2024). Conformer-based, lower latency, English-strong.
  • Moshi (Kyutai, 2024). End-to-end speech LLM.
  • F5-TTS / XTTS-v2 / Bark / Piper: open-weights TTS at varying quality/speed tradeoffs.

13.4 Choosing-the heuristics

  • Document/chart understanding, narrow task: Qwen2-VL-7B or Qwen2.5-VL.
  • General VLM, drop-in for text Llama: Llama 3.2 Vision (preserves text quality).
  • Smallest viable VLM: MiniCPM-V or Pixtral-12B if you have the VRAM.
  • Image generation for product: FLUX.1-dev or SDXL with LoRAs.
  • Real-time image generation: SDXL-Turbo or FLUX-schnell.
  • ASR: faster-whisper (large-v3).
  • TTS for product: Cartesia API or self-hosted F5-TTS.

When to API vs self-host: the same calculus as text. Above ~1M inferences/month or with privacy/latency requirements, self-host. Below that, an API is cheaper end-to-end once you account for engineering time.


14. Multimodal RAG-an emerging area

Standard RAG retrieves text chunks and feeds them to a text LLM. Multimodal RAG generalizes:

14.1 The recipe

  1. Embed everything into a shared space. Use CLIP (or BGE-M3, Jina-CLIP, ColPali, MMRet) to embed both text passages and images into a single D-dim vector space.
  2. Index in a vector store. HNSW, FAISS, Pinecone, Qdrant-same as text RAG.
  3. At query time: encode the query (text or image or both) and retrieve top-k from the union.
  4. Synthesize with a vision-LLM that can consume both retrieved text and retrieved images in its context.

14.2 Retrieval architectures worth knowing

  • Single-vector CLIP retrieval: one embedding per image / per text chunk. Simple, fast, weak on fine-grained.
  • ColPali / ColQwen (Faysse et al., 2024): treat each PDF page as an image; compute late interaction (one vector per visual patch) and score against query tokens via MaxSim. Skips OCR entirely; outperforms text-RAG on visually-rich documents.
  • Hybrid text + image: for a slide deck, embed each slide as both an image (CLIP) and its OCR'd text (BGE). Retrieve both; pass both to the VLM.
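Late interaction sounds exotic, but the scoring is a few lines of tensor algebra. A sketch of ColBERT-style MaxSim as used by ColPali, assuming per-token query embeddings and per-patch page embeddings are already computed and L2-normalized:

import torch

def maxsim_score(query_tokens, page_patches):
    # query_tokens: [Q, D] (one vector per query token, L2-normalized)
    # page_patches: [P, D] (one vector per visual patch, L2-normalized)
    sim = query_tokens @ page_patches.T          # [Q, P] cosine similarities
    return sim.max(dim=-1).values.sum()          # best patch per token, summed

def rank_pages(query_tokens, pages):
    # pages: list of [P_i, D] tensors, one per PDF page
    scores = torch.stack([maxsim_score(query_tokens, p) for p in pages])
    return scores.argsort(descending=True)       # page indices, best first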

14.3 Worked example-slide-deck search

Use case: "Find the slide that mentions Q3 revenue growth."

Pipeline:

  1. PDF → page images.
  2. For each page: compute CLIP/ColPali image embedding; OCR with a VLM or Tesseract; embed text with BGE.
  3. Index image embeddings in one collection, text embeddings in another.
  4. At query: search both collections; merge by reciprocal-rank fusion (sketched below); take top-5 pages.
  5. Feed page images + retrieved text to GPT-4o / Claude with the user's question.
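Reciprocal-rank fusion (step 4) is small enough to inline. A sketch; k=60 is the conventional constant from the RRF literature:

def rrf_merge(rankings, k=60):
    # rankings: list of ranked lists of page ids (best first), one per index
    scores = {}
    for ranking in rankings:
        for rank, page_id in enumerate(ranking):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. rrf_merge([clip_ranking, bge_ranking])[:5] → top-5 fused pages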

Why this beats text-only RAG on slides: charts and visual layouts carry information that OCR loses. ColPali-style retrieval captures it directly.

14.4 Eval

Same as text RAG (recall@k, MRR, end-to-end answer quality) with the added wrinkle that ground-truth labels for "which page contains the answer" need a human pass-text matching is unreliable for image-anchored content.


15. Practical exercises-work each one

These are not optional. Do them in a notebook before considering this chapter internalized.

Exercise 1-Patch arithmetic

For a ViT with patch size P=14, how many patches does a 384×384 image yield?

n = (384 / 14) × (384 / 14)

384 / 14 is not an integer (≈ 27.43). In practice, ViTs trained at this resolution use a slightly different config: 384/16 → 24×24 = 576 patches, or the image is resized to a multiple of 14 (e.g., 378×378 → 27×27 = 729 patches). The often-cited "729 tokens for 384×384, patch=14" assumes the latter-confirm against your model's preprocessor.

For a clean case: 224 × 224 with P=14 → 16 × 16 = 256 patches. 384 × 384 with P=14 (after resize to 378) → 27 × 27 = 729 patches. 1024 × 1024 with P=14 (after resize to 1022) → 73 × 73 = 5329 patches.

The point: token count grows quadratically with resolution. Doubling resolution quadruples cost. This determines API pricing and on-device feasibility.
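To make the quadratic scaling tangible, a throwaway helper (the round-down-to-multiple resize is an assumption-some preprocessors pad or tile instead; check yours):

def patch_tokens(h, w, p=14):
    # Resize down to the nearest multiple of the patch size, then count patches.
    h, w = (h // p) * p, (w // p) * p
    return (h // p) * (w // p)

print(patch_tokens(224, 224))    # 256
print(patch_tokens(384, 384))    # 729  (after implicit resize to 378×378)
print(patch_tokens(1024, 1024))  # 5329 (after implicit resize to 1022×1022)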

Exercise 2-Implement CLIP's contrastive loss

In ~25 lines of PyTorch:

import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # image_features: [N, D], text_features: [N, D]
    # logit_scale: scalar (= 1/τ); the learned parameter is its log,
    # typically clamped to [0, log(100)] so the scale stays in [1, 100]
    image_features = F.normalize(image_features, dim=-1)
    text_features  = F.normalize(text_features,  dim=-1)

    logits_per_image = logit_scale * image_features @ text_features.T   # [N, N]
    logits_per_text  = logits_per_image.T

    N = image_features.shape[0]
    labels = torch.arange(N, device=image_features.device)

    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text,  labels)
    return (loss_i + loss_t) / 2

Verify: with random, unrelated feature pairs, the loss should sit near log(N) (chance level); with perfectly aligned pairs and a large scale, near 0.

Sanity-check the temperature: CLIP learns the log of the scale, logit_scale_param = nn.Parameter(torch.log(torch.tensor(1 / 0.07))); clamp the parameter to [0, log(100)] after each optimizer step and pass its exp() into the loss.
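A quick harness for that verification, reusing clip_loss from above (sizes are arbitrary):

import math
import torch

N, D = 64, 128
scale = torch.tensor(1 / 0.07)   # exp of the learned log-scale at init, ≈ 14.3

# Random, unrelated pairs: loss should land near log(N) ≈ 4.16
rand_loss = clip_loss(torch.randn(N, D), torch.randn(N, D), scale)

# Perfectly aligned pairs: identical features → loss approaches 0
x = torch.randn(N, D)
aligned_loss = clip_loss(x, x.clone(), scale)

print(f"random: {rand_loss.item():.2f}  chance: {math.log(N):.2f}  "
      f"aligned: {aligned_loss.item():.4f}")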

Exercise 3-Walk through a diffusion sampling step (T=3)

Tiny example, scalar x for clarity. Schedule: β = (0.1, 0.2, 0.3). Then α = (0.9, 0.8, 0.7); ᾱ = (0.9, 0.72, 0.504).

Forward: pick x_0 = 1.0 and sample a fresh ε ~ N(0,1) at each step (values fixed for the example):

  • ε_1 = +0.5: x_1 = √0.9 · 1.0 + √0.1 · 0.5 = 0.949 + 0.158 = 1.107.
  • ε_2 = −0.3: x_2 = √0.72 · 1.0 + √0.28 · (−0.3) = 0.849 − 0.159 = 0.690.
  • ε_3 = +0.4: x_3 = √0.504 · 1.0 + √0.496 · 0.4 = 0.710 + 0.282 = 0.992.

Reverse: a trained model predicts ε_θ(x_t, t). Suppose the model is well-trained and predicts approximately the true ε at each step. Starting from x_3 = 0.992:

Step t=3 → t=2:

x_2 ≈ (1/√α_3) · ( x_3 − (β_3 / √(1 − ᾱ_3)) · ε_θ ) + σ_3 · z
    = (1/√0.7) · ( 0.992 − (0.3 / √0.496) · 0.4 ) + σ_3 · z
    = 1.195 · ( 0.992 − 0.426 · 0.4 ) + σ_3 · z
    = 1.195 · 0.822 + small noise
    ≈ 0.982 + noise

Compare to the true x_2 = 0.690-the model is approximate, especially with only 3 steps. With T=1000 and a properly trained ε_θ, the trajectory tracks much more tightly. The exercise's value is feeling the closed-form forward and the iterative reverse.
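The same walk in NumPy, reproducible end-to-end. The ε values mirror the text, and the "model" is faked as an oracle that returns the true noise-the one assumption a real sampler replaces with ε_θ:

import numpy as np

beta = np.array([0.1, 0.2, 0.3])
alpha = 1 - beta
alpha_bar = np.cumprod(alpha)                  # [0.9, 0.72, 0.504]

# Forward: closed-form q(x_t | x_0) with the epsilons from the text
x0, eps = 1.0, [0.5, -0.3, 0.4]
for t in range(3):
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps[t]
    print(f"x_{t+1} = {xt:.3f}")               # 1.107, 0.690, 0.992

# Reverse: one DDPM step t=3 → t=2; xt still holds x_3 after the loop
t = 2                                           # zero-indexed step 3
eps_theta = eps[t]                              # oracle in place of the network
mean = (xt - beta[t] / np.sqrt(1 - alpha_bar[t]) * eps_theta) / np.sqrt(alpha[t])
print(f"x_2 ≈ {mean:.3f} + noise")              # ≈ 0.982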

Exercise 4-Multimodal RAG over a 200-page PDF

Design:

  • Chunking: per page (1 image + ~500 OCR'd tokens). Don't try to chunk within a page-page boundaries are the natural unit for visually-laid-out content.
  • Embedding: dual-CLIP image embedding and BGE-M3 text embedding of the OCR'd content. Store both.
  • Retrieval: at query time, encode the query as both a text vector (BGE) and a CLIP text vector. Search both indexes; take top-5 from each; deduplicate; rerank with a cross-encoder (or by a small VLM call: "Does this page answer the query? yes/no").
  • Generation: pass the top-3 page images to a VLM (Claude 3.5 Sonnet vision or Qwen2-VL) along with the query. Have the VLM cite the page number explicitly.
  • Eval: build 30 question-answer pairs by hand from the PDF, plus the page number that contains the answer. Measure: page-recall@5, answer-correctness (human review or LLM-as-judge with the image included).

Failure modes to plan for: tables that span pages (handle by retrieving adjacent pages); scanned-with-handwriting pages where OCR fails (CLIP/ColPali still works); duplicated content (deduplicate by perceptual hash on retrieval).
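For the eval step, a minimal page-recall@k harness. gold and retrieve are placeholders for your hand-built labels and whatever retrieval function your pipeline exposes:

def page_recall_at_k(gold, retrieve, k=5):
    # gold: list of (question, answer_page) pairs, labeled by hand
    # retrieve: fn(question) -> ranked list of page numbers (best first)
    hits = sum(page in retrieve(q)[:k] for q, page in gold)
    return hits / len(gold)

# e.g. page_recall_at_k(gold_30, my_rrf_pipeline) == 0.87 means 26/30
# questions had their answer page somewhere in the top-5.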

Exercise 5-Cost estimate, 10k document images

Assume average document images at 1024×1024 resolution, ~1 question per image, and an expected ~200-token output.

Claude 3.5 Sonnet (as of late 2024 pricing; verify):

  • Input: ~1500 tokens per image (vision tokens) + ~50 prompt tokens = 1550 tokens.
  • Output: 200 tokens.
  • Cost per image: 1550 × ($3/M) + 200 × ($15/M) = $0.00465 + $0.003 ≈ $0.0077.
  • 10k images: ~$77. Trivial.

Self-hosted Pixtral-12B on a rented A100:

  • Throughput: ~5–10 images/sec at this resolution (verify on your setup).
  • 10k images at 7/s ≈ 1430 s ≈ 24 min.
  • A100 rental: ~$1.50/hr → ~$0.60 in compute.
  • Engineering time: assume 1 day to set up, debug quantization, build the JSON-output prompt = ~$1000–$2000 in fully-loaded engineer time.

Crossover: at 10k images the API wins decisively once engineering time is counted-~$77 versus ~$1,000–2,000 all-in. At 10M images: API ~$77,000, self-hosted ~$600 in compute plus setup-self-hosting wins by well over 100× on compute, roughly 30× all-in. Crossover is somewhere around 100k–500k images depending on volume stability and engineering rate.

The point of the exercise: do this calculation every time, with current prices, before committing to an architecture.
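The calculation as a reusable function. Every constant below is one of the worked numbers above and goes stale quarterly-treat it as a template, not a pricing oracle:

def crossover(api_cost_per_image=0.0077,      # tokens × current API rates
              gpu_hourly=1.50, imgs_per_sec=7.0,
              setup_cost=1500.0):             # fully-loaded engineering time
    # Self-host: fixed setup + GPU-hours; API: pure per-image spend.
    gpu_per_image = gpu_hourly / (imgs_per_sec * 3600)
    return setup_cost / (api_cost_per_image - gpu_per_image)

print(f"{crossover():,.0f} images")            # ≈ 196,000 with these numbers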

Exercise 6-Diagnose a vision-LLM hallucination

Symptom: the VLM says "the image shows a dog" when the image is a cat. Five plausible root causes:

  1. Prompt anchoring / sycophancy. The user's prompt mentioned a dog ("Is this dog cute?"). The model deferred to the user's framing. Fix: neutral prompts; explicit "first describe the image, then answer."
  2. Resolution loss in preprocessing. The image was downscaled to 224×224 before encoding; a small or distant cat got smeared and was classified as a dog by the vision encoder's prior. Fix: higher resolution (or a model with dynamic resolution).
  3. Adversarial / ambiguous content. Image is a cat in a dog-shaped costume, or a cat-dog chimera, or shot from an angle that obscures distinguishing features. Fix: ensemble (ask multiple times with different prompts) and surface low-confidence to the user.
  4. Domain shift. The model was trained on web images with web-typical labels; if your input is medical, satellite, or microscopic, the vision encoder is out of distribution. Fix: domain-specific fine-tune or retrieval-augmented prompting with reference images.
  5. Vision-LM disconnect. The vision encoder correctly produced "cat-like" features, but the projector / cross-attention failed to surface them, and the LLM defaulted to a high-prior word ("dog" is more common than "cat" in image-caption training data, depending on dataset). Fix: better-aligned model (or fine-tune with hard cat-vs-dog negatives).

A real diagnosis combines several. The investigation playbook: (a) reproduce; (b) try the same prompt against a different VLM-does the failure persist? (c) try a higher resolution-does it resolve? (d) try a more neutral prompt-does it resolve? (e) inspect the image for obvious confounds. By the end of these you will know whether it's a model limitation, a preprocessing bug, or a prompt-design issue.


16. What's next-beyond this chapter

Things that are real, are accelerating, and are not yet stable enough for a deep-dive but worth tracking:

  • Long-context multimodal. Gemini 1.5 demonstrated 1M-token contexts including hours of video. As context windows grow, "RAG vs in-context" rebalances for multimodal too.
  • Action models / GUI agents. Anthropic's "computer use" (October 2024), OpenAI's Operator (early 2025), Google's Project Mariner. Vision-LLMs that take actions on screens. The eval, safety, and reliability problems are open.
  • 3D and embodied multimodal. Robotics foundation models (Pi-Zero, RT-2, Helix). Vision + language + action policies trained jointly. Mostly research today; expect productization 2026–2028.
  • Audio LLMs as first-class. GPT-4o-style native audio is rare in open weights. Watch Moshi, Mini-Omni, and future Llama / Qwen audio releases.
  • Test-time compute for multimodal. o1/o3-style reasoning extended to vision and audio. Early signals (o1-vision, Gemini 2.0 thinking) suggest big gains on multimodal reasoning benchmarks.

The half-life on this chapter is probably 18 months. Re-read in late 2027 and update.


Appendix A-Citation summary

The architectural claims in this chapter rest on these primary sources. Names and years are accurate; full bibliography omitted for brevity.

  • ViT-Dosovitskiy et al., "An Image is Worth 16×16 Words," 2020.
  • CLIP-Radford et al., "Learning Transferable Visual Models from Natural Language Supervision," 2021.
  • DDPM-Ho et al., "Denoising Diffusion Probabilistic Models," 2020.
  • DDIM-Song et al., "Denoising Diffusion Implicit Models," 2020.
  • Latent Diffusion / Stable Diffusion-Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models," 2022.
  • Classifier-Free Guidance-Ho & Salimans, 2022.
  • DiT-Peebles & Xie, "Scalable Diffusion Models with Transformers," 2022.
  • Flow Matching-Lipman et al., 2022; Rectified Flow-Liu et al., 2022.
  • Whisper-Radford et al., "Robust Speech Recognition via Large-Scale Weak Supervision," 2022.
  • Flamingo-Alayrac et al., 2022.
  • LLaVA-Liu et al., "Visual Instruction Tuning," 2023.
  • POPE-Li et al., 2023.
  • MMMU-Yue et al., 2023.
  • ColPali-Faysse et al., 2024.
  • Chameleon-Meta, 2024.
  • DINOv2-Oquab et al., 2023.
  • SigLIP-Zhai et al., 2023.

For every model-capability claim ("X scores Y on benchmark Z"), the canonical move is: check the model card, check the paper, check a recent independent leaderboard (Open LLM Leaderboard, Artificial Analysis, Papers with Code). Numbers shift quarterly; this chapter does not.


Appendix B-A 12-week study path through this chapter

Because this chapter is dense and the exercises are non-trivial, here is a sequenced path through it. Each week is roughly 4–6 hours.

  • Week 1: Sections 0–2 (motivation + ViT). Read the ViT paper. Do Exercise 1.
  • Week 2: Section 3 (CLIP). Read the CLIP paper. Do Exercise 2; run on tiny synthetic data; verify loss converges.
  • Week 3: Section 4 (architecture families). Read the LLaVA paper, the Flamingo paper.
  • Week 4: Section 5 (LLaVA in detail). Spin up a local LLaVA-1.5 or Qwen2-VL-7B with vLLM; run a few queries.
  • Week 5: Section 6 (audio). Read the Whisper paper. Run faster-whisper on 30 minutes of your own audio; compute WER against a transcript.
  • Week 6: Section 7 (diffusion). Read DDPM + Stable Diffusion papers. Do Exercise 3 with NumPy.
  • Week 7: Section 7 continued. Run Stable Diffusion locally; vary CFG and steps; build intuition.
  • Week 8: Sections 8–9 (video + eval). Skim a Sora-class technical report. Build a tiny eval set for your favorite VLM.
  • Week 9: Section 10 (production patterns). Build a document-extraction prototype-PDFs in, structured JSON out, with eval.
  • Week 10: Sections 11–12 (cost + integration). Profile token usage on a real workload; calculate cost; tune preprocessing.
  • Week 11: Section 13 (open-weights menu). Self-host one open VLM end-to-end on a single GPU. Benchmark against the API you've been using.
  • Week 12: Sections 14–15 (multimodal RAG + exercises). Build the 200-page-PDF multimodal RAG of Exercise 4 end-to-end. Evaluate. Write up findings.

By end of week 12 the gap between a text-only AI engineer and a multimodal-fluent AI engineer is closed. Past that, the field will keep moving-but the foundations in this chapter generalize.


End of Deep Dive 11.
