Week 19 - Quantization: INT8, INT4, FP8, AWQ, GPTQ, SmoothQuant
19.1 Conceptual Core
- Quantization reduces precision of weights (and sometimes activations) to shrink memory and increase throughput.
- The categories:
- Weight-only quantization (W8A16, W4A16): weights in INT8 or INT4; activations stay in BF16. Cuts decode HBM traffic by 2-4×, a big win for memory-bound decode.
- Weight + activation quantization (W8A8, W4A8): both are quantized. Tensor cores can be invoked at lower precision; throughput improves, but numerical stability is harder to maintain.
- Per-tensor / per-channel / per-group: granularity of the scale factor. Per-group (e.g., group size 128) is the modern standard: it balances model size and accuracy (see the sketch after this list).
- The methods:
- Post-Training Quantization (PTQ): quantize a trained model in minutes-to-hours. Methods: AWQ, GPTQ, SmoothQuant. The 2026 standard for production.
- Quantization-Aware Training (QAT): train with simulated quantization. More accurate, much more expensive. Used for highest-accuracy scenarios.
- AWQ (Lin et al., MLSys 2024): observes that weight channels fed by activation outliers matter most; scale those channels up (folding a compensating scale into the activations) before quantization to preserve accuracy. The standard for INT4 weight-only.
- GPTQ (Frantar et al., ICLR 2023): optimal-brain-surgeon-style quantization; one layer at a time, calibration-data driven.
- FP8 for inference (Hopper+): native hardware support. E4M3 for weights/activations, E5M2 for gradients (training only). Production-deployed at major labs.
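A minimal NumPy sketch of symmetric per-group weight quantization (INT4 range, one scale per group along the input dimension). The function names and toy shapes are illustrative only, not any library's API; real tools (AutoAWQ, GPTQ) add zero-points, better rounding, and packed storage on top of this.

```python
import numpy as np

def quantize_per_group_int4(w, group_size=128):
    """Symmetric per-group INT4 quantization of a (out_features, in_features) matrix."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    grouped = w.reshape(out_f, in_f // group_size, group_size)
    # One scale per group: map the largest magnitude in the group to the INT4 extreme (7).
    scales = np.abs(grouped).max(axis=-1, keepdims=True) / 7.0
    scales = np.maximum(scales, 1e-8)                       # avoid divide-by-zero
    q = np.clip(np.round(grouped / scales), -8, 7).astype(np.int8)
    return q.reshape(out_f, in_f), scales.squeeze(-1)

def dequantize_per_group_int4(q, scales, group_size=128):
    out_f, in_f = q.shape
    grouped = q.reshape(out_f, in_f // group_size, group_size).astype(np.float32)
    return (grouped * scales[..., None]).reshape(out_f, in_f)

# Quick check: reconstruction error shrinks as groups get smaller.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 1024)).astype(np.float32)
for g in (1024, 128, 32):
    q, s = quantize_per_group_int4(w, group_size=g)
    err = np.abs(w - dequantize_per_group_int4(q, s, group_size=g)).mean()
    print(f"group_size={g:5d}  mean abs error={err:.4f}")
```

Running it shows the mean reconstruction error dropping as the group size shrinks, which is exactly the size/accuracy trade the per-group granularity bullet describes.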
19.2 Mechanical Detail
- Storage format:
- INT8: 1 byte per weight + 1 scale per group.
- INT4 (packed): 2 weights per byte + scale. Need a "dequant" kernel that unpacks.
- Compute:
- W4A16: dequantize to BF16 just before the matmul (the gemv in decode is memory-bound anyway, so the dequant doesn't hurt). The matmul itself runs in BF16 on tensor cores (a packing/dequant sketch follows this list).
- W8A8: matmul runs on INT8 tensor cores (mma.s8). Higher throughput, but requires careful scale handling.
- Library landscape: bitsandbytes (W8/W4 with LoRA), AutoAWQ, AutoGPTQ, llama.cpp's GGUF formats (k-quants), TensorRT-LLM (production NVIDIA path), Marlin (fast W4A16 kernels).
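A sketch of the packed INT4 layout and the unpack-then-dequantize step, in NumPy for clarity. The nibble order and the single scale used here are illustrative (real formats such as GPTQ/AWQ/Marlin each define their own layout and use per-group scales), and a production path fuses this into a CUDA dequant + matmul kernel.

```python
import numpy as np

def pack_int4(q):
    """Pack signed INT4 values (in [-8, 7]) two per byte: low nibble first."""
    assert q.size % 2 == 0
    u = (q.astype(np.int16) & 0xF).astype(np.uint8)    # two's-complement nibbles
    flat = u.reshape(-1, 2)
    return (flat[:, 0] | (flat[:, 1] << 4)).astype(np.uint8)

def unpack_int4(packed, shape):
    """Inverse of pack_int4: recover signed INT4 values."""
    lo = (packed & 0xF).astype(np.int8)
    hi = ((packed >> 4) & 0xF).astype(np.int8)
    q = np.stack([lo, hi], axis=1).reshape(shape)
    return np.where(q > 7, q - 16, q).astype(np.int8)  # restore sign

# W4A16-style flow: store packed INT4, unpack + dequantize right before the matmul.
rng = np.random.default_rng(0)
w = np.clip(np.round(rng.normal(size=(64, 128)) * 2), -8, 7).astype(np.int8)
scale = np.float32(0.05)                               # single scale, for brevity
packed = pack_int4(w)                                  # 64 * 128 / 2 = 4096 bytes
x = rng.normal(size=(128,)).astype(np.float32)
y = (unpack_int4(packed, w.shape).astype(np.float32) * scale) @ x
assert np.allclose(y, (w.astype(np.float32) * scale) @ x)
```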
19.3 Lab - "Quantize and Compare"
On a 7B-13B model:
1. Run baseline BF16 inference. Capture TTFT, TPOT, model size, and throughput.
2. Quantize with AWQ (W4A16). Re-measure. Eval on a small held-out set (e.g., a 200-question MMLU subset, or perplexity on WikiText) for accuracy.
3. Quantize with FP8 (if on Hopper+). Re-measure.
4. Optionally: GPTQ comparison, AWQ INT8 comparison.
5. Build a tradeoff matrix: throughput, memory, perplexity / accuracy.
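A rough timing harness for steps 1-3, assuming a CUDA GPU, Hugging Face transformers (with accelerate installed), and a model that fits in memory. The model name is a placeholder, and TTFT/TPOT are approximated from two generate calls rather than a true streaming measurement.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # placeholder: point at the model under test
PROMPT = "Explain the difference between weight-only and W8A8 quantization."
NEW_TOKENS = 128

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="cuda"
)
inputs = tok(PROMPT, return_tensors="pt").to(model.device)

def timed_generate(max_new_tokens):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - t0

timed_generate(8)                         # warmup
ttft = timed_generate(1)                  # prefill + first token ~= TTFT
total = timed_generate(NEW_TOKENS)
tpot = (total - ttft) / (NEW_TOKENS - 1)  # steady-state time per output token
print(f"TTFT {ttft*1e3:.1f} ms | TPOT {tpot*1e3:.1f} ms | "
      f"peak mem {torch.cuda.max_memory_allocated()/1e9:.1f} GB")
```

For the AWQ run, point MODEL at the quantized checkpoint directory (transformers loads AWQ checkpoints when autoawq is installed) and keep the prompt and NEW_TOKENS fixed so the rows of the tradeoff matrix are comparable.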
19.4 Idiomatic & Diagnostic Drill
- A quantization pipeline must handle calibration: gather activation statistics on representative inputs before choosing scales. Document your calibration set and how you chose it (random docs ≠ production traffic); a hook-based sketch follows.
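A sketch of that statistics pass, assuming a PyTorch model whose forward accepts keyword tensor batches (HF-style). The per-channel max-|activation| recorded here is the raw material for AWQ/SmoothQuant-style scale selection; the helper names are made up for illustration.

```python
import torch
from collections import defaultdict

@torch.no_grad()
def collect_activation_stats(model, calib_loader, device="cuda"):
    """Record per-channel max |activation| at the input of every nn.Linear."""
    stats = defaultdict(lambda: None)
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach()
            # max over batch/sequence dims -> one value per input channel
            ch_max = x.abs().flatten(0, -2).max(dim=0).values.float().cpu()
            prev = stats[name]
            stats[name] = ch_max if prev is None else torch.maximum(prev, ch_max)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval().to(device)
    for batch in calib_loader:              # batches should mirror production traffic
        model(**{k: v.to(device) for k, v in batch.items()})

    for h in hooks:
        h.remove()
    return dict(stats)
```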
19.5 Production Slice
- Quantization in production must be reproducible. The quantized weights are a new artifact that must be versioned, signed, and SBOM'd just like training artifacts. Treat them as such.
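One lightweight way to make the artifact auditable is to emit a manifest next to the quantized weights that pins the base model, the method and its config, and a digest of the calibration set. A sketch with hypothetical function and field names; signing and SBOM integration would sit on top of this.

```python
import datetime
import hashlib
import json
import pathlib

def sha256_file(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_quant_manifest(quant_dir, base_model, method, config, calib_set_digest):
    """Record what is needed to reproduce (and audit) a quantized artifact."""
    quant_dir = pathlib.Path(quant_dir)
    manifest = {
        "base_model": base_model,                # e.g. repo id + revision of the BF16 model
        "method": method,                        # e.g. "awq"
        "config": config,                        # bits, group size, zero-point, ...
        "calibration_set_sha256": calib_set_digest,
        "created_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "files": {p.name: sha256_file(p)
                  for p in sorted(quant_dir.glob("*.safetensors"))},
    }
    (quant_dir / "quant_manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```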