
Week 19 - Quantization: INT8, INT4, FP8, AWQ, GPTQ, SmoothQuant

19.1 Conceptual Core

  • Quantization reduces the precision of weights (and sometimes activations) to shrink memory footprint and increase throughput.
  • The categories:
    • Weight-only quantization (W8A16, W4A16): weights in INT8 or INT4, activations in BF16. Cuts decode HBM traffic by 2-4×, a huge win for memory-bound decode.
    • Weight + activation quantization (W8A8, W4A8): both quantized. Tensor cores can be invoked at lower precision; throughput wins, but stability is harder.
    • Per-tensor / per-channel / per-group: the granularity of the scale factor. Per-group (e.g., group size 128) is the modern standard; it balances size and accuracy.
  • The methods:
    • Post-Training Quantization (PTQ): quantize a trained model in minutes to hours. Methods: AWQ, GPTQ, SmoothQuant. The 2026 standard for production.
    • Quantization-Aware Training (QAT): train with simulated quantization. More accurate but much more expensive; used for the highest-accuracy scenarios.
    • AWQ (Lin et al., MLSys 2024): observes that activation-outlier channels matter more; scale them up before quantization to preserve accuracy. The standard for INT4 weight-only.
    • GPTQ (Frantar et al., ICLR 2023): optimal-brain-surgeon-style quantization, applied one layer at a time and driven by calibration data.
  • FP8 for inference (Hopper+): native hardware support. E4M3 for weights/activations, E5M2 for gradients (training only). Production-deployed at major labs.
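The per-group scheme above can be sketched in a few lines of NumPy. This is a minimal symmetric absmax version, one scale per group of 128 weights; real methods like AWQ and GPTQ choose scales more carefully, but the storage and round-trip mechanics are the same:

```python
import numpy as np

def quantize_int8_per_group(w, group_size=128):
    """Symmetric per-group INT8 quantization of a flat weight vector."""
    groups = w.reshape(-1, group_size)                  # one row per group
    scale = np.abs(groups).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(groups / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    """Recover an approximate FP32 weight vector."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)        # stand-in weight tensor
q, scale = quantize_int8_per_group(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```

Note the storage cost this implies: 1 byte per weight plus one FP32 (or FP16) scale per 128 weights, which is the INT8 line item in 19.2 below.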

19.2 Mechanical Detail

  • Storage format:
    • INT8: 1 byte per weight + 1 scale per group.
    • INT4 (packed): 2 weights per byte + scale. Needs a "dequant" kernel that unpacks.
  • Compute:
    • W4A16: dequantize to BF16 just before the matmul (the GEMV in decode is memory-bound anyway, so dequant doesn't hurt). The matmul itself runs in BF16 on tensor cores.
    • W8A8: the matmul runs on INT8 tensor cores (mma.s8). Higher throughput, but requires careful scale handling.
  • Library landscape: bitsandbytes (W8/W4 with LoRA), AutoAWQ, AutoGPTQ, llama.cpp's GGUF formats (k-quants), TensorRT-LLM (the production NVIDIA path), Marlin (fast W4A16 kernels).
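The "2 weights per byte" packing can be illustrated with a minimal NumPy sketch. Real dequant kernels (e.g., Marlin) do this unpacking on the GPU, fused with the matmul; the nibble order (low nibble first here) is an arbitrary choice for illustration:

```python
import numpy as np

def pack_int4(q):
    """Pack signed INT4 values (-8..7) two per byte, low nibble first."""
    u = (q.astype(np.int16) & 0xF).astype(np.uint8)   # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Unpack bytes back into signed INT4 values."""
    lo = (packed & 0xF).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)                # sign-extend 4-bit values
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

q = np.array([-8, -1, 0, 7], dtype=np.int8)
assert (unpack_int4(pack_int4(q)) == q).all()         # lossless round trip
```

The round trip is lossless; the accuracy cost of INT4 comes entirely from the quantization step that produced `q`, not from the packing.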

19.3 Lab: "Quantize and Compare"

On a 7B-13B model:

  1. Run baseline BF16 inference. Capture TTFT, TPOT, model size, and throughput.
  2. Quantize with AWQ (W4A16). Re-measure. Evaluate on a small held-out set (e.g., a 200-question MMLU subset, or perplexity on WikiText) for accuracy.
  3. Quantize with FP8 (if on Hopper+). Re-measure.
  4. Optionally: compare against GPTQ and AWQ INT8.
  5. Build a tradeoff matrix: throughput, memory, perplexity / accuracy.
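Step 5's tradeoff matrix can be as simple as the sketch below. Every number here is an illustrative placeholder, not a measurement; substitute the values you captured in steps 1-4. Perplexity is recovered as exp of the mean per-token negative log-likelihood:

```python
import math

# Placeholder results keyed by config; fill in your own measurements.
runs = {
    "BF16 baseline": {"size_gb": 14.0, "tok_per_s": 42.0, "nll": 2.31},
    "AWQ W4A16":     {"size_gb": 4.2,  "tok_per_s": 95.0, "nll": 2.35},
    "FP8 W8A8":      {"size_gb": 7.1,  "tok_per_s": 78.0, "nll": 2.32},
}

print(f"{'config':<14} {'GB':>5} {'tok/s':>6} {'ppl':>6}")
for name, r in runs.items():
    ppl = math.exp(r["nll"])        # perplexity = exp(mean token NLL)
    print(f"{name:<14} {r['size_gb']:>5.1f} {r['tok_per_s']:>6.1f} {ppl:>6.2f}")
```

The interesting reading of the matrix is relative, not absolute: how much throughput and memory each scheme buys per point of perplexity given up.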

19.4 Idiomatic & Diagnostic Drill

  • A quantization pipeline must handle calibration: gathering activation statistics on representative inputs. Document your calibration set and how you chose it (random documents ≠ production traffic).
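A minimal calibration sketch, assuming simple per-channel absmax statistics (methods like SmoothQuant and AWQ track per-channel activation magnitudes in a similar spirit, then use them more cleverly). The batch shapes and channel count below are made up for illustration:

```python
import numpy as np

def collect_absmax(batches, n_channels):
    """Running per-channel absolute-max over calibration activations."""
    stats = np.zeros(n_channels, dtype=np.float32)
    for x in batches:                                  # x: (tokens, channels)
        np.maximum(stats, np.abs(x).max(axis=0), out=stats)
    return stats

# Stand-in calibration set: 8 batches of 256 tokens, 64 channels each.
rng = np.random.default_rng(0)
calib = [rng.standard_normal((256, 64)).astype(np.float32) for _ in range(8)]
scales = collect_absmax(calib, 64) / 127.0             # symmetric INT8 scales
```

Swapping the random batches for a few hundred production-like prompts is exactly the documentation step the bullet above asks for.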

19.5 Production Slice

  • Quantization in production must be reproducible. The quantized weights are a new artifact: version, sign, and SBOM them just like training artifacts.
