
Deep Dive 09: Quantization Theory and Practice for LLMs

"FP16 is a courtesy. INT4 is the contract."-Anonymous inference engineer, c. 2024.

This chapter is the self-contained reference for everything an AI systems engineer needs to know about quantization for large-language-model inference (and a sketch of training). It is written so that, after reading it carefully, you should be able to:

  1. Re-derive every algorithm presented (AWQ, GPTQ, SmoothQuant, FP8 scaling) without consulting the original papers.
  2. Estimate the on-device memory footprint of any model under any precision scheme to within ~1%.
  3. Reason about why a given scheme is fast (or not) on a given hardware target.
  4. Design and run a defensible quantization evaluation for a production deployment.

We assume the reader has internalized the previous deep dives, especially:

  • DD 03 (the GPU memory hierarchy and the HBM/L2/SMEM/register pyramid),
  • DD 06 (KV-cache management and the prefill/decode split),
  • DD 07 (attention kernels: FlashAttention and Marlin's structural cousins).

If you have not, the single most important fact to anchor on is this:

Decode is memory-bandwidth-bound. Every output token requires reading every weight from HBM. Halving the bytes per weight roughly doubles tokens-per-second.

Quantization is therefore the highest-leverage inference optimization. Nothing else-not better attention kernels, not speculative decoding, not better schedulers-buys as much throughput per engineering hour as moving from FP16 weights to INT4. This chapter is about how that miracle is implemented without destroying model quality.


Table of Contents

  1. Why quantize at all
  2. Number-format theory: floats, integers, and the bits in between
  3. Quantization fundamentals: affine maps, symmetric vs. asymmetric
  4. Granularity: per-tensor, per-channel, per-group
  5. Round-to-nearest (RTN) and why INT8 just works
  6. Why INT4 RTN fails: outliers and heavy tails
  7. AWQ-Activation-aware Weight Quantization, derived
  8. GPTQ-Hessian-aware column-wise quantization, derived
  9. SmoothQuant-redistributing difficulty for W8A8
  10. Activation quantization: static vs. dynamic
  11. FP8 inference on H100
  12. FP8 training (brief)
  13. On-the-fly dequantization and the Marlin kernel
  14. Mixed-precision inference
  15. Calibration set design
  16. Evaluation discipline
  17. Practical exercises
  18. Cheat sheet and further reading

1. Why quantize at all

1.1 The arithmetic intensity argument

For a transformer decoder generating one token at a time with batch size 1, the compute required is roughly 2 × P FLOPs (one multiply and one add per parameter), where P is the parameter count. The memory traffic required is B × P bytes, where B is bytes per parameter. The arithmetic intensity is therefore:

AI = (2 × P) FLOPs / (B × P) bytes = 2/B FLOPs per byte

Some concrete numbers:

| Precision | B (bytes/param) | AI (FLOP/byte) |
|---|---|---|
| FP32 | 4 | 0.5 |
| BF16/FP16 | 2 | 1.0 |
| FP8 | 1 | 2.0 |
| INT4 | 0.5 | 4.0 |

Compare these to the roofline arithmetic intensity of an H100, which is (989 TFLOPS BF16) / (3.35 TB/s HBM3) ≈ 295 FLOP/byte. Decode at every precision listed above is at least two orders of magnitude below the roofline-squarely memory-bound.

When you are memory-bound, throughput scales as 1/B. Cutting weight bytes in half doubles your tokens-per-second. Cutting them by 4× quadruples it. Quantization is leverage you don't get anywhere else in the stack.
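To make the 1/B scaling concrete, here is a back-of-envelope helper (a sketch; the parameter count, bandwidth, and the 80% achieved-bandwidth figure are assumptions consistent with the numbers above):

def decode_tokens_per_sec(params, bytes_per_param, hbm_bw_bytes_per_s, efficiency=0.8):
    # Upper bound on batch-1 decode throughput: every token reads every weight once.
    bytes_per_token = params * bytes_per_param
    return efficiency * hbm_bw_bytes_per_s / bytes_per_token

# A 70B-parameter model on one H100 (3.35 TB/s HBM3):
for label, b in [("BF16", 2.0), ("INT8", 1.0), ("INT4 + scales", 0.53)]:
    print(label, round(decode_tokens_per_sec(70e9, b, 3.35e12), 1), "tok/s")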

1.2 The capacity argument

A 70B-parameter model needs, in raw weights:

  • FP32: 70 × 10^9 × 4 = 280 GB
  • FP16: 140 GB
  • INT8: 70 GB
  • INT4 (group=128 with FP16 scales): ~36 GB

An H100 SXM has 80 GB of HBM, of which ~10 GB or more must be set aside for KV cache, activations, and the CUDA runtime. INT4 is the only way to fit a 70B model on a single H100. This is not a footnote-it is the dominant operational reason quantization is mandatory in 2024–2026 production deployments.

1.3 The energy argument

For server-class inference, the number that matters is tokens per joule. Memory traffic dominates that as well: an HBM3 read costs roughly 50–100× more energy than an FMA on tensor cores. Quantization improves tokens-per-joule by directly cutting the dominant energy cost term.

1.4 The cost argument

If you serve 1B tokens/day from a fleet of H100s, doubling tokens-per-GPU halves your fleet size and roughly halves your inference COGS. There is no other lever in the stack with this multiplier.


2. Number-format theory

A floating-point number is (-1)^s × m × 2^e with three fields packed into a fixed bit width: a 1-bit sign s, an exponent field, and a mantissa (significand) field. IEEE 754 introduces:

  • An exponent bias so the exponent field is unsigned.
  • Subnormals (denormals) for graceful underflow.
  • Special values ±∞ and NaN.

For all formats below let Eb be exponent bits and Mb be mantissa bits, exponent bias bias = 2^(Eb-1) - 1. A normal number is (-1)^s × (1.m_2) × 2^(E - bias).

2.1 FP32 (binary32)

| Field | Bits |
|---|---|
| Sign | 1 |
| Exponent | 8 |
| Mantissa | 23 |
| Total | 32 |

  • Exponent bias: 127.
  • Smallest positive normal: 2^-126 ≈ 1.18 × 10^-38.
  • Largest finite: (2 - 2^-23) × 2^127 ≈ 3.40 × 10^38.
  • Decimal precision: log10(2^24) ≈ 7.22 digits.

This is the historical default for ML training. It is wasteful for both training and inference because neural networks empirically tolerate aggressive precision loss.

2.2 FP16 (binary16, IEEE half-precision)

| Field | Bits |
|---|---|
| Sign | 1 |
| Exponent | 5 |
| Mantissa | 10 |

  • Bias: 15.
  • Smallest normal: 2^-14 ≈ 6.10 × 10^-5.
  • Largest: (2 - 2^-10) × 2^15 ≈ 65 504.
  • Decimal precision: ~3.3 digits.

The killer problem for FP16 in training is the exponent range: gradients can underflow to zero. Loss-scaling (multiplying loss by 2^k before backprop, dividing gradients by 2^k after) was the workaround until BF16 displaced it.

2.3 BF16 (bfloat16, Google Brain)

| Field | Bits |
|---|---|
| Sign | 1 |
| Exponent | 8 |
| Mantissa | 7 |

  • Bias: 127 (same as FP32).
  • Range matches FP32 (~1.18e-38 to ~3.39e38).
  • Decimal precision: log10(2^8) ≈ 2.4 digits.

BF16 was specifically designed to be a drop-in replacement for FP32 in deep learning: the upper 16 bits of an FP32 number, full-stop. No loss-scaling needed. Almost all modern training (since A100 / TPU v3) uses BF16. It is the de-facto FP16 of 2024+.

2.4 FP8 E4M3

| Field | Bits |
|---|---|
| Sign | 1 |
| Exponent | 4 |
| Mantissa | 3 |

  • Bias: 7 (per the OFP8/NVIDIA spec; some specs use 8).
  • Range: ~2^-9 (smallest subnormal, ~1.95e-3) up to ~448 (the spec replaces Inf with the largest finite value to extend range).
  • Used for forward activations and weights in inference and training.

The mantissa is wider than E5M2 for better precision in the heart of the distribution, at the cost of a smaller exponent range.

2.5 FP8 E5M2

| Field | Bits |
|---|---|
| Sign | 1 |
| Exponent | 5 |
| Mantissa | 2 |

  • Bias: 15 (matches FP16).
  • Range matches FP16 (~6e-5 to ~65 504).
  • Used for gradients in FP8 training.

The wider exponent range trades two mantissa bits for the dynamic range that backprop requires.

2.6 INT8

  • Signed two's-complement 8-bit integer.
  • Range: -128 to 127.
  • No exponent. Dynamic range is fixed at log2(256) ≈ 8 bits.

To represent a real number x you need an external scale (typically FP32 or FP16): x ≈ scale × q.

2.7 INT4

  • Signed two's-complement 4-bit integer.
  • Range: -8 to 7 (or sometimes -7 to 7 with one redundant code).
  • Always packed: two INT4 values per byte. Hardware tensor cores load them packed and unpack on the fly.

INT4 is not natively addressable-you cannot *ptr an INT4. Software must pack/unpack at storage boundaries, and kernels must implement dequant logic in registers or shared memory. This is the central engineering challenge that Marlin (§13) solves.
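To make the packing concrete, here is a minimal CPU-side sketch of the nibble arithmetic in PyTorch (real kernels perform the equivalent unpacking in registers; the helper names are mine):

import torch

def pack_int4(q):
    # q: int8 tensor of 4-bit codes in [-8, 7], even number of elements
    u = (q + 8).to(torch.uint8)                  # shift to unsigned nibbles 0..15
    return u[0::2] | (u[1::2] << 4)              # two codes per byte

def unpack_int4(packed):
    lo = (packed & 0x0F).to(torch.int8) - 8      # low nibble
    hi = (packed >> 4).to(torch.int8) - 8        # high nibble
    return torch.stack([lo, hi], dim=1).flatten()

q = torch.randint(-8, 8, (16,), dtype=torch.int8)
assert torch.equal(unpack_int4(pack_int4(q)), q)   # round-trips exactly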

2.8 Comparison table

| Format | Bits | Sign | Exp | Mantissa | Range (approx) | ~Decimal precision | Typical use |
|---|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | 1.2e-38 .. 3.4e38 | 7.2 | Reference training, master weights |
| BF16 | 16 | 1 | 8 | 7 | matches FP32 | 2.4 | Training, activations |
| FP16 | 16 | 1 | 5 | 10 | 6e-5 .. 6.5e4 | 3.3 | Legacy training, inference |
| FP8 E4M3 | 8 | 1 | 4 | 3 | 2e-3 .. 448 | ~1 | FP8 inference fwd, training fwd |
| FP8 E5M2 | 8 | 1 | 5 | 2 | 6e-5 .. 6.5e4 | <1 | FP8 training gradients |
| INT8 | 8 | 1 | - | - | -128 .. 127 (× scale) | n/a | INT8 PTQ, W8A8 |
| INT4 | 4 | 1 | - | - | -8 .. 7 (× scale) | n/a | W4A16 inference |

The thing to internalize: floats trade range for precision via the exponent. Integers have no exponent, so they need an external scale to be useful. That external scale is the entire conceptual core of integer quantization.


3. Quantization fundamentals

3.1 Affine quantization

The most general scheme:

q     = round(x / scale) + zero_point          # quantize
x_hat = scale × (q - zero_point)               # dequantize

Here:

  • x ∈ ℝ is the original real-valued tensor element.
  • q ∈ ℤ is the integer code.
  • scale ∈ ℝ_{>0} (typically stored FP16 or FP32).
  • zero_point ∈ ℤ (typically the same width as q, so it fits in the same dtype).

The quantization error per element is e = x - x_hat, bounded by |e| ≤ scale / 2 if rounding is correct (round-to-nearest, ties-to-even).

3.2 Symmetric quantization

Set zero_point = 0. Then:

q     = round(x / scale)
x_hat = scale × q

The integer range [Q_min, Q_max] should be symmetric around 0. For INT8 we typically use Q_max = 127 and Q_min = -127 (forfeiting one code at -128 for symmetry; some implementations use the full -128..127). Then:

scale = max(|x|) / Q_max

Symmetric is the right default for weights, which are empirically near-zero-centered after standard initialization and training.

3.3 Asymmetric quantization

Allow nonzero zero_point. The mapping that uses the full integer range is:

scale       = (x_max - x_min) / (Q_max - Q_min)
zero_point  = round(Q_min - x_min / scale)

Then for any input x:

q = clip( round(x / scale) + zero_point , Q_min , Q_max )

Asymmetric is the right default for activations, especially post-ReLU/post-GELU activations that have a hard one-sided floor at 0. Forcing symmetric quantization on a [0, x_max] activation throws away half your codes.
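A minimal sketch of the asymmetric map above, using PyTorch and the INT8 range (per-tensor, for illustration):

import torch

def affine_quantize(x, q_min=-128, q_max=127):
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = torch.round(q_min - x_min / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max)
    return q.to(torch.int8), scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return scale * (q.float() - zero_point)

x = torch.relu(torch.randn(1024))            # one-sided distribution, asymmetric-friendly
q, s, zp = affine_quantize(x)
x_hat = affine_dequantize(q, s, zp)
print((x - x_hat).abs().max().item(), "<=", (s / 2).item())   # |e| <= scale/2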

3.4 Derivation of scale and zero_point (asymmetric)

We want a linear (affine) map from [x_min, x_max] → [Q_min, Q_max]:

q(x) = a × x + b

with constraints q(x_min) = Q_min, q(x_max) = Q_max. Two equations, two unknowns:

a × x_min + b = Q_min
a × x_max + b = Q_max

Subtract:

a × (x_max - x_min) = Q_max - Q_min
a = (Q_max - Q_min) / (x_max - x_min) = 1 / scale

so scale = (x_max - x_min) / (Q_max - Q_min). Substitute back:

b = Q_min - a × x_min = Q_min - x_min / scale

i.e. zero_point = round(Q_min - x_min / scale).

3.5 Symmetric as a special case

Set x_min = -x_max (after taking x_max ← max(|x|)). Then Q_min = -Q_max gives b = 0, i.e. zero_point vanishes, and scale = 2 × x_max / (2 × Q_max) = x_max / Q_max = max(|x|) / Q_max. Consistent.

3.6 The max-abs-vs-MSE choice

When you compute max(|x|) you are picking an L_inf-optimal scale: nothing clips, but one outlier sets the scale for every other element. You can also pick a scale that minimizes MSE under a Gaussian/Laplacian assumption-this is the basis of percentile clipping (e.g., set scale based on the 99.9th percentile of |x|, clipping outliers). For weights this rarely matters; for activations it can matter a lot, and is one knob that AWQ/SmoothQuant indirectly tune.
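A sketch of the two scale choices side by side (absmax vs. a percentile clip; the 99.9th-percentile figure mirrors the example above):

import torch

def absmax_scale(x, q_max=127):
    # L_inf-optimal: nothing clips, but one outlier sets the scale for every element
    return x.abs().max() / q_max

def percentile_scale(x, q_max=127, pct=0.999):
    # Clip the top 0.1% of |x|: outliers saturate, the bulk gets finer resolution
    return torch.quantile(x.abs().flatten().float(), pct) / q_max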


4. Granularity

A single scale per tensor is the cheapest. But it is also the most error-prone, because a single outlier element forces the scale large, wasting precision on the rest.

4.1 Per-tensor

One scale (and optionally one zero_point) for the entire tensor. Storage cost is negligible. Quality is poor for INT4 weights because the dynamic range of weights varies wildly across rows of W.

4.2 Per-channel (per-row, per-output-channel)

For a weight matrix W ∈ ℝ^{out × in} representing y = W x, each output channel (each row of W) gets its own scale. This is the standard for weight quantization because:

  1. Each output is a linear combination of all inputs; the scale of row i only affects output i.
  2. There is no cross-channel arithmetic that would cause scale mismatches in the matmul itself.
  3. Dequantization at output time is y_i ≈ scale_i × (W_q · x)_i, a cheap final multiply per output element.

For activations, per-channel-per-token is also possible but expensive to apply in the matmul.

4.3 Per-group (block-wise)

A compromise: split each row of W into contiguous groups of size G (typically G = 128 or 64). Each group has its own scale and (optionally) zero_point.

Why 128? It matches the K-dimension tile size of standard tensor-core GEMMs. A column-major dequantization can dequantize one 128-wide tile, multiply, accumulate, then move to the next tile-the scale is constant within the inner loop.

Per-group is the dominant scheme for INT4 weight quantization in 2024+. It is what AWQ, GPTQ, and Marlin all target.

4.4 Effective bits for INT4 group=128

Each element costs 4 bits. Each group of 128 elements has one FP16 scale = 16 bits (and possibly one FP16 or INT4 zero_point-let's count both).

bits_per_element = 4 + 16/128 + 16/128
                 = 4 + 0.125 + 0.125
                 = 4.25 bits/element     (with FP16 zp)

# or, symmetric (no zp):
                 = 4 + 16/128
                 = 4.125 bits/element

This is why INT4 group=128 is sometimes quoted as "~4.13 bits/element". The exact number depends on whether scales are FP16/BF16/FP32 and whether zero_points are FP16 or packed integers. In practice, 4.25 bits/element is a safe planning number.

4.5 Granularity comparison

| Granularity | Quality | Storage cost | Kernel cost | Typical use |
|---|---|---|---|---|
| Per-tensor | Worst | Negligible | Cheapest | INT8 W8A8 (with smoothing) |
| Per-channel (row) | Good | 1 scale per row | 1 mul per output | INT8 weights |
| Per-group | Best | 1 scale per G elems | 1 mul per group inside the inner loop | INT4 weights |

5. Round-to-nearest (RTN)

The simplest possible PTQ algorithm:

import torch

def rtn_quantize_per_group(W, group_size=128, bits=4):
    # W: [out_features, in_features]; in_features must be divisible by group_size
    Q_max = 2 ** (bits - 1) - 1                      # 7 for INT4
    W_q = torch.empty_like(W, dtype=torch.int8)      # int8 container for the 4-bit codes
    scales = torch.empty(W.shape[0], W.shape[1] // group_size, dtype=W.dtype)
    for i in range(W.shape[0]):                      # rows (output channels)
        for g in range(W.shape[1] // group_size):
            block = W[i, g * group_size:(g + 1) * group_size]
            s = block.abs().max() / Q_max            # symmetric per-group scale
            scales[i, g] = s
            W_q[i, g * group_size:(g + 1) * group_size] = (
                torch.round(block / s).clamp(-Q_max, Q_max).to(torch.int8))
    return W_q, scales

For INT8 per-channel symmetric RTN on weights, this is essentially good enough for most dense LLMs: degradation is typically <0.1 PPL on well-calibrated benchmarks for 7B+ models. The reason is that the dynamic range of any single output channel of a trained W_proj rarely exceeds ~2^7, so 8 bits + a per-channel scale captures it.

For INT4 per-group RTN, the story is more painful. We will see why next.


6. Why INT4 RTN fails

6.1 The error model

Per-element rounding error is uniform on [-scale/2, scale/2], so its variance is scale^2 / 12. The matmul output is a sum of such errors:

y_i = Σ_j W_ij × x_j ≈ Σ_j (W_ij + e_ij) × x_j
e_y = Σ_j e_ij × x_j

If e_ij are independent zero-mean with variance σ_w^2 = scale^2 / 12, then:

Var(e_y) = Σ_j σ_w^2 × x_j^2 = σ_w^2 × ||x||^2

So the output error scales with the squared norm of the activation. A single outlier x_j with large magnitude dominates ||x||^2, and therefore dominates the output error of every output channel.
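A quick Monte-Carlo check of this error model (a sketch; the single 100× outlier channel is an assumption mirroring §6.2):

import torch

torch.manual_seed(0)
n, trials, scale = 4096, 2000, 0.01
x = torch.randn(n)
x[0] *= 100.0                                    # one outlier channel, 100x the typical magnitude

# Per-weight rounding error ~ Uniform(-scale/2, scale/2), independent across weights
e = (torch.rand(trials, n) - 0.5) * scale
err_out = e @ x                                  # output error of one output channel, per trial

predicted = (scale ** 2 / 12) * (x @ x)          # sigma_w^2 * ||x||^2
print("empirical var:", err_out.var().item(), "predicted:", predicted.item())
print("share of ||x||^2 from the outlier:", (x[0] ** 2 / (x @ x)).item())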

6.2 Heavy-tailed activations in LLMs

It is an empirical, well-replicated fact that LLM activations-specifically the inputs to the down-projection of MLP blocks and the inputs to attention output projections-have heavy-tailed per-channel distributions. A handful of channels (sometimes called systematic outliers or emergent features) carry magnitudes 10× to 100× larger than the median channel. These are not bugs; they appear during training and are load-bearing for the network's function.

6.3 Why outliers break INT4 weight quantization

Even though we are quantizing weights, the quality metric we ultimately care about is output error. By the equation above, the impact of weight error on output is multiplied by the corresponding input (activation) channel. If weight column j is multiplied by an input channel with 100× the typical magnitude, then any error in that column is amplified 100× at the output.

INT4 weight RTN treats all columns equally-it applies the same per-row, per-group scale logic regardless of which input channel a weight column will be multiplied by. The columns paired with outlier activation channels get just as much rounding noise as the rest, and that noise blows up into a 100×-larger output error.

Two responses are possible:

  1. AWQ: protect the weights paired with outlier activation channels by giving them more precision (effectively scaling them up before quantization, then compensating).
  2. GPTQ: don't try to protect anything-instead, after quantizing each weight, update the remaining weights to compensate for the error you just introduced.

These are the two great pillars of modern weight-only INT4 PTQ.

A third response, addressing W8A8 (where activations are also quantized), is SmoothQuant: shift the magnitude out of the activations and into the weights, smoothing the activation distribution to make it INT8-friendly.


7. AWQ

Lin, Tang, Tang, Yang, Xiao, Dang, Han, "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration", MLSys 2024.

7.1 Insight

AWQ rests on three observations:

  1. Not all weights matter equally. Profiling a calibrated network shows that ~1% of weight channels carry most of the importance, measured by the magnitude of activations they multiply.
  2. Keeping just 1% of channels in FP16 (mixed-precision) recovers most of the accuracy of full FP16. This is the empirical proof that the rest of the channels are quantization-tolerant.
  3. Mixed-precision is a kernel headache. You don't actually want to ship an INT4-mostly-with-1%-FP16 weight matrix because the GEMM kernel would have to handle two layouts.

AWQ's innovation: instead of keeping salient channels in FP16, scale them up before quantization so they get more INT4 precision, and scale the matching activation channels down to compensate. The matmul output is mathematically unchanged, but the salient weights now occupy a more quantization-friendly part of the INT4 grid.

7.2 The math

For a linear layer with output y = W x where W ∈ ℝ^{m × n} and x ∈ ℝ^n, introduce a per-input-channel diagonal scaling matrix S = diag(s) with s ∈ ℝ^n_{>0}:

y = W x
  = W (S S^{-1}) x
  = (W S) (S^{-1} x)
  = W' x'

where W' = W S (each column of W is scaled by the corresponding s_j) and x' = S^{-1} x (each input element is divided by s_j). The product is exactly unchanged; we have only shifted magnitude between the two operands.

Now apply INT4 quantization to W' instead of W:

W_q   = Q(W S)
y_hat = dequant(W_q) × (S^{-1} x)
      ≈ W' × x'
      = W x

The key question: how should we choose s?

7.3 Choosing s: the AWQ heuristic

For weight column j, its contribution to the output is (column_j of W) × x_j. If x_j is large in magnitude (a salient activation channel), then errors in column_j of W get amplified at the output.

If we set s_j to be large for salient channels, then W'[:, j] = W[:, j] × s_j is also large. The per-row, per-group scale used by INT4 quantization is set by the largest |W'[i, j']| over the columns j' in the group (for each row i). By inflating salient columns, we ensure that even the small entries of W in salient columns are quantized at a finer absolute resolution (relative to their original magnitude).

Why does this work? Imagine a group of 128 weights where one column is salient (paired with a 100× outlier activation) and the rest are normal. Without AWQ, the scale of this group is set by the largest absolute weight in it, which might be a non-salient one-and our salient column gets only INT4 resolution for what really should be INT5 or INT6 precision. With AWQ, the salient column has been pre-scaled by, say, 2× or 4×, so its weights now dominate the group's scale-setting max|W'|. After dividing the dequantized values back by s_j, the salient column's effective step size is 2–4× finer-roughly one to two extra bits of effective precision where they matter most.

The cost is borne by the activation side, S^{-1} x-but the activation multiplier compensates exactly, as long as it is applied in higher precision. AWQ keeps activations in FP16, so the compensation is essentially free.

7.4 The AWQ algorithm, formal

Input:  W ∈ ℝ^{m × n}, calibration activations X ∈ ℝ^{n × N}
        (collected across N tokens from a small calibration set)
Output: W_q (INT4 group=128), per-group scales

Step 1. Compute per-input-channel activation magnitude:
        a_j = (1/N) Σ_t |X[j, t]|         for j = 1..n
        (mean absolute value per channel, optionally restricted to top-k tokens)

Step 2. Choose per-channel scaling vector s:
        s_j = a_j^α
        where α ∈ [0, 1] is a hyperparameter (typical: 0.5–0.7)

Step 3. Optionally normalize s so that geometric mean is 1:
        s_j ← s_j / geomean(s)
        (purely numerical hygiene; the math is invariant)

Step 4. Form W' = W · diag(s)

Step 5. Quantize W' with RTN per-group, group_size = 128:
        compute (W'_q, scales_per_group)

Step 6. Store W'_q, scales, and s.
        At inference: y = (W'_q dequantized) × (x / s)
        (the divide-by-s is fused into the previous layer's output, so it is free at runtime)

Step 7. (Optional) Search α: for each candidate α ∈ {0.0, 0.1, ..., 1.0},
        run forward pass on calibration set, compute output MSE, pick the α with
        minimum MSE. This is a 1-D grid search per layer.

The grid search in Step 7 is what makes AWQ "activation-aware" rather than just "activation-magnitude-scaled". It picks the α that empirically minimizes per-layer reconstruction error on real data.
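A compact sketch of Steps 1–6 (PyTorch; rtn_quantize_per_group is the RTN routine from §5, and the Step-7 α grid search is omitted for brevity):

import torch

def awq_quantize(W, X, alpha=0.5, group_size=128, bits=4):
    # W: [out, in] weight matrix; X: [in, N] calibration activations
    a = X.abs().mean(dim=1)                            # Step 1: per-input-channel mean |activation|
    s = a.clamp(min=1e-8) ** alpha                     # Step 2: scaling vector s = a^alpha
    s = s / torch.exp(torch.log(s).mean())             # Step 3: normalize to geometric mean 1
    W_scaled = W * s.unsqueeze(0)                      # Step 4: W' = W diag(s)
    W_q, scales = rtn_quantize_per_group(W_scaled, group_size, bits)   # Step 5: RTN on W'
    return W_q, scales, s                              # Step 6: fold 1/s into the previous layer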

7.5 Worked tiny example

Let W ∈ ℝ^{2 × 4} and consider one row of the weight matrix:

W[0, :] = [ 0.10,  0.05,  0.02,  0.08 ]

and a calibration mean-abs activation vector:

a = [ 1.0,  1.0,  10.0,  1.0 ]

Channel 2 is a 10× outlier.

Without AWQ, RTN INT4 with Q_max = 7 and one group covering all 4 elements:

max|W[0,:]| = 0.10
scale = 0.10 / 7 ≈ 0.01429
W_q = round(W / scale) = round([7.0, 3.5, 1.4, 5.6]) = [7, 4, 1, 6]    # ties-to-even may differ
W_dequant = scale × W_q = [0.1000, 0.0571, 0.0143, 0.0857]
err = W - W_dequant ≈ [0.000, -0.007, +0.006, -0.006]

The error in channel 2 is +0.006. Its contribution to the output error is:

err_y_from_col2 = err_col2 × a_col2 = 0.006 × 10.0 = 0.060

versus columns 0,1,3 each contributing roughly 0.007 × 1.0 = 0.007. Channel 2 dominates output error by ~10×.

With AWQ, choose α = 0.5:

s = a^0.5 = [1.0, 1.0, 3.162, 1.0]                              # before normalization
geomean = (1 × 1 × 3.162 × 1)^0.25 = 3.162^0.25 ≈ 1.333
s ← s / 1.333 = [0.750, 0.750, 2.372, 0.750]                    # after normalization
W' = W × s = [0.075, 0.0375, 0.0474, 0.060]                     # channel 2 is now near the top
max|W'[0,:]| = 0.075
scale' = 0.075 / 7 ≈ 0.01071
W'_q = round(W'/scale') = round([7.0, 3.5, 4.43, 5.6]) ≈ [7, 4, 4, 6]
W'_dequant = scale' × W'_q = [0.0750, 0.0429, 0.0429, 0.0643]
err' = W' - W'_dequant = [0, -0.0054, +0.0045, -0.0043]
err in original W space: err_W = err' / s = [0, -0.0072, +0.0019, -0.0057]
err_y_from_col2 = err_W[2] × a[2] = 0.0019 × 10.0 = 0.019

Output error from the outlier column dropped from 0.060 to 0.019-roughly 3×. Errors in non-outlier channels grew slightly (because we shrank their effective precision), but they were 10× smaller to begin with, so total output MSE drops substantially.

The general principle: AWQ trades a little precision in non-outlier columns for a lot of precision where it counts.

7.6 Practical notes on AWQ

  • The scales s are absorbed into the previous layer's output. For a transformer block, the per-channel scaling of the MLP down-projection's input is folded into the up-projection's output. There is no runtime divide.
  • AWQ is weight-only-activations stay BF16. So there's no activation quantization error, only weight quantization error.
  • Typical degradation: less than 0.5 PPL on standard perplexity benchmarks for 7B+ dense models. Smaller models (1-3B) are more sensitive and can lose 1-2 PPL.
  • The AWQ kernel + Marlin (§13) is the highest-throughput W4A16 kernel as of writing.

8. GPTQ

Frantar, Ashkboos, Hoefler, Alistarh, "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", ICLR 2023.

8.1 Insight

Where AWQ asks "which weights should I protect?", GPTQ asks "after I round one weight, how should I update the rest to undo the damage?". It descends from Hassibi & Stork's Optimal Brain Surgeon (1993) for neural-network pruning.

8.2 The Optimal Brain Surgeon background

Suppose you have a trained network with loss L(w) minimized at w*. You want to perturb w (e.g., set one weight to a specific quantized value) while minimizing the increase in L. Locally, expand L to second order around w*:

L(w* + δ) ≈ L(w*) + g^T δ + (1/2) δ^T H δ

At the minimum g = 0, so δL ≈ (1/2) δ^T H δ.

If we constrain the perturbation to satisfy e_q^T δ = -w*_q (i.e., we are forcing the q-th weight to become 0, equivalently δ_q = -w*_q), then the optimal δ minimizing the quadratic subject to this linear constraint is, by Lagrange multipliers:

δ* = - (w*_q / [H^{-1}]_{qq}) × H^{-1} e_q

Equivalently, after we set weight q to its target value w_q^new, every other weight w_i should be updated by:

δw_i = -(w_q* - w_q^new) × [H^{-1}]_{iq} / [H^{-1}]_{qq}

This is the OBS update rule. The increase in loss it causes is:

δL = (1/2) × (w_q* - w_q^new)^2 / [H^{-1}]_{qq}

Notice that this δL is the minimum possible loss increase given that you're forced to change w_q to w_q^new.

8.3 Layer-wise reformulation for quantization

GPTQ does not run on the global loss L. Instead, it considers each linear layer in isolation and minimizes the layer-wise reconstruction error:

E(W_q) = ‖W X - W_q X‖_F^2

where X ∈ ℝ^{n × N} is a batch of N calibration activations going into this layer. Expanding:

E(W_q) = trace[(W - W_q) X X^T (W - W_q)^T]
       = Σ_i (Δw_i)^T (X X^T) (Δw_i)

where Δw_i = (W - W_q)[i, :] is the error in row i, and:

H = 2 × X X^T   ∈ ℝ^{n × n}

is the Hessian of the layer-wise reconstruction loss with respect to a single row of W. The factor of 2 from differentiating the squared norm is conventional. Crucially:

  • H only depends on the input activations X, not the weights.
  • H is the same for every row of W. So we precompute it once per layer.
  • The problem decouples by row: each row of W is quantized independently.

8.4 The greedy column-by-column algorithm

For one row w ∈ ℝ^n of W, we want to choose integer q ∈ ℤ^n minimizing (w - dequant(q))^T H (w - dequant(q)). This is an integer-quadratic-programming problem (NP-hard in general).

GPTQ's approximation: quantize one column at a time, in order, and compensate the rest using OBS.

Pseudocode for a single row, ignoring grouping for clarity:

w_q = zeros(n)              # quantized weights (integers)
w_remaining = w.copy()      # current "live" weight vector
H_inv = inverse(H + λI)     # damped inverse for stability

for j in 0..n-1:
    # Step 1: quantize column j
    q_j = round(w_remaining[j] / scale[j])         # using whatever scale rule
    w_q_value = scale[j] × q_j                     # the dequantized value we'll use

    # Step 2: error introduced
    err = w_remaining[j] - w_q_value

    # Step 3: OBS update-push the error into columns j+1..n-1
    for i in j+1..n-1:
        w_remaining[i] -= err × H_inv[j, i] / H_inv[j, j]

    # Record q
    w_q[j] = q_j

After all n columns are processed, w_q is the final quantized row. The quantization scale scale[j] can be per-column (rare), per-group, or per-channel.

8.5 Why the greedy scheme is good

At each step we are solving the optimal-update problem for the column we just quantized, given that all already-quantized columns are frozen. We do not re-update earlier columns-that would un-quantize them. The remaining error after the loop is bounded by the sum over all greedy steps of the residual that can't be absorbed into later columns; in practice this is small for transformer weights.

The greedy choice of column order matters less than you'd think (left-to-right is fine), but a more robust variant called act-order GPTQ sorts columns by descending diagonal of H (i.e., by activation magnitude) so that the high-impact columns are quantized first and have the most "downstream slack" to absorb errors.

8.6 Cholesky-based efficient implementation

Computing and storing H^{-1} ∈ ℝ^{n × n} is O(n^2) storage and O(n^3) for the inverse. For large hidden dimensions (e.g., n = 8192, the hidden size of Llama-2-70B) that's 256 MB just for the inverse and a few seconds for the inversion-fine for an offline calibration but not free.

The trick: we only ever access the upper-triangular part of H^{-1} (we only update columns j+1..n-1 from column j). The Cholesky factorization of a positive-definite matrix H = L L^T lets us compute H^{-1} = L^{-T} L^{-1} cheaply, and crucially:

  • The inverse L^{-1} of the lower-triangular Cholesky factor is itself lower-triangular.
  • The rows of L^{-T} we need are obtained sequentially.

GPTQ's trick is to compute a Cholesky factorization of H^{-1} once, then walk through its upper triangle column by column; the inner OBS update reads the precomputed Cholesky rows directly. This avoids the repeated explicit updates of H^{-1} that the original OBS formulation requires as weights are eliminated, replacing them with a single factorization per layer and sequential, cache-friendly row access-and it is markedly more numerically stable.

8.7 The block (lazy-batch) trick

A further optimization: process columns in blocks of B (typically B = 128) instead of one-at-a-time. Within a block, do full updates. Between blocks, accumulate the lazy update for distant columns and apply it once when we move to the next block. This gives an order-of-magnitude wall-clock speedup because the compensations within a block fit in L1/L2 cache.

The block size B = 128 also matches the per-group scale boundary, so a single block uses one scale value (inside the block) and the algorithm naturally computes scales group-by-group.

8.8 GPTQ pseudocode, full

def gptq_layer(W, X, group_size=128, bits=4, percdamp=0.01, B=128):
    # W: [out_features, in_features]
    # X: [in_features, N]   (calibration activations)
    n = W.shape[1]
    Q_max = 2 ** (bits - 1) - 1                          # 7 for INT4
    H = 2.0 * X @ X.T                                    # [n, n]

    # Damping for numerical stability
    diag_mean = trace(H) / n
    H += percdamp * diag_mean * eye(n)

    # Upper Cholesky factor of H^{-1}; its rows drive the OBS updates
    H_inv = cholesky_inverse(H)
    L = cholesky(H_inv, upper=True)                      # H^{-1} = L^T L  (some conventions vary)

    Q = zeros_like(W, dtype=int)
    scales = zeros((W.shape[0], n // group_size))

    for blk in range(0, n, B):                           # B = 128; assumes group_size <= B, both aligned
        block_end = min(blk + B, n)
        Wblk = W[:, blk:block_end].clone()               # live copy, updated in place below
        Lblk = L[blk:block_end, blk:block_end]           # B × B
        Err = zeros((W.shape[0], block_end - blk))       # scaled errors, kept for the lazy update

        for j in range(block_end - blk):
            col = blk + j
            # 1. Determine the scale for this group at each group boundary
            if col % group_size == 0:
                gend = min(j + group_size, block_end - blk)
                s = Wblk[:, j:gend].abs().max(dim=1).values / Q_max   # per-row scale for the group
                scales[:, col // group_size] = s

            # 2. Quantize column j of the block
            w_col = Wblk[:, j]
            q = round(w_col / s).clamp(-Q_max, Q_max)
            w_q = s * q
            err = (w_col - w_q) / Lblk[j, j]             # error scaled by the Cholesky diagonal
            Q[:, col] = q
            Err[:, j] = err

            # 3. OBS update within the block
            Wblk[:, j+1:] -= err.unsqueeze(1) * Lblk[j, j+1:].unsqueeze(0)

        # 4. Lazy update: push the block's accumulated (scaled) errors onto all columns to the right
        W[:, block_end:] -= Err @ L[blk:block_end, block_end:]

    return Q, scales

The percdamp (typically 0.01) adds a small multiple of the identity to H before inversion, guaranteeing positive-definiteness. Without damping, H can be near-singular when activations have small variance in some channels.

8.9 GPTQ vs. AWQ, head-to-head

| Aspect | AWQ | GPTQ |
|---|---|---|
| Conceptual basis | Activation-aware scaling | Hessian-aware error compensation |
| Calibration cost | Cheap (a few hundred forwards) | More expensive (Cholesky per layer) |
| Output kernel | Standard W4A16 + dequant + scale | Standard W4A16 + dequant |
| Reordering required | No | Optional (act-order) |
| Quality on 7B-70B dense | Excellent | Excellent |
| Quality on small (<3B) | Slightly better in practice | Slightly worse |
| Production dominant? | Yes (Marlin path is fastest) | Yes (long-standing default) |

Both methods produce INT4 group=128 weight matrices that are compatible with the same Marlin kernel-only the calibration procedure differs. In production you typically try both, pick the one with lower perplexity / better downstream eval.


9. SmoothQuant

Xiao, Lin, Seznec, Demouth, Han, "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models", ICML 2023.

9.1 Why W8A8 is hard

W4A16 keeps activations in BF16, so there is no activation quantization error. W8A8 quantizes both weights and activations, enabling INT8 tensor-core matmul (which is twice the throughput of BF16 on most modern GPUs). But:

  • Weights are well-behaved → INT8 is easy.
  • Activations have heavy-tailed per-channel distributions → INT8 is hard because per-tensor scale is dragged up by outlier channels, and per-channel-per-token scaling is runtime-expensive.

The asymmetry: weight outliers are static (you can per-channel calibrate them once), but activation outliers are dynamic per token and per channel.

9.2 SmoothQuant insight

Use the same W = (W S)(S^{-1}) identity as AWQ, but with a different aim: redistribute magnitude from the activations to the weights such that both become INT8-friendly.

y = W x = (W diag(s)) (diag(s)^{-1} x) = W' x'

If s_j is large for outlier activation channels, then x'_j = x_j / s_j is small-the activation outliers are dampened. The price is that W'_{:, j} = W_{:, j} × s_j is larger-but weights had headroom to absorb that.

9.3 The migration strength α

How much magnitude to migrate? The choice is a hyperparameter α ∈ [0, 1] controlling how aggressively activation outliers are dampened:

s_j = max|x_j|^α / max|w_:, j|^(1-α)

The intuition:

  • α = 0 → s_j = 1 / max|w_:,j| → all magnitude pushed into the activations (bad).
  • α = 1 → s_j = max|x_j| → all magnitude pushed out of the activations (also bad-the weights now have the outliers).
  • α ≈ 0.5 → balanced.

In practice α = 0.5 is a good default; α = 0.85 has been reported for some Llama-class architectures whose activation outliers are particularly severe.

9.4 Where SmoothQuant is applied

The transformation is applied offline before INT8 calibration, at three points per transformer block:

  1. The input to the QKV projection (smoothing the LayerNorm output).
  2. The input to the MLP up-projection (smoothing the LayerNorm output).
  3. The input to the attention output projection-sometimes, depending on the architecture.

Importantly, the s scaling is fused into the previous layer's parameters:

  • For Pre-LN architectures, fold s into the LayerNorm scale (γ) before the attention/MLP block.
  • For other architectures, fold into the previous linear layer's weight columns.

Either way, no runtime compute is added-the smoothing is purely a calibration-time rewrite of the network parameters.

9.5 SmoothQuant pseudocode

def smoothquant_layer(prev_norm, layer_input_W, X_calib, alpha=0.5):
    # X_calib: [n, N]  calibration activations flowing into this layer
    # layer_input_W: weight matrix [out, n] of the linear immediately following prev_norm
    # prev_norm: per-channel γ ∈ ℝ^n of the LayerNorm preceding the linear

    a = X_calib.abs().max(dim=1).values                 # [n] per-channel activation max
    w_max = layer_input_W.abs().max(dim=0).values       # [n] per-input-channel weight max
    s = (a ** alpha) / (w_max ** (1 - alpha))           # [n] migration scales

    # Fold into LayerNorm:
    prev_norm.γ /= s
    # Equivalent: divide the output of LayerNorm by s, which scales down the input to the linear.

    # Fold into the linear's input columns:
    layer_input_W *= s                                  # broadcast over the output dim

9.6 W8A8 result

With SmoothQuant, a transformer can typically be quantized to INT8 weights and INT8 activations with <1 PPL degradation, enabling INT8 tensor-core throughput (2× BF16 on Hopper, more on Ada). It's a different point in the design space from AWQ/GPTQ-those target W4A16 (weight-only INT4); SmoothQuant targets W8A8 (both INT8).

In production, modern stacks often combine these:

  • Use SmoothQuant-style activation smoothing.
  • Apply GPTQ or AWQ for weight quantization to INT4.
  • Keep activations in BF16 (W4A16 path)-the smoothing helps even when activations stay in BF16, by making the post-smoothing weight distribution easier to quantize.

This combination is what tools like AutoAWQ and llm-compressor offer out of the box.


10. Activation quantization

If you are running W8A8 or any scheme that quantizes activations, you must decide when to compute the activation scale.

10.1 Static activation quantization

Calibrate once over a representative dataset; freeze a per-tensor (or per-channel) scale and zero_point; reuse them at runtime.

  • Pros: zero runtime overhead. The scale and zero_point are baked into the kernel.
  • Cons: bad accuracy when the runtime activation distribution differs from calibration. Especially bad with long context, long-tailed inputs, or out-of-distribution prompts.

10.2 Dynamic activation quantization

At each forward pass (each token, each layer), compute the scale on the fly:

scale_x = max(|x|) / Q_max

This is a per-token reduction over the channel dimension. On a GPU, it's a reduce_max over n elements, fused into the kernel.

  • Pros: handles per-token variation automatically. Robust.
  • Cons: costs a reduction. For decode (n ~ 4096-16384), this is a few microseconds-small compared to the matmul, but not free.
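A sketch of the per-token dynamic quantization described in this subsection (PyTorch; a production kernel fuses the reduction with the subsequent matmul):

import torch

def dynamic_quantize_per_token(x, q_max=127):
    # x: [tokens, hidden] activations
    scale = x.float().abs().amax(dim=-1, keepdim=True) / q_max    # one scale per token (row)
    scale = scale.clamp(min=1e-8)                                 # guard against all-zero rows
    q = torch.round(x.float() / scale).clamp(-q_max, q_max).to(torch.int8)
    return q, scale                                               # INT8 codes + per-token scales

x = torch.randn(4, 4096, dtype=torch.float16)
q, s = dynamic_quantize_per_token(x)
x_hat = q.float() * s                                             # dequantized reference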

10.3 Per-token vs. per-tensor activation scales

Per-token (also called per-row) dynamic scaling is the gold standard. Per-tensor would require a global reduction across all tokens in a batch, which is impractical.

10.4 Practical guidance

  • For W8A8 inference: dynamic per-token activation scales, even if it costs 5% throughput. The accuracy gap to static is large.
  • For W4A16: no activation quantization at all-keep BF16. This is the "free" axis.
  • For FP8: per-tensor amax-based scaling, recomputed periodically (see §11).

11. FP8 inference

11.1 Hardware

NVIDIA H100 (Hopper) introduced FP8 tensor cores supporting:

  • E4M3 for activations and weights (forward path).
  • E5M2 for gradients (backward path, training).
  • 2× BF16 throughput on H100 SXM (~989 TFLOPS BF16 → ~1979 TFLOPS FP8).
  • Native dequant-on-load: tensor cores accept FP8 inputs and output FP32 accumulators.

Ada (RTX 4090, L40S) has FP8 support with similar speedup. Blackwell (B100, B200) extends to FP4.

11.2 Per-tensor scaling factor

Unlike INT8, where scale must be applied to recover the real value, FP8 is a floating-point format and represents a real number directly within its limited range. The catch: the limited range (~448 for E4M3) means real-world activations and weights need to be prescaled into the FP8 representable range.

The scaling factor is typically per-tensor and stored as FP32:

x_fp8 = quantize_e4m3(x_fp32 × scale_x)
w_fp8 = quantize_e4m3(W_fp32 × scale_w)        # quantized once at load
matmul_output (FP32 accumulator) = (1 / (scale_x × scale_w)) × Σ (x_fp8 × w_fp8)

The output is an FP32 accumulator; you then re-quantize to FP8 (with a new per-tensor scale) to feed the next layer.

11.3 Calibration: amax tracking

The scale factor for each tensor is set by tracking the maximum absolute value seen during calibration:

amax = max(|tensor|) across calibration set
scale = fp8_max / amax        # fp8_max ≈ 448 for E4M3

The challenge: amax is determined per tensor, and "per tensor" includes activations whose distribution depends on inputs. For weights this is static; for activations it must be tracked.
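A sketch of the amax → scale bookkeeping, simulating the E4M3 range by clamping at ±448 (real FP8 casting also rounds the mantissa to 3 bits and is done by the hardware/library; the function names are mine):

import torch

FP8_MAX = 448.0                                   # largest finite E4M3 value

def fp8_scale(amax):
    return FP8_MAX / amax

def quantize_e4m3_sim(x, scale):
    # Simulated: scale into range and clamp; ignores mantissa rounding
    return (x * scale).clamp(-FP8_MAX, FP8_MAX)

def fp8_matmul_sim(x, W, scale_x, scale_w):
    x8 = quantize_e4m3_sim(x, scale_x)
    w8 = quantize_e4m3_sim(W, scale_w)
    return (x8 @ w8.T) / (scale_x * scale_w)      # FP32 accumulation, then undo both scales

x, W = torch.randn(16, 1024), torch.randn(4096, 1024) * 0.02
y = fp8_matmul_sim(x, W, fp8_scale(x.abs().max()), fp8_scale(W.abs().max()))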

11.4 Delayed scaling

Recomputing amax every forward pass costs a reduction. NVIDIA TransformerEngine (TE) introduces delayed scaling:

  • Maintain a moving-window history of recent amax values (e.g., last 16 forward passes).
  • The scale used at step t is computed from the history through step t-1.
  • Rationale: amax changes slowly; using the previous amax for the current step is a good approximation.

This avoids the overhead of synchronous reduction inside each forward pass. The scale update is essentially an asynchronous bookkeeping operation.
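A minimal sketch of that bookkeeping (the 16-step window mirrors the example above; TransformerEngine's actual API and defaults differ):

from collections import deque

class DelayedScaling:
    # Track recent amax values; the scale for step t uses only history through step t-1.
    def __init__(self, fp8_max=448.0, history_len=16):
        self.fp8_max = fp8_max
        self.history = deque(maxlen=history_len)
        self.scale = 1.0                          # used before any history exists

    def pre_forward(self):
        if self.history:                          # compute the scale from past amax values only
            self.scale = self.fp8_max / max(self.history)
        return self.scale

    def post_forward(self, tensor_abs_max):
        self.history.append(float(tensor_abs_max))   # record this step's amax for future steps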

11.5 NVIDIA TransformerEngine

TE is the NVIDIA-blessed library for FP8 training and inference. Key features:

  • Drop-in replacement for nn.Linear, nn.LayerNorm, attention layers.
  • Manages amax history and scale computation transparently.
  • Supports FP8 GEMM, FP8 attention (FA-3 with FP8 KV-cache), and mixed FP8/BF16 layers.
  • Integrates with PyTorch autograd: forward in E4M3, backward in E5M2.

In production, FP8 inference for dense LLMs is currently the right call for prefill-heavy workloads (because you get 2× tensor-core throughput) and for models that fit the H100's FP8-friendly architecture (Llama, Mistral, Qwen). For memory-bound decode, INT4 is still the higher-leverage choice because it halves memory traffic again.

11.6 FP8 vs. INT8

Both are 8-bit. Why prefer FP8?

  • FP8 has a built-in exponent → it handles dynamic range natively.
  • INT8 requires explicit scale storage and scale-aware kernels.
  • FP8 supports backprop (E5M2 covers gradient range); INT8 does not naturally.
  • FP8 is trickier on hardware that doesn't support it (Ampere). INT8 is universal.

In 2024+: FP8 is rapidly displacing INT8 for forward-pass quantization on H100/Blackwell. INT8 (W8A8 with SmoothQuant) is still common for older hardware and edge deployments.


12. FP8 training

A full treatment is in DD 14 (distributed training). The TL;DR:

  • Forward: weights and activations (the copies fed to the FP8 GEMMs) → E4M3.
  • Backward: gradients → E5M2 (needs the wider exponent for the small magnitudes that arise from chain rule).
  • Master weights: kept in BF16 or FP32 to accumulate updates without underflow.
  • Per-tensor amax with delayed scaling: as above.
  • Loss scaling is no longer needed (FP8's exponent suffices for stable forward; E5M2 suffices for backward).

The training story is qualitatively similar to mixed-precision BF16 training but with a tighter scaling regime. NVIDIA's published recipes for FP8 Llama training show parity with BF16 within ~0.05 PPL on standard pretraining benchmarks at large scales.


13. On-the-fly dequantization and the Marlin kernel

W4A16 inference requires solving an unusual GEMM problem: the A operand is BF16, the B operand is INT4 packed as 4-bit values with FP16 per-group scales, and the output is BF16. Tensor cores don't natively accept INT4-with-scales. So the kernel must:

  1. Load packed INT4 weights from HBM.
  2. Unpack to BF16 in registers or shared memory.
  3. Multiply by per-group FP16 scale.
  4. Feed the resulting BF16 tile into a tensor-core BF16 matmul.

The naive implementation runs at FP16 GEMM throughput (no win) because dequantization is in series with the matmul. Marlin solves this.

13.1 Marlin kernel

Frantar, Castro, Chen, Ashkboos, Alistarh, "Marlin: Mixed-precision Auto-Regressive Linear kernels", 2024. (And the open-source marlin repository.)

Marlin is the highest-throughput open-source W4A16 GEMM as of writing. Key design decisions:

1. SMEM-based dequantization, double-buffered. While one warp dequantizes the next tile of weights from packed INT4 in SMEM into BF16 in another SMEM region, another warp consumes the previously-dequantized BF16 tile for tensor-core matmul. This hides dequant latency behind matmul latency.

2. Tensor-core BF16 matmul on dequantized weights. The matmul is a standard BF16 × BF16 → FP32 tensor-core GEMM. No new tensor-core type needed.

3. K-axis tile size aligned with group size. The inner-K tile is 128 (matching group_size = 128) so the per-group scale is loaded once per K-tile and held in registers throughout that tile's matmul.

4. Fused output dequantization. The per-group FP16 scale is multiplied into the BF16 weight tile before it goes into the tensor core. The tensor core itself sees pre-scaled BF16-no post-multiply pass.

5. Asynchronous HBM loads via cp.async. HBM → SMEM transfers use asynchronous copy instructions (cp.async on Ampere, TMA/cp.async.bulk on Hopper) to maximize bandwidth and hide latency behind SMEM-resident dequant + matmul.

13.2 Marlin performance, qualitatively

  • Decode (memory-bound): ~3× FP16 GEMM throughput. This is the regime where halving weight bytes halves runtime.
  • Prefill (compute-bound): approaches FP16 GEMM throughput. Once arithmetic dominates, the dequant savings are gone, but we don't lose much either.
  • Crossover batch size: roughly 16–32 tokens per forward pass on H100 with Llama-class models. Below this batch, you're memory-bound and INT4 wins big. Above this, INT4 is roughly tied with BF16 on time, but you still save the 4× HBM footprint (which lets you run bigger models or longer contexts).

13.3 Why this matters

The Marlin kernel is what makes W4A16 inference practical. Without it, you'd have a quantized model that loaded 4× faster but ran the matmul at FP16 speed-wasting half the available speedup. With it, decode genuinely doubles or triples.

The Marlin design is now widely copied: vLLM, TensorRT-LLM, and SGLang all integrate Marlin or Marlin-derived W4A16 kernels for their INT4 paths.


14. Mixed-precision inference

Not every layer is equally quantization-tolerant. Empirically:

  • Embedding layers and the LM head are sensitive (small numerical noise → wrong tokens).
  • LayerNorm parameters are tiny but multiplicative; quantizing them is rarely worth it.
  • Early MLP layers and early attention QKV projections are sometimes more fragile than later ones.
  • The down-projection of the MLP block is often the most outlier-sensitive.

Mixed-precision inference exploits this: quantize most layers aggressively (INT4), but keep the few sensitive ones in higher precision (BF16 or INT8).

14.1 LLM.int8() (Dettmers et al., 2022)

The seminal mixed-precision INT8 scheme. Splits each matmul into:

  • The outlier columns (~1% of input channels) → kept in FP16.
  • The regular columns (~99%) → quantized to INT8.
  • Two separate matmuls; results summed.

Implementation in bitsandbytes. Slower than pure-INT8 but enabled the first 175B models to fit on 8×A100s.
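A sketch of the decomposition (PyTorch, unfused and CPU-friendly; the threshold value and shapes are illustrative, and bitsandbytes' actual kernels differ):

import torch

def int8_outlier_matmul(x, W, threshold=6.0):
    # x: [tokens, in], W: [out, in]; split input channels by activation magnitude
    outlier = x.abs().amax(dim=0) > threshold              # [in] mask, ~1% of channels in practice
    y_fp16 = x[:, outlier] @ W[:, outlier].T               # outlier channels stay in FP16

    # Regular channels: per-token INT8 activations, per-row INT8 weights (symmetric)
    x_r, W_r = x[:, ~outlier], W[:, ~outlier]
    sx = x_r.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127
    sw = W_r.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127
    xq = torch.round(x_r / sx).clamp(-127, 127)
    wq = torch.round(W_r / sw).clamp(-127, 127)
    y_int8 = (xq @ wq.T) * sx * sw.T                       # dequantize the INT8 product

    return y_fp16 + y_int8                                 # sum of the two partial matmuls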

14.2 Per-layer schemes in vLLM

vLLM's quantization config supports per-layer precision selection. Typical recipe:

default: int4_awq_g128
overrides:
  - lm_head: bf16
  - model.embed_tokens: bf16
  - model.layers.0.mlp.down_proj: int8     # extra-sensitive, bumped to INT8

The user provides this from a YAML or programmatic API; vLLM dispatches each layer to the appropriate kernel.

14.3 The lookup-free heuristic

For pure W4A16 with AWQ or GPTQ, you usually don't need mixed precision at all on 7B+ dense models. The single biggest win from mixed precision is on:

  • Models smaller than 3B.
  • Models with known outlier-sensitivity (some MoE experts).
  • Long-context regimes where activation tails grow.

If you're shipping a 70B Llama at INT4 AWQ, just go full INT4-the quality is fine.


15. Calibration set design

Calibration is the process of collecting activation statistics (X for GPTQ, mean-abs activations for AWQ, amax for FP8) on a representative dataset.

15.1 Size

  • 100–1000 examples is typical. Below 100, statistics are noisy. Above 1000, marginal gains are small.
  • Each example should be a sequence of typical length (e.g., 512–2048 tokens).
  • Total calibration tokens: ~100K–1M.

15.2 Distribution

The calibration set should match the deployment distribution. Quantization is an empirical approximation-if you calibrate on Wikipedia and deploy on chat dialogues, the activation distribution differs and quality suffers.

For general-purpose chat models, common calibration sets:

  • WikiText-2 / WikiText-103: standard, English encyclopedic prose.
  • C4: web-scrape, more diverse.
  • Pile: research-grade pretraining mix.
  • In-domain prompts: a sample of real production queries (best, when available).

For domain-specific deployments (code, medical, legal), use in-domain calibration. The accuracy difference from generic-vs-in-domain calibration can be 10–30% on downstream evals.

15.3 Length matching

If your deployment uses 8K context, calibrate on 8K-token sequences (or longer than typical). Activation distributions shift with sequence length-the longer the context, the heavier the activation tails in attention layers.

15.4 Mode matching

Calibrate in inference mode with the same prefill/decode pattern your serving stack uses. In particular:

  • Calibrate with a chat template wrapping if you ship with one.
  • Calibrate with the system prompt prepended.
  • Calibrate with the same tokenizer settings (BOS/EOS handling).

Small mismatches here are common bug sources.

15.5 Practical workflow

1. Collect 256 prompts from production logs (or a representative public set).
2. Render them through your full chat template / tokenizer pipeline.
3. Truncate or pad to a common length matching your typical workload.
4. Run forward passes on the original FP16/BF16 model, capturing activations (a hook-based sketch follows this list).
5. Feed activations to GPTQ/AWQ/SmoothQuant.
6. Save the resulting quantized model.
7. Evaluate (next section).
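A sketch of step 4 using PyTorch forward hooks (module filtering and naming are illustrative; adapt to your model):

import torch

def capture_calibration_activations(model, calib_batches):
    # Record the inputs of every nn.Linear over the calibration set.
    captured, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # flatten [batch, seq, hidden] -> [tokens, hidden] and stash on CPU
            captured.setdefault(name, []).append(inputs[0].detach().flatten(0, -2).cpu())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        for batch in calib_batches:          # token IDs already rendered through the chat template
            model(batch)

    for h in handles:
        h.remove()
    # Concatenate per layer: [total_tokens, in_features], ready for GPTQ/AWQ/amax statistics
    return {k: torch.cat(v, dim=0) for k, v in captured.items()}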

16. Evaluation discipline

Quantization is an empirical engineering activity. You ship what passes evaluation. Two orthogonal axes:

16.1 Perplexity

Cheap (~1 GPU-hour for 7B on WikiText-2), sensitive (tenths of a PPL are detectable), but only a proxy. Run perplexity on a held-out set similar to your calibration distribution but disjoint:

  • WikiText-2 test split is the de-facto baseline.
  • C4 validation slice is also common.

What to look for:

  • INT4 AWQ/GPTQ on 7B+ dense: <0.5 PPL absolute increase from BF16.
  • INT4 on smaller models (<3B): up to 1-2 PPL.
  • W8A8 with SmoothQuant: <1 PPL.
  • FP8 inference: ~0.1 PPL or less.

If you see >2 PPL, something is wrong: bad calibration data, wrong group size, an unfused fold of s somewhere, or a layer that needs to stay BF16.

16.2 Downstream task evaluation

Perplexity is necessary but not sufficient. Always also measure on a battery of downstream benchmarks:

  • MMLU (Massive Multitask Language Understanding): broad knowledge.
  • HumanEval: code generation correctness (pass@1).
  • GSM8K: grade-school math reasoning.
  • HellaSwag / WinoGrande / ARC-Challenge: common-sense reasoning.
  • TruthfulQA: hallucination resistance.

A quantized model can have unchanged perplexity but drop 3% on MMLU-usually a sign that the long-tail factual recall has been damaged by precision loss in the LM head or final layers.

What to look for:

  • Quality-grade INT4: <1% absolute drop on MMLU and GSM8K.
  • Acceptable INT4: <2% drop.
  • W8A8: typically indistinguishable on benchmarks.

16.3 The full evaluation loop

1. Compute baseline metrics on the FP16 reference model.
2. Quantize with method X (AWQ, GPTQ, etc.) at precision P.
3. Recompute all metrics.
4. Compare: tabulate (metric, baseline, quantized, delta).
5. If any delta exceeds your acceptance threshold, iterate:
     - Try the other method.
     - Try a smaller group size (smaller groups track outliers better, at a small bits-per-element cost; 64 vs 128 is the common knob).
     - Try mixed-precision (keep one or two fragile layers BF16).
     - Try a different α for AWQ/SmoothQuant.
6. Lock the recipe in version control along with calibration data hash.

The recipe lockdown is non-negotiable: quantization is sensitive enough that "I re-ran AWQ and got 0.3 PPL different" is a real and frequent failure mode. Pin everything: the model checkpoint hash, the calibration sample IDs, the random seed, the version of the quantization tool.

16.4 The ship-criteria

Don't ship a quantization scheme without:

  1. ≤ X PPL increase on at least two domains.
  2. ≤ Y% drop on at least three downstream tasks.
  3. A throughput measurement showing the expected speedup is realized in your serving stack.
  4. A latency tail check (p99 first-token, p99 inter-token)-quantization shifts kernel performance in non-uniform ways.
  5. A quality A/B on real production traffic for a small fraction of users, before full rollout.

17. Practical exercises

These are exercises with full worked solutions where appropriate. Keep a calculator handy and try them yourself before reading the solutions.

Exercise 17.1: Memory footprint of a 70B model under various schemes

A Llama-3-70B has ~70.6B parameters. Compute the storage in GB under:

(a) FP32. (b) BF16. (c) INT8 per-channel symmetric. (d) INT4 group=128 with FP16 scale + FP16 zero_point. (e) INT4 group=64 with FP16 scale (symmetric, no zp).

Solution.

Let P = 70.6 × 10^9.

(a) FP32: 4 × P = 282.4 GB.

(b) BF16: 2 × P = 141.2 GB.

(c) INT8 per-channel: 1 × P for the integers, plus one FP16 scale per output channel. Output channels for Llama-3-70B sum to roughly ~10^6 across all linear layers, contributing ~2 MB-negligible. So ≈ 70.6 GB.

(d) INT4 group=128 with FP16 scale + FP16 zp: 4 × P / 8 = 35.3 GB for integers, plus (2 + 2) bytes × P / 128 = 4P/128 bytes ≈ 2.2 GB for scales+zp. Total ≈ 37.5 GB.

(e) INT4 group=64, FP16 scale, no zp: 35.3 GB for ints, plus 2 × P / 64 ≈ 2.2 GB for scales. Total ≈ 37.5 GB. (Smaller group means more scales, but no zp roughly compensates.)

The lesson: scale+zp overhead is meaningful at small group sizes. With FP16 scales, group = 8 already adds 2 bits per element, and at group = 1 the scale costs more than the 4-bit weight it describes.

Exercise 17.2: Effective bits per element

You're storing INT4 group=64, with FP32 scale and INT4 zp. Compute the effective bits per element.

Solution.

bits_per_elem = 4 (int) + 32/64 (scale) + 4/64 (zp)
              = 4 + 0.5 + 0.0625
              = 4.5625 bits/element

vs. INT4 group=128 FP16 scale only at 4.125. Group=64 with an FP32 scale costs ~0.44 extra bits per element-about 10% more weight memory-a useful reminder to use FP16 scales (or even packed INT8 scales in some implementations).

Exercise 17.3: Tracing AWQ on a 4-element row

Given:

W[0, :] = [0.20, -0.10, 0.05, -0.30]
calibration mean-abs activations a = [1.0, 1.0, 8.0, 1.0]

Compute AWQ INT4 quantization with α = 0.5, group_size = 4, Q_max = 7. Compare error vs. plain RTN.

Solution.

Plain RTN:

max|W| = 0.30
scale = 0.30 / 7 ≈ 0.04286
W/scale = [4.667, -2.333, 1.167, -7.0]
round → [5, -2, 1, -7]
W_dequant = [0.2143, -0.0857, 0.0429, -0.3000]
err = [-0.0143, -0.0143, 0.0071, 0.0]

Output error, per-channel:
err × a = [-0.0143, -0.0143, 0.0571, 0.0]
sum |err × a| = 0.0857

AWQ:

s = a^0.5 = [1.0, 1.0, 2.828, 1.0]
geomean = (1 × 1 × 2.828 × 1)^(1/4) ≈ 1.297
s ← s / 1.297 ≈ [0.771, 0.771, 2.181, 0.771]
W' = W × s = [0.1543, -0.0771, 0.1090, -0.2314]
max|W'| = 0.2314
scale' = 0.2314 / 7 ≈ 0.03306
W'/scale' = [4.668, -2.333, 3.298, -7.0]
round → [5, -2, 3, -7]
W'_dequant = [0.1653, -0.0661, 0.0992, -0.2314]
err in W'-space = [-0.0110, -0.0110, 0.0098, 0.0]
err in W-space (divide by s) = [-0.01427, -0.01427, 0.00450, 0.0]

Output error per-channel:
err × a = [-0.01427, -0.01427, 0.0360, 0.0]
sum |err × a| = 0.0645

AWQ reduced the channel-2 output error from 0.0571 to 0.036-a ~37% reduction on the salient channel. Total summed error fell from 0.086 to 0.065. The improvement is modest but consistent; with realistic LLM activation tails (10×–100× outliers, not 8×) and across millions of weights, these savings compound to the difference between a usable INT4 model and a broken one.

Exercise 17.4: GPTQ on 2 columns

Suppose:

w = [0.5, 0.3]    (1 row, 2 cols, just for illustration)
H = [[2.0, 0.5],
     [0.5, 1.0]]
H^{-1} = (1/(2×1 - 0.5×0.5)) × [[ 1.0, -0.5],
                                  [-0.5,  2.0]]
       = (1/1.75) × [[1.0, -0.5], [-0.5, 2.0]]
       ≈ [[0.5714, -0.2857],
          [-0.2857, 1.1429]]

Quantize using INT2 (Q_max = 1, codes ∈ {-1, 0, 1}) with per-column scale = 0.5. Trace GPTQ.

Solution.

Column 0:

w[0] = 0.5
q[0] = round(0.5 / 0.5) = 1
w_q[0] = 0.5 × 1 = 0.5
err = 0.5 - 0.5 = 0.0
No update needed for column 1 (err is 0 here, lucky).

Column 1 (no update from column 0):

w[1] = 0.3
q[1] = round(0.3 / 0.5) = 1   (rounds to nearest; 0.6 rounds up)
w_q[1] = 0.5
err = 0.3 - 0.5 = -0.2
No further columns to update.

Result: w_q = [0.5, 0.5]. Layer reconstruction error = Δw^T H Δw with Δw = (0, -0.2): (-0.2)^2 × H[1,1] = 0.04 × 1.0 = 0.04.

Now repeat with a non-trivial first-column error. Suppose w = [0.7, 0.3]:

Column 0: q[0] = round(0.7/0.5) = 1, w_q[0] = 0.5, err = 0.2.
OBS update for column 1:
  δw_1 = -err × H^{-1}[0,1] / H^{-1}[0,0]
       = -0.2 × (-0.2857 / 0.5714)
       = -0.2 × (-0.5)
       = +0.1
  w[1] ← 0.3 + 0.1 = 0.4

Column 1: q[1] = round(0.4/0.5) = 1, w_q[1] = 0.5, err = -0.1.
No further columns.

Final w_q = [0.5, 0.5]

Without the OBS update, w[1] would have stayed 0.3, which also rounds to 1 (0.3/0.5 = 0.6 rounds up), giving the same w_q = [0.5, 0.5]-the same answer in this tiny case. But in a real layer with many columns and a non-pathological Hessian, the OBS updates do change the rounding decisions of later columns and lower reconstruction error noticeably.

Exercise 17.5: SmoothQuant α derivation

Given activation max-abs a = [1, 1, 100] and weight column-max-abs w = [10, 10, 10]. Compute s for α ∈ {0.0, 0.5, 1.0}. After smoothing, what are the new activation and weight max-abs?

Solution.

For α = 0.0: s = a^0 / w^1 = [1, 1, 1] / [10, 10, 10] = [0.1, 0.1, 0.1].

  • New x' = x / s, max-abs [10, 10, 1000]. New w' = w × s, max-abs [1, 1, 1].
  • The activation outlier got worse (100 → 1000). Bad.

For α = 0.5: s = a^0.5 / w^0.5 = [1, 1, 10] / [√10, √10, √10] ≈ [0.316, 0.316, 3.162].

  • New x' max-abs: [1/0.316, 1/0.316, 100/3.162] ≈ [3.16, 3.16, 31.62].
  • New w' max-abs: [10×0.316, 10×0.316, 10×3.162] ≈ [3.16, 3.16, 31.62].
  • Activation and weight max are now equal-perfectly balanced.

For α = 1.0: s = a^1 / w^0 = [1, 1, 100] / [1, 1, 1] = [1, 1, 100].

  • New x' max-abs: [1, 1, 1]. New w' max-abs: [10, 10, 1000].
  • Now the weights have the outlier. Bad.

Conclusion: α = 0.5 makes activations and weights equally hard. The empirical optimum varies by architecture but lives in this neighborhood.

Exercise 17.6: End-to-end throughput estimate

You're serving Llama-3-70B at INT4 AWQ on H100 SXM (80 GB, 3.35 TB/s HBM). At batch size 1 (decode), estimate tokens-per-second.

Solution.

Weight footprint: ~37.5 GB (from Ex. 17.1). KV cache per token at FP16 (Llama-3-70B uses GQA with 8 KV heads): 2 (K+V) × 8 (KV heads) × 128 (head_dim) × 80 (layers) × 2 (bytes) ≈ 0.33 MB/token. At 4096 context: ~1.3 GB of KV cache. Total HBM traffic per token: 37.5 GB of weights plus the incremental KV reads ≈ 37.5 GB (the weight reads dominate).

Assume Marlin achieves 80% of peak HBM bandwidth in this regime: effective 0.8 × 3.35 = 2.68 TB/s.

Tokens/sec ≈ 2.68 TB/s / 37.5 GB ≈ 71.5 tokens/sec for the weight-traffic component alone.

Adding KV-cache reads (which scale with context length) and overhead, real-world numbers for vLLM-Marlin Llama-70B at INT4 are typically in the 50–80 tokens/sec range at batch 1, 4096 context-broadly consistent with this back-of-envelope. The same model at BF16 (141 GB of weights) no longer fits on a single H100; split across two GPUs with tensor parallelism it lands at roughly half the INT4 throughput, because the doubled aggregate bandwidth only partially offsets the ~3.8× larger weight traffic.


18. Cheat sheet

18.1 Default recipes

| Goal | Recipe |
|---|---|
| Maximum throughput, dense LLM | INT4 AWQ group=128 + Marlin kernel |
| Quality-first, dense LLM | INT4 GPTQ group=128 with act-order |
| W8A8 for older HW (A100, V100) | SmoothQuant + GPTQ INT8 weights + dynamic INT8 act |
| H100 prefill-bound | FP8 E4M3 with TransformerEngine |
| Small model (<3B) | INT8 PTQ or mixed-precision INT4 |
| Long-context with KV cache pressure | INT4 weights + FP8 KV cache |

18.2 Rules of thumb

  • INT4 group=128 ≈ 4.13–4.25 effective bits/element.
  • Halving weight bytes ≈ doubles decode throughput (memory-bound regime).
  • Marlin gives ~3× FP16 throughput on decode, near-parity on prefill.
  • AWQ ~ GPTQ in quality on 7B+ dense; AWQ slightly faster to calibrate, GPTQ slightly more flexible.
  • SmoothQuant α defaults: 0.5 generic, 0.85 for outlier-heavy Llamas.
  • Calibrate with 100–1000 examples matching deployment distribution.
  • Always evaluate perplexity AND downstream tasks before shipping.
  • FP8 ≠ INT8: FP8 has built-in dynamic range; INT8 needs explicit per-channel/per-group scale.

18.3 Common pitfalls

  • Forgetting to fold the AWQ s into the previous layer's parameters → runtime divide that wasn't budgeted.
  • Mismatch between calibration tokenization and deployment tokenization → activation distribution shift, accuracy drop.
  • Using a too-small group size with FP32 scales → effective bits balloons, savings disappear.
  • Static activation quantization with prompts that don't match calibration → silent quality regression.
  • No mixed-precision exception for the LM head → top-k token decisions degrade; benchmarks tank.
  • Skipping downstream eval and trusting perplexity → 0.5 PPL OK but MMLU drops 4 points.
  • Comparing pre-quant and post-quant on different decoding settings → spurious differences.
  • Forgetting the K-axis tile alignment → custom kernels run at half-speed because dequant doesn't fuse with matmul.

18.4 Further reading

  • Lin et al., AWQ, MLSys 2024.
  • Frantar et al., GPTQ, ICLR 2023.
  • Xiao et al., SmoothQuant, ICML 2023.
  • Frantar et al., Marlin, 2024.
  • Dettmers et al., LLM.int8(), NeurIPS 2022.
  • NVIDIA, FP8 Formats for Deep Learning (white paper, 2022).
  • NVIDIA TransformerEngine documentation.
  • The vLLM and TensorRT-LLM source trees-read the actual quantization configs and W4A16 kernels. They are the operational ground truth.

Closing notes

Quantization is the rare topic in AI systems where theory, kernels, and economics align cleanly: a few hundred lines of clean math (affine maps, OBS updates, magnitude redistribution) translate directly into 2–4× throughput, 4× memory savings, and tens of percent of inference cost reduction. It is also the topic where sloppy execution most reliably destroys model quality-quietly, without crashing-because every step is an empirical approximation.

The discipline that distinguishes a competent quantization engineer from a great one is paranoid evaluation. Anyone can run AutoAWQ on a Hugging Face checkpoint. The engineer who validates the result on three benchmarks, two domains, and a production traffic A/B is the one whose model actually ships and stays shipped.

Master this chapter and you have the leverage to halve your inference bill on any LLM workload. Master the failure modes and you have the discipline to do so without the customer noticing-except for their latency improving.

End of Deep Dive 09.
