Explain model quantization — what it is, the main schemes (INT8, INT4, FP8), how it affects quality, and when you'd use it in production. Why does quantization matter more for LLMs than it did for earlier neural networks?

Question

Accepted Answer

What quantization does

Model weights are stored as floating-point numbers — typically BF16 or FP32 during training. Quantization maps these to lower-precision formats:

INT8:  1 sign + 7 magnitude bits → 256 values, 1 byte
INT4:  16 values per weight,       0.5 bytes
FP8:   8-bit float (E4M3 or E5M2), 1 byte

The standard approach: for each tensor (or sub-block), find the range [min, max] and linearly map it to the quantization grid.

q = round(w / scale + zero_point)     # quantize
w̃ = (q - zero_point) × scale          # dequantize
quantization_error = w̃ - w

Why it matters more for LLMs

LLM inference is memory-bandwidth-bound at decode time. The bottleneck is streaming weight matrices from HBM to compute units — not arithmetic throughput.

A Llama-3 70B model in FP16: 140 GB. At 80 GB/s HBM bandwidth (A100), streaming all weights takes ~1.75 seconds per token — far too slow. INT4 quantization: 35 GB, ~0.44s/token. You can fit the model on fewer GPUs and generate tokens faster.

Rule of thumb: halving precision roughly doubles decode throughput until you hit compute limits.

Quantization schemes

Post-Training Quantization (PTQ): quantize weights after training, no gradient updates. Fast to apply, can degrade quality, especially at INT4 or below.

Quantization-Aware Training (QAT): simulate quantization noise during training with straight-through estimators. Better quality, expensive.

Weight-only quantization: quantize weights to INT4/INT8; activations stay in FP16 at runtime. Common for LLM serving — weights dominate memory, activations are smaller.

W8A8 / W4A8: quantize both weights and activations. Enables INT8 matrix multiplication — uses GPU integer SIMD units, which are faster than FP16 CUDA cores on some hardware.

GPTQ, AWQ, and GGUF

GPTQ (Frantar et al., 2022): PTQ that minimizes per-layer quantization error by solving a weighted least-squares problem. 4-bit with near-lossless quality on language benchmarks. Runs on GPU.

AWQ (Activation-aware Weight Quantization): identifies salient weights (high activation scale) and protects them from aggressive quantization. Better quality than GPTQ at INT4.

GGUF (llama.cpp): format for running quantized models on CPU+GPU with mixed precision. Q4_K_M: 4-bit with different bit widths for different layers based on sensitivity. Enables running 70B models on a MacBook.

Quality degradation patterns

- FP16 → BF16: nearly lossless on most tasks
- FP16 → INT8 (PTQ): ~1-2% degradation on benchmarks, often imperceptible
- FP16 → INT4 (PTQ naive): significant degradation on reasoning, math, code
- FP16 → INT4 (GPTQ/AWQ): 1-3% degradation, usually acceptable
- INT4 with per-group quantization (block size 128): much better than per-tensor

Sensitivity varies by layer

Not all layers tolerate quantization equally. The first and last layers (embedding, unembedding) are highly sensitive — often kept in FP16 even when everything else is INT4. Attention QKV projections are more sensitive than FFN layers. This is why group/block quantization and mixed-precision schemes exist.

Production decision framework

| Constraint | Choice |
|---|---|
| Need exact reproducibility | FP16 or BF16 |
| Reduce VRAM, small quality hit OK | INT8 weight-only |
| Maximum compression, acceptable quality loss | INT4 AWQ/GPTQ |
| Consumer hardware / edge deployment | GGUF Q4_K_M |
| Maximum throughput on modern GPU | FP8 (H100) |