Explain model quantization — what it is, the main schemes (INT8, INT4, FP8), how it affects quality, and when you'd use it in production. Why does quantization matter more for LLMs than it did for earlier neural networks?
formulate your answer, then —
tldr
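Quantization stores weights (and sometimes activations) at lower precision, e.g. FP16 → INT8 or INT4, to shrink the model and speed up decoding. It matters more for LLMs than for earlier networks because they are far larger and small-batch decode is memory-bandwidth-bound: every generated token re-reads all the weights, so halving the bits per weight roughly halves the bytes moved and can approach a 2x gain in decode throughput when the kernels are efficient. INT4 weight-only methods such as GPTQ and AWQ typically keep quality loss to roughly 1-3% on common benchmarks. Not all layers tolerate quantization equally; embedding and unembedding layers usually stay in FP16. FP8 has native hardware support on H100-class GPUs and is increasingly used for training as well as inference. GGUF, the quantized model file format used by llama.cpp, makes it practical to run 70B+ models on consumer hardware with enough memory.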
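To make the bandwidth argument and the basic mechanism concrete, here is a minimal NumPy sketch. It assumes symmetric round-to-nearest INT8 with a single per-tensor scale, a hypothetical 70B-parameter model, and a hypothetical 2 TB/s of memory bandwidth. The function names are illustrative and not taken from any library, and the throughput figure is an idealized upper bound that ignores KV-cache reads, activations, and kernel overhead.

```python
import numpy as np

def weight_bytes(n_params: float, bits: int) -> float:
    """Bytes needed to store n_params weights at a given bit width (scales ignored)."""
    return n_params * bits / 8

def decode_tokens_per_s_bound(n_params: float, bits: int, bandwidth_bytes_per_s: float) -> float:
    """Idealized batch-1 decode bound: each token reads every weight exactly once."""
    return bandwidth_bytes_per_s / weight_bytes(n_params, bits)

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

# Hypothetical numbers for illustration only.
N_PARAMS = 70e9        # assumed 70B-parameter model
BANDWIDTH = 2e12       # assumed 2 TB/s memory bandwidth

for bits in (16, 8, 4):
    gb = weight_bytes(N_PARAMS, bits) / 1e9
    tps = decode_tokens_per_s_bound(N_PARAMS, bits, BANDWIDTH)
    print(f"{bits:>2}-bit: {gb:6.1f} GB of weights, at most {tps:5.1f} tokens/s")

# Round-trip error of the quantizer on random weights.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)
q, scale = quantize_int8(w)
max_err = float(np.max(np.abs(dequantize(q, scale) - w)))
print(f"scale={scale:.6f}, max abs error={max_err:.6f} (bounded by scale/2)")
```

Under these assumptions the weights take 140 GB at 16 bits and 35 GB at 4 bits, and the decode bound scales inversely with the bit width. Real speedups are smaller because dequantization costs compute and other memory traffic remains.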
follow-up
- How does group quantization (per-block scaling) improve quality over per-tensor quantization?
- What is a straight-through estimator and why is it needed for quantization-aware training?
- When would you choose FP8 training over BF16 training, and what hardware is required?
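As a starting point for the first follow-up, the sketch below compares per-tensor and per-group scaling at 4 bits on a weight matrix with one artificial outlier. It assumes symmetric round-to-nearest quantization and a group size of 32. `fake_quant` is a hypothetical helper written for this card, not the GPTQ or AWQ algorithm, both of which add calibration on top of grouped scales.

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int, group_size=None) -> np.ndarray:
    """Quantize then dequantize with symmetric round-to-nearest.

    group_size=None uses one scale for the whole tensor; otherwise each
    contiguous run of group_size values gets its own scale.
    """
    qmax = 2 ** (bits - 1) - 1
    flat = w.reshape(1, -1) if group_size is None else w.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero for all-zero groups
    q = np.clip(np.round(flat / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 128)).astype(np.float32)
w[0, 0] = 1.0  # one outlier stretches the per-tensor scale

err_tensor = float(np.mean((w - fake_quant(w, bits=4)) ** 2))
err_group = float(np.mean((w - fake_quant(w, bits=4, group_size=32)) ** 2))
print(f"per-tensor MSE: {err_tensor:.3e}")
print(f"per-group  MSE: {err_group:.3e}")
```

With a single scale, the outlier forces a coarse step size onto every weight. With groups, only the 32 values that share a block with the outlier pay that cost. The trade-off is storage: one FP16 scale per 32 4-bit weights adds 0.5 bits per weight.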