Explain the KV cache in autoregressive transformer inference. What does it save, what does it cost, and how does it affect serving?
formulate your own answer first, then read on
tldr
The KV cache stores the attention keys and values computed for earlier tokens during autoregressive decoding, so each new decode step attends over cached K/V instead of recomputing them. It trades memory for compute: per-step decode compute drops, but cache size grows with the number of layers, the context length, the per-layer K/V width (hidden size), and the number of concurrent requests. Long-context LLM serving is therefore often constrained by KV-cache memory rather than by compute.
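To make the memory term concrete, here is a minimal back-of-envelope sketch in Python. The function name, parameter names, and example numbers are illustrative assumptions, not tied to any particular model or serving stack; it just multiplies out the factors named above (layers, context length, K/V width, concurrency, element size).

```python
# Hypothetical KV-cache size estimator (illustrative parameter names and values).

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for every layer, token, and request."""
    # Each cached token stores one K and one V vector per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    return n_layers * context_len * batch_size * per_token_per_layer

# Example: a 32-layer model with 8 KV heads of dim 128, fp16 cache,
# serving 16 concurrent requests at 8k context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      context_len=8192, batch_size=16)
print(f"{size / 2**30:.1f} GiB")  # ~16 GiB for the cache alone
```

In this made-up configuration the cache alone occupies roughly 16 GiB, which is why shrinking any factor in the product (fewer KV heads via MQA/GQA, lower-precision cache entries, shorter effective context, smaller batches) directly changes how many requests fit on a device.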
follow-up
- What is the difference between prefill latency and decode latency?
- How do multi-query attention and grouped-query attention reduce KV cache memory?
- Why can increasing the context length reduce serving throughput, even when it improves model quality?