Explain the KV cache in autoregressive transformer inference. What does it save, what does it cost, and how does it affect serving?
formulate your own answer first, then read on
tldr
The KV cache stores the attention keys and values computed for earlier tokens during autoregressive decoding, so each new decode step attends over cached K/V instead of recomputing them. It trades memory for compute: per-step decode compute drops, but cache size grows with the number of layers, the context length, the per-layer K/V width (hidden size), and the number of concurrent requests. Long-context LLM serving is therefore often constrained by KV-cache memory rather than by compute.
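To make the memory term concrete, here is a minimal back-of-envelope sketch in Python. The function name, parameter names, and example numbers are illustrative assumptions, not tied to any particular model or serving stack; it just multiplies out the factors named above (layers, context length, K/V width, concurrency, element size).

```python
# Hypothetical KV-cache size estimator (illustrative parameter names and values).

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for every layer, token, and request."""
    # Each cached token stores one K and one V vector per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    return n_layers * context_len * batch_size * per_token_per_layer

# Example: a 32-layer model with 8 KV heads of dim 128, fp16 cache,
# serving 16 concurrent requests at 8k context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      context_len=8192, batch_size=16)
print(f"{size / 2**30:.1f} GiB")  # ~16 GiB for the cache alone
```

In this made-up configuration the cache alone occupies roughly 16 GiB, which is why shrinking any factor in the product (fewer KV heads via MQA/GQA, lower-precision cache entries, shorter effective context, smaller batches) directly changes how many requests fit on a device.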
follow-up
- What is the difference between prefill latency and decode latency?
- How do multi-query attention and grouped-query attention reduce KV cache memory?
- Why can increasing the context length reduce serving throughput, even when it improves model quality?