mlprep
ML Breadth · hard · 14 min

Your model supports a 128K-token context window. Does that mean you should put everything into the prompt? Explain the tradeoffs.

formulate your answer, then —

tldr

A long context window increases what the model can read, but it raises latency, memory, and reliability costs; it does not replace retrieval, reranking, grounding, or evaluation. A senior answer should cover attention cost, KV-cache memory, positional generalization, distractors, and evidence-position testing.
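
A back-of-the-envelope KV-cache estimate makes the memory point concrete. The sketch below uses hypothetical dimensions for a 7B-class model with grouped-query attention (the layer count, KV-head count, and head size are illustrative assumptions, not any specific model's config); the takeaway is that KV memory grows linearly with context length, on top of attention FLOPs that grow quadratically at prefill.

```python
# Rough per-request KV-cache size for a hypothetical 7B-class transformer
# (illustrative config: 32 layers, 8 KV heads, head_dim 128, fp16/bf16).
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # 2 bytes per element in fp16/bf16
    # K and V each store n_kv_heads * head_dim values per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for ctx in (8_192, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_bytes(ctx) / 2**30:.0f} GiB per request")
```

With these assumed dimensions, a single 128K request holds roughly 16x the KV cache of an 8K request (~16 GiB vs ~1 GiB), which is why long-context serving becomes memory-bound and batch sizes collapse, the bottleneck the third follow-up asks about.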

follow-up

  • How would you evaluate whether a model actually uses information from the middle of a long context? (A position-sweep sketch follows this list.)
  • When would RAG beat a long-context-only approach?
  • What serving bottlenecks appear when moving from 8K to 128K context?
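
For the first follow-up, a standard approach is a "needle in a haystack" sweep: plant a known fact at controlled depths inside filler text and measure retrieval accuracy as a function of position. A minimal sketch, assuming a hypothetical `call_model(prompt) -> str` wrapper around whatever inference API you use (the needle and filler strings are arbitrary placeholders):

```python
# Needle-in-a-haystack position sweep. `call_model` is a hypothetical
# stand-in for your inference endpoint; swap in your own client.
NEEDLE = "The vault code is 4921."
QUESTION = "\n\nWhat is the vault code? Answer with the number only."
FILLER = "The sky was clear and the market was quiet that day. "

def build_prompt(depth: float, n_chunks: int = 2000) -> str:
    # depth=0.0 places the needle at the start; depth=1.0 at the end.
    chunks = [FILLER] * n_chunks
    chunks.insert(round(depth * n_chunks), NEEDLE)
    return "".join(chunks) + QUESTION

def recall_by_depth(call_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=20):
    # A model that truly uses the middle of its context shows flat recall
    # across depths; a U-shaped curve signals "lost in the middle".
    return {d: sum("4921" in call_model(build_prompt(d))
                   for _ in range(trials)) / trials
            for d in depths}
```

Scale `n_chunks` so the prompt approaches the full 128K window, and repeat with distractor facts that superficially resemble the needle to test robustness, not just raw retrieval.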