mlprep
ML Breadth · hard · 14 min

Your model supports a 128K-token context window. Does that mean you should put everything into the prompt? Explain the tradeoffs.

formulate your answer, then —

tldr

A long context window increases what the model can read, but it raises latency, memory, and reliability costs; it does not replace retrieval, reranking, grounding, or evaluation. A senior answer should cover attention cost, KV-cache memory, positional generalization, distractors, and evidence-position testing.
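
A back-of-the-envelope KV-cache estimate makes the memory point concrete. The sketch below uses hypothetical dimensions for a 7B-class model with grouped-query attention (the layer count, KV-head count, and head size are illustrative assumptions, not any specific model's config); the takeaway is that KV memory grows linearly with context length, on top of attention FLOPs that grow quadratically at prefill.

```python
# Rough per-request KV-cache size for a hypothetical 7B-class transformer
# (illustrative config: 32 layers, 8 KV heads, head_dim 128, fp16/bf16).
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # 2 bytes per element in fp16/bf16
    # K and V each store n_kv_heads * head_dim values per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for ctx in (8_192, 131_072):
    print(f"{ctx:>7} tokens -> ~{kv_cache_bytes(ctx) / 2**30:.0f} GiB per request")
```

With these assumed dimensions, a single 128K request holds roughly 16x the KV cache of an 8K request (~16 GiB vs ~1 GiB), which is why long-context serving becomes memory-bound and batch sizes collapse, the bottleneck the third follow-up asks about.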

follow-up

  • How would you evaluate whether a model actually uses information from the middle of a long context? (A position-sweep sketch follows this list.)
  • When would RAG beat a long-context-only approach?
  • What serving bottlenecks appear when moving from 8K to 128K context?
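
For the first follow-up, a standard approach is a "needle in a haystack" sweep: plant a known fact at controlled depths inside filler text and measure retrieval accuracy as a function of position. A minimal sketch, assuming a hypothetical `call_model(prompt) -> str` wrapper around whatever inference API you use (the needle and filler strings are arbitrary placeholders):

```python
# Needle-in-a-haystack position sweep. `call_model` is a hypothetical
# stand-in for your inference endpoint; swap in your own client.
NEEDLE = "The vault code is 4921."
QUESTION = "\n\nWhat is the vault code? Answer with the number only."
FILLER = "The sky was clear and the market was quiet that day. "

def build_prompt(depth: float, n_chunks: int = 2000) -> str:
    # depth=0.0 places the needle at the start; depth=1.0 at the end.
    chunks = [FILLER] * n_chunks
    chunks.insert(round(depth * n_chunks), NEEDLE)
    return "".join(chunks) + QUESTION

def recall_by_depth(call_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=20):
    # A model that truly uses the middle of its context shows flat recall
    # across depths; a U-shaped curve signals "lost in the middle".
    return {d: sum("4921" in call_model(build_prompt(d))
                   for _ in range(trials)) / trials
            for d in depths}
```

Scale `n_chunks` so the prompt approaches the full 128K window, and repeat with distractor facts that superficially resemble the needle to test robustness, not just raw retrieval.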