Walk me through exactly what happens inside an LLM when it generates a response — from receiving the raw prompt string to producing each output token. Cover tokenization, the forward pass, logit projection, and how temperature and top-p change the output distribution.
formulate your answer, then —
tldr
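Generation pipeline: tokenize → embed → transformer forward pass (prefill) → project the final hidden state to vocabulary logits → apply temperature, then top-k/top-p filtering → sample a token → append it and repeat until a stop condition, detokenizing the output IDs to text. Prefill processes the full prompt in parallel and fills the KV cache; decode then generates one token at a time, reusing the cached keys and values. Temperature T divides the logits before the softmax: T < 1 sharpens the distribution, T > 1 flattens it, and T → 0 approaches greedy (argmax) decoding. Top-p keeps the smallest set of highest-probability tokens whose cumulative probability reaches p, then renormalizes over that set. With the cache, attention in each decode step costs O(n) per layer in the current sequence length n, instead of recomputing attention over the whole sequence.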
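The temperature and top-p steps can be sketched in a few lines of pure Python. This is a minimal illustration on a toy four-token vocabulary, not how any particular inference library implements it: the function names (`softmax`, `top_p_filter`, `sample_next_token`) and the logit values are assumptions made up for the example, and real systems do the same arithmetic on tensors over the full vocabulary.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature-scale, nucleus-filter, then draw one token id."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits for a 4-token vocabulary (illustrative values only).
toy_logits = [2.0, 1.0, 0.5, -1.0]

sharp = softmax(toy_logits, temperature=0.5)   # low T: mass concentrates on token 0
flat = softmax(toy_logits, temperature=2.0)    # high T: mass spreads out
nucleus = top_p_filter(softmax(toy_logits), top_p=0.8)  # tail tokens zeroed out
```

In this toy case the top token gets more probability at T = 0.5 than at T = 2.0, and with top_p = 0.8 only the two most likely tokens survive the filter. Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.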
follow-up
- Why does lowering temperature make the model more deterministic but not necessarily more accurate?
- What is repetition penalty and how is it applied in logit space?
- How does beam search differ from nucleus sampling, and when would you prefer each?