mlprep
ML Breadth · medium · 10 min

Walk me through exactly what happens inside an LLM when it generates a response — from receiving the raw prompt string to producing each output token. Cover tokenization, the forward pass, logit projection, and how temperature and top-p change the output distribution.

tldr

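Generation pipeline: tokenize → embed → transformer forward pass (prefill) → project to logits → temperature/top-k/top-p → sample token → decode to text → repeat. Prefill processes the full prompt in parallel and fills the KV cache; decode then generates one token at a time, reusing the cached keys and values. The hidden state at the last position is projected to one logit per vocabulary entry. Temperature T divides the logits before softmax: T < 1 sharpens the distribution, T > 1 flattens it, and T → 0 approaches greedy argmax. Top-p keeps the smallest set of highest-probability tokens whose cumulative probability reaches p, then renormalizes over that set. Each decode step costs O(n) attention per layer over the n cached positions.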
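
The sampling half of that pipeline can be sketched in plain Python. This is a minimal illustration, not how any particular inference library implements it: the function names (`softmax_with_temperature`, `top_p_filter`, `sample_next_token`, `generate`) and the `logits_fn` callback standing in for the model's forward pass are all hypothetical, and the loop recomputes from the full token list instead of using a KV cache.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=1.0):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize over the kept tokens
    return filtered


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Apply temperature, then top-p, then draw one token id."""
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(logits_fn, prompt_ids, max_new_tokens, eos_id=None, **sampling):
    """Autoregressive loop. logits_fn is a hypothetical stand-in for the model:
    it maps the token ids so far to next-token logits. A real decoder would
    reuse a KV cache here instead of reprocessing every id on each step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = sample_next_token(logits_fn(ids), **sampling)
        ids.append(token)
        if token == eos_id:
            break
    return ids
```

For example, `top_p_filter([0.5, 0.3, 0.15, 0.05], 0.7)` keeps only the first two tokens (cumulative mass 0.8) and renormalizes them to roughly 0.625 and 0.375. Applying temperature before top-p, as done here, is one common ordering; the order in which the filters run is a design choice and can change which tokens survive.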

follow-up

  • Why does lowering temperature make the model more deterministic but not necessarily more accurate?
  • What is repetition penalty and how is it applied in logit space?
  • How does beam search differ from nucleus sampling, and when would you prefer each?