Walk me through exactly what happens inside an LLM when it generates a response — from receiving the raw prompt string to producing each output token. Cover tokenization, the forward pass, logit projection, and how temperature and top-p change the output distribution.

Question

Accepted Answer

Stage 1: Tokenization

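Input text is split into subword tokens by a tokenizer trained with an algorithm such as BPE (byte-pair encoding) or a unigram language model, often through a library such as SentencePiece. Vocabularies typically hold roughly 32k–128k tokens, though some are larger. For example, "unbelievable" might become ["un", "believ", "able"]; the exact split depends on the vocabulary. Each token is then mapped to an integer ID.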

Tokens are not words. "2024" might be one token or four depending on the vocabulary. This matters for arithmetic, because the model operates on token sequences rather than decimal digits, and for character-level tasks such as counting letters, because the model sees token IDs and never the individual characters inside a token.
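As a rough illustration of how BPE applies its learned merges, here is a minimal sketch. The merge table `MERGE_RANKS` and the IDs in `VOCAB` are hypothetical, chosen only so that "unbelievable" splits as in the example above; a real tokenizer learns tens of thousands of merges from data and usually operates on bytes rather than characters.

```python
# Hypothetical merge table: lower rank = learned earlier = applied first.
MERGE_RANKS = {
    ("u", "n"): 0, ("a", "b"): 1, ("l", "e"): 2, ("ab", "le"): 3,
    ("b", "e"): 4, ("l", "i"): 5, ("be", "li"): 6, ("e", "v"): 7,
    ("beli", "ev"): 8,
}
# Hypothetical token-to-ID table.
VOCAB = {"un": 0, "believ": 1, "able": 2}


def bpe_encode(word, merge_ranks):
    """Start from single characters and repeatedly merge the best-ranked adjacent pair."""
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; pairs with no learned merge get infinity.
        ranked = [
            (merge_ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        rank, i = min(ranked)
        if rank == float("inf"):
            break  # no learned merge applies any more
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


tokens = bpe_encode("unbelievable", MERGE_RANKS)
token_ids = [VOCAB[t] for t in tokens]
print(tokens, token_ids)
```

With a different merge table the same word would split differently, which is why token counts vary between models.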

Stage 2: Embedding lookup + position encoding

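Token IDs are mapped to dense vectors by a lookup in the embedding matrix E ∈ ℝ^(vocab × d_model). Output: a sequence of vectors, shape [n_tokens, d_model]. How position enters depends on the architecture. Models that use RoPE (Rotary Position Embedding) add nothing at this stage; instead they rotate the Q and K vectors inside each attention layer by an angle proportional to the token's position. Older designs add a learned or sinusoidal position vector to the embeddings here.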
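The lookup and the rotary rotation can be sketched in a few lines. This is an illustrative toy, not any model's actual code: the `EMBEDDING` values are made up, and the sketch assumes the interleaved pairing of dimensions (0,1), (2,3), and so on, with the commonly used base of 10000. Some implementations pair dimension i with i + d/2 instead.

```python
import math

# Hypothetical embedding matrix: 3 tokens, d_model = 4 (values are made up).
EMBEDDING = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, -0.1, 0.0, 0.2],
    [-0.3, 0.7, 0.1, -0.2],
]


def embed(token_ids):
    """Embedding lookup: one row of E per token ID."""
    return [list(EMBEDDING[t]) for t in token_ids]


def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of dimensions by an angle proportional to pos."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q, k = EMBEDDING[0], EMBEDDING[1]
# The attention score depends only on the relative offset (2 in both cases).
print(dot(rope(q, 3), rope(k, 1)), dot(rope(q, 7), rope(k, 5)))
```

The two printed scores agree up to floating-point error, which is the property that makes RoPE a relative position scheme.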

Stage 3: Transformer forward pass (prefill)

The full prompt is processed in one parallel forward pass through L transformer layers:
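- Self-attention with causal mask (each token attends only to itself and previous positions)
- FFN: position-wise MLP with a nonlinearity (e.g., GELU, or a gated variant such as SwiGLU)
- Normalization (LayerNorm or RMSNorm) and residual connections around each sublayer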

Output: hidden states of shape [n_tokens, d_model]. The key and value vectors for every position in every layer are cached for the decode steps that follow.
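To make the causal mask concrete, here is a minimal single-head sketch in plain Python. It assumes the Q, K and V projections have already been applied and leaves out multiple heads, the FFN, normalization and residuals; the function names are illustrative only.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head attention where position i sees only positions 0..i."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scores against itself and earlier positions only: this is the causal mask.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)

# Changing the last position leaves earlier outputs untouched.
x_changed = [[1.0, 0.0], [0.0, 1.0], [-5.0, 9.0]]
out_changed = causal_attention(x_changed, x_changed, x_changed)
print(out[:2] == out_changed[:2])
```

Because earlier outputs never depend on later tokens, the keys and values computed during prefill stay valid and can be reused on every decode step.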

Stage 4: Logit projection

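The final hidden state at the last token position (usually after a final normalization layer) is multiplied by the unembedding matrix, shape [d_model, vocab_size], to give a logit vector. These are raw, unnormalized scores, one for every token in the vocabulary. Some models tie this matrix to the transpose of the embedding matrix E.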
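The projection itself is a single vector-matrix product. A tiny sketch with made-up numbers (d_model = 2, vocab_size = 3; the names `hidden` and `unembed` are hypothetical):

```python
def project_to_logits(hidden, unembed):
    """hidden: [d_model]; unembed: [d_model][vocab_size]; returns [vocab_size] logits."""
    vocab_size = len(unembed[0])
    return [
        sum(hidden[i] * unembed[i][t] for i in range(len(hidden)))
        for t in range(vocab_size)
    ]


hidden = [1.0, 2.0]              # final hidden state at the last position
unembed = [[1.0, 0.0, -1.0],     # row for hidden dimension 0
           [0.5, 1.0, 0.0]]      # row for hidden dimension 1
logits = project_to_logits(hidden, unembed)
print(logits)  # [2.0, 2.0, -1.0]
```

Logits can be any real number and do not sum to anything in particular; they only become probabilities after the softmax in the next stage.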

Stage 5: Sampling

Convert logits to a probability distribution and sample:

Temperature scaling: logits = logits / T
- T=1.0: model's native distribution
- T<1: distribution sharpens — high-probability tokens dominate
- T→0: approaches argmax / greedy decoding (deterministic); T=0 itself would divide by zero, so implementations special-case it as argmax
- T>1: distribution flattens — more randomness, higher diversity
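The effect of T is easy to see numerically. A minimal sketch with made-up logits follows; the function name is illustrative, and T = 0 is rejected here rather than special-cased.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; handle T=0 as argmax separately")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The ranking of tokens never changes with T; only the gaps between their probabilities do.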

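Top-k: set all but the k highest logits to −∞ before softmax, so those tokens get probability zero. (Zeroing a logit would not remove the token, since a logit of 0 still maps to a nonzero probability.) This prevents sampling from the long tail of very improbable tokens.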
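A minimal top-k sketch, assuming logits arrive as a plain list. Note that it masks with negative infinity rather than zero; the function names are illustrative.

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest with -inf."""
    if k <= 0 or k >= len(logits):
        return list(logits)  # filter disabled
    threshold = sorted(logits, reverse=True)[k - 1]
    # Ties at the threshold are all kept, so slightly more than k may survive.
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


filtered = top_k_filter([1.0, 3.0, 2.0, 0.0], k=2)
probs = softmax(filtered)
print(filtered, probs)
```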

Top-p (nucleus): sort tokens by probability and keep the smallest set whose cumulative probability is ≥ p, then renormalize. The cutoff is adaptive: more tokens survive when the distribution is flat (uncertain), fewer when it is peaked (confident).
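The adaptive behaviour shows up clearly on two hand-picked distributions. This sketch operates on probabilities directly and is one straightforward way to write the filter, not a specific library's implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set with cumulative mass >= p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / total
    return out


peaked = [0.9, 0.05, 0.03, 0.02]   # confident model
flat = [0.25, 0.25, 0.25, 0.25]    # uncertain model
print(top_p_filter(peaked, 0.8))   # only the first token survives
print(top_p_filter(flat, 0.8))     # all four tokens survive
```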

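Typical application order: temperature → top-k → top-p → softmax → categorical sample. Top-p needs probabilities, so implementations compute a softmax internally to find the cutoff, mask the excluded logits, and let the final softmax renormalize what remains. The exact order varies between libraries.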
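Putting the steps in that order gives a compact sampler. This is a sketch under simplifying assumptions: the function name and defaults are made up, `temperature == 0` is treated as greedy, and the nucleus step works on probabilities and lets `random.choices` renormalize the surviving weights, which has the same effect as masking logits and applying a final softmax.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    if temperature == 0:
        # Greedy special case: avoid dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])
    # 1. Temperature
    scaled = [x / temperature for x in logits]
    # 2. Top-k: mask everything below the k-th largest logit
    if 0 < top_k < len(scaled):
        threshold = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= threshold else -math.inf for x in scaled]
    # 3. Softmax (numerically stable)
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-p: keep the smallest set with cumulative mass >= top_p
    if top_p < 1.0:
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        kept, cumulative = set(), 0.0
        for i in order:
            kept.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        probs = [pr if i in kept else 0.0 for i, pr in enumerate(probs)]
    # 5. Categorical sample; random.choices normalizes the weights itself
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # seeded for reproducibility
logits = [0.1, 2.5, 0.3, 1.0]
print(sample_next_token(logits, temperature=0.8, top_k=2, top_p=0.95, rng=rng))
```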

Stage 6: Decode token → text

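The sampled token ID is mapped back to its token string (or byte sequence) by vocabulary lookup. The pieces are concatenated and decoded as UTF-8. With byte-level vocabularies a single token can hold only part of a multi-byte character, so streaming decoders buffer until the bytes form valid text. BPE tokens often encode a leading space (shown as the Ġ prefix in GPT-2-style vocabularies), which determines word spacing without needing a separate delimiter.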
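One practical consequence for streaming: with byte-level vocabularies a token may end in the middle of a multi-byte character. The sketch below uses Python's incremental UTF-8 decoder to buffer such fragments. The token byte strings are hypothetical and stand in for the result of a vocabulary lookup.

```python
import codecs


def stream_decode(token_bytes):
    """Decode token byte strings one at a time, buffering incomplete characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    return [decoder.decode(chunk) for chunk in token_bytes]


# Hypothetical tokens: "é" (bytes C3 A9) is split across two tokens,
# and the last token carries its own leading space.
tokens = [b"caf", b"\xc3", b"\xa9", b" ok"]
pieces = stream_decode(tokens)
print(pieces)            # ['caf', '', 'é', ' ok']
print("".join(pieces))   # café ok
```

The empty string in the second slot is the decoder holding back a byte until the character is complete.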

Stage 7: Next decode step (autoregressive loop)

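The new token is appended to the sequence, and the forward pass runs for this single new token only. The projections and FFN cost the same regardless of sequence length, while attention reads the KV cache and costs O(n) in the current sequence length n. The new token's keys and values are appended to the cache. Stages 4 to 6 then produce the next token. Repeat until an EOS token is sampled or max_tokens is reached.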
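The control flow of the loop can be sketched with a stand-in model. Everything here is hypothetical: `toy_forward` is not a real network (it simply favours the token ID one higher than its input), the "cache" stores token IDs in place of keys and values, decoding is greedy, and prefill is done sequentially for brevity even though real systems process the prompt in parallel.

```python
EOS_ID = 0
VOCAB_SIZE = 5


def toy_forward(token_id, kv_cache):
    """Stand-in for a one-token forward pass: grows the cache, returns logits."""
    kv_cache.append(token_id)  # placeholder for this position's keys/values
    target = (token_id + 1) % VOCAB_SIZE
    return [1.0 if t == target else 0.0 for t in range(VOCAB_SIZE)]


def generate(prompt_ids, max_tokens):
    if not prompt_ids:
        raise ValueError("prompt must contain at least one token")
    kv_cache = []
    logits = None
    for tok in prompt_ids:              # prefill
        logits = toy_forward(tok, kv_cache)
    generated = []
    while len(generated) < max_tokens:  # decode loop
        next_id = max(range(VOCAB_SIZE), key=lambda t: logits[t])  # greedy pick
        if next_id == EOS_ID:
            break
        generated.append(next_id)
        logits = toy_forward(next_id, kv_cache)  # one new token per step
    return generated, len(kv_cache)


print(generate([2], max_tokens=10))  # stops when EOS is selected
print(generate([1], max_tokens=2))   # stops at the max_tokens limit
```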

Intervention points

| Stage | What you control |
|---|---|
| Logits | Temperature, logit bias (boost/suppress specific tokens) |
| Post-logit | Top-k, top-p, min-p, repetition penalty |
| Sampling | Greedy vs stochastic vs beam search |
| Training time | RLHF reshapes the underlying distribution before any sampling |
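Two of the logit-level controls in the table are simple enough to sketch directly. The function names are illustrative, and the repetition penalty shown is one common formulation (shrink positive logits, push negative ones further down); other variants exist.

```python
def apply_logit_bias(logits, bias):
    """Add a per-token offset; a large negative value effectively bans a token."""
    out = list(logits)
    for token_id, delta in bias.items():
        out[token_id] += delta
    return out


def apply_repetition_penalty(logits, seen_ids, penalty):
    """Make already-generated tokens less likely (penalty > 1)."""
    out = list(logits)
    for t in set(seen_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out


print(apply_logit_bias([1.0, 2.0, 3.0], {2: -100.0}))
print(apply_repetition_penalty([2.0, -1.0, 0.5], seen_ids=[0, 1, 1], penalty=2.0))
```

Both run before temperature and truncation, so they change which tokens reach the sampling step at all.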

Common misconception

Lowering temperature makes the output more deterministic, which can look like the model is "more confident", but it does not make the model more accurate. It sharpens the model's existing distribution: if the model's most likely token is wrong, greedy decoding will produce the wrong answer every time. Temperature is a sharpness knob, not an accuracy knob.
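A small numeric check of this point, using made-up logits in which the model slightly prefers a wrong token (index 0) over the correct one (index 1). Which token counts as "correct" is an assumption of the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.0]
CORRECT_ID = 1  # assumed correct answer; the model ranks it second

for t in (1.0, 0.5, 0.1):
    p_correct = softmax_with_temperature(logits, t)[CORRECT_ID]
    print(t, round(p_correct, 4))
```

The probability of the correct token falls as T drops, because sharpening moves mass toward whatever the model already ranked first.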