mlprep / ML Breadth · medium · 15 min

Walk me through how transformers work. Start from the architecture — what's the core idea and why did it replace RNNs?

formulate your answer, then —

You mentioned the attention formula — queries, keys, and values. Why that specific framing? And what does "scaling by √d_k" actually prevent?

formulate your answer, then —

One more: transformers need positional encoding. Why — and what are the tradeoffs between sinusoidal and learned embeddings?

formulate your answer, then —

tldr

Transformers replaced RNNs by computing attention across all token pairs simultaneously — no sequential bottleneck, no vanishing gradients over long distances. Self-attention is a learned soft lookup: each query is scored against every key, and the output is a weighted blend of the values. Scaling by √d_k matters because dot products grow with dimension; dividing keeps the logits small enough that the softmax doesn't saturate and flatten the gradients. Positional encoding is necessary because attention itself is order-blind.
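
To pin the tldr down, here is a minimal NumPy sketch of scaled dot-product attention plus a quick check of its order-blindness; the `softmax` and `attention` helpers and the toy shapes are illustrative assumptions, not code from any particular framework.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> (n, d_v)."""
    d_k = Q.shape[-1]
    # Raw dot products grow with d_k; dividing by sqrt(d_k) keeps the logits
    # in a range where the softmax isn't saturated and gradients stay usable.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each query's soft lookup over the keys
    return weights @ V                  # blend of values, weighted by key match

# Toy usage: 4 query tokens attending over 6 key/value tokens.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 64)), rng.normal(size=(6, 64))
V = rng.normal(size=(6, 32))
out = attention(Q, K, V)
print(out.shape)  # (4, 32)

# Order-blindness: shuffling the key/value rows together leaves every output
# unchanged, because attention sees a set of (key, value) pairs, not a sequence.
# This is exactly why positional information has to be added to the inputs.
perm = rng.permutation(6)
print(np.allclose(out, attention(Q, K[perm], V[perm])))  # True
```

In a real transformer this runs per head over batched tensors, with masking and an output projection layered on top, but the scaling and the soft-lookup structure are the same.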

follow-up

  • How does masked self-attention in a decoder differ from encoder self-attention, and why is the mask necessary?
  • What are the computational complexity tradeoffs of attention, and how do approaches like FlashAttention or sparse attention address them?
  • How would you explain to a product team why a transformer fine-tuned on domain data often beats a larger general model?