mlprep
ML Breadth · hard · 12 min

Why do residual connections make very deep neural networks easier to train? Explain the mechanism, not just "they help gradients."

formulate your answer first, then read on

tldr

Residual connections reparameterize each block as an update around an identity path: y = x + F(x), so the block only has to learn the residual F(x) rather than a full transformation. Two mechanisms follow. First, gradient flow: the Jacobian of the block is ∂y/∂x = I + ∂F/∂x, so the backpropagated gradient always has a direct additive route to earlier layers and cannot be annihilated by a long product of small per-layer Jacobians. Second, optimization: at initialization each block is close to the identity, so a very deep residual network starts out behaving like a shallow one and deepens gradually as the residuals grow; empirically this smooths the loss landscape and avoids the degradation problem seen in plain deep stacks. ResNets and transformers both rely on this pattern.
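As a concrete illustration, here is a minimal residual block sketched in PyTorch; the class name, layer sizes, and depth are illustrative assumptions, not part of the card:

```python
# Minimal sketch of a (pre-norm) residual block: y = x + F(x),
# where F is a small MLP. Illustrative, not from the original card.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # F(x): the learned residual update around the identity path.
        self.f = nn.Sequential(
            nn.LayerNorm(dim),      # pre-norm placement, as in modern transformers
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)        # identity path + residual update

# Gradient-flow check: stack many blocks and confirm the input still
# receives a usable gradient, because dy/dx = I + dF/dx at every block.
blocks = nn.Sequential(*[ResidualBlock(64) for _ in range(50)])
x = torch.randn(8, 64, requires_grad=True)
blocks(x).sum().backward()
print(x.grad.norm())  # non-vanishing even through 50 stacked blocks
```

One common related trick, consistent with the near-identity view above: initializing the last linear layer of each residual branch at or near zero makes every block start exactly at the identity, which tends to stabilize very deep stacks.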

follow-up

  • Why are Pre-LN transformers usually more stable than Post-LN transformers?
  • Can residual connections hurt? What happens if residual updates become too large?
  • How do residual connections relate to highway networks and dense connections?