Why do residual connections make very deep neural networks easier to train? Explain the mechanism, not just "they help gradients."
Formulate your own answer first, then read on.
tldr
Residual connections make each block learn an update F(x) around an identity path: y = x + F(x). The mechanism is twofold. Gradients: the block's Jacobian is ∂y/∂x = I + ∂F/∂x, so backpropagation multiplies factors of the form (I + ∂F/∂x) instead of bare layer Jacobians; the identity terms give every earlier layer a direct additive route to the loss, so gradients don't shrink geometrically with depth. Optimization: if F starts near zero, a very deep residual network begins as roughly the identity function, so adding layers doesn't make the initial function harder to fit, and each block only has to learn a small correction to its input rather than a whole new representation. ResNets and transformers both rely on this pattern.
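To make this concrete, here is a minimal sketch (assuming PyTorch; the two-layer MLP block, the zero-initialized output projection, and the toy loss are illustrative choices, not any particular paper's recipe):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x), where F is a small two-layer MLP (illustrative choice)."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.LayerNorm(dim),       # normalize before the update (Pre-LN style)
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )
        # Zero-init the last projection so the block starts as the identity map.
        nn.init.zeros_(self.f[-1].weight)
        nn.init.zeros_(self.f[-1].bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)         # identity path + learned correction


# Stack many blocks: at init the whole stack computes (approximately) the
# identity, and each block's Jacobian is I + dF/dx, so gradients reach the
# input through the additive identity terms rather than through a long
# product of layer Jacobians.
depth, dim = 50, 64
net = nn.Sequential(*[ResidualBlock(dim) for _ in range(depth)])

x = torch.randn(8, dim, requires_grad=True)
loss = net(x).pow(2).mean()          # toy loss just to drive backprop
loss.backward()
print(x.grad.abs().mean())           # non-vanishing even at depth 50
```

The zero-init on the last projection is one common way to make the stack behave as the identity at initialization; without residual connections, the same 50-layer stack would start from a hard-to-optimize composition of random layers.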
follow-ups
- Why are Pre-LN transformers usually more stable than Post-LN transformers?
- Can residual connections hurt? What happens if residual updates become too large?
- How do residual connections relate to highway networks and dense connections?