Walk me through weight initialization. Why do Xavier and He initialization exist, and what breaks if initialization is wrong?
formulate your answer, then —
tldr
Initialization keeps activation and gradient variance stable at the start of training; if the per-layer scale is wrong, activations and gradients shrink or grow roughly geometrically with depth, which shows up as vanishing or exploding gradients, saturated units, or NaNs in the first steps. Xavier fits tanh-like activations; He fits ReLU-like activations. In modern deep models, initialization still matters because it interacts with residual paths, normalization, warmup, and fine-tuning.
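
A minimal NumPy sketch of the point above (not part of the original answer; the toy MLP, widths, and helper names `init` and `forward_stats` are made up for illustration): push random data through a deep ReLU stack and watch how the activation scale evolves under a naive fixed std, Xavier, and He initialization.

```python
import numpy as np

def init(fan_in, fan_out, scheme):
    # Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out)  -> sized for tanh-like activations
    # He/Kaiming:    Var(W) = 2 / fan_in              -> compensates for ReLU zeroing half the units
    if scheme == "xavier":
        std = np.sqrt(2.0 / (fan_in + fan_out))
    elif scheme == "he":
        std = np.sqrt(2.0 / fan_in)
    else:  # "naive": a fixed std that ignores layer width
        std = 0.01
    return np.random.randn(fan_in, fan_out) * std

def forward_stats(scheme, depth=50, width=512):
    """Forward random inputs through `depth` ReLU layers; return per-layer activation std."""
    x = np.random.randn(1024, width)
    stds = []
    for _ in range(depth):
        x = np.maximum(x @ init(width, width, scheme), 0.0)  # linear layer + ReLU
        stds.append(x.std())
    return stds

if __name__ == "__main__":
    np.random.seed(0)
    for scheme in ("naive", "xavier", "he"):
        stds = forward_stats(scheme)
        print(f"{scheme:>6}: layer 1 std = {stds[0]:.3e}, layer 50 std = {stds[-1]:.3e}")
```

With these assumptions, the naive scheme collapses toward zero within a few layers, Xavier decays slowly on a ReLU stack (each ReLU roughly halves the second moment, which Xavier does not account for), and He keeps the activation scale roughly constant with depth.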
follow-up
- Why does ReLU need a different initialization scale than tanh?
- How would you debug a model that produces NaNs in the first few hundred steps?
- How should initialization differ when fine-tuning a pretrained model with a new head?