mlprep
ML Breadth · medium · 12 min

Walk me through weight initialization. Why do Xavier and He initialization exist, and what breaks if initialization is wrong?

formulate your answer, then check the tldr below

tldr

Initialization keeps activation and gradient variance roughly constant across layers at the start of training: too small a scale and signals vanish layer by layer, too large and they explode or saturate. Xavier sets Var(W) = 2 / (fan_in + fan_out) and fits tanh-like activations; He sets Var(W) = 2 / fan_in to compensate for ReLU zeroing half its inputs. In modern deep models, initialization still matters because it interacts with residual paths, normalization, warmup, and fine-tuning.
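A minimal sketch of why the scale matters, using plain NumPy (the network, widths, and depth are illustrative, not from the card): pushing a batch through a deep stack of ReLU layers, He-scaled weights keep the activation magnitude roughly constant, while Xavier-scaled weights lose about half the signal's second moment per ReLU layer, so activations collapse toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_std(init_std, depth=50, width=256, batch=512):
    """Run a batch through `depth` linear + ReLU layers, with weights
    drawn at scale init_std(fan_in, fan_out); return the final std."""
    x = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * init_std(width, width)
        x = np.maximum(x @ W, 0.0)  # ReLU zeroes roughly half the units
    return float(x.std())

# He: Var(W) = 2 / fan_in, sized to undo ReLU halving the second moment
def he(fan_in, fan_out):
    return np.sqrt(2.0 / fan_in)

# Xavier: Var(W) = 2 / (fan_in + fan_out), derived for tanh-like units
def xavier(fan_in, fan_out):
    return np.sqrt(2.0 / (fan_in + fan_out))

print(forward_std(he))      # stays O(1): signal magnitude preserved
print(forward_std(xavier))  # shrinks ~2x per layer under ReLU: vanishes
```

Under ReLU, Xavier's per-layer second moment is halved, so after 50 layers the activations (and, by symmetry, the backward gradients) are numerically negligible; that is the practical failure mode the question asks about.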

follow-up

  • Why does ReLU need a different initialization scale than tanh?
  • How would you debug a model that produces NaNs in the first few hundred steps?
  • How should initialization differ when fine-tuning a pretrained model with a new head?