Walk me through weight initialization. Why do Xavier and He initialization exist, and what breaks if initialization is wrong?
formulate your answer, then —
tldr
Initialization keeps activation and gradient variance stable at the start of training; if the per-layer scale is wrong, activations and gradients shrink or grow roughly geometrically with depth, which shows up as vanishing or exploding gradients, saturated units, or NaNs in the first steps. Xavier fits tanh-like activations; He fits ReLU-like activations. In modern deep models, initialization still matters because it interacts with residual paths, normalization, warmup, and fine-tuning.
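
A minimal NumPy sketch of the point above (not part of the original answer; the toy MLP, widths, and helper names `init` and `forward_stats` are made up for illustration): push random data through a deep ReLU stack and watch how the activation scale evolves under a naive fixed std, Xavier, and He initialization.

```python
import numpy as np

def init(fan_in, fan_out, scheme):
    # Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out)  -> sized for tanh-like activations
    # He/Kaiming:    Var(W) = 2 / fan_in              -> compensates for ReLU zeroing half the units
    if scheme == "xavier":
        std = np.sqrt(2.0 / (fan_in + fan_out))
    elif scheme == "he":
        std = np.sqrt(2.0 / fan_in)
    else:  # "naive": a fixed std that ignores layer width
        std = 0.01
    return np.random.randn(fan_in, fan_out) * std

def forward_stats(scheme, depth=50, width=512):
    """Forward random inputs through `depth` ReLU layers; return per-layer activation std."""
    x = np.random.randn(1024, width)
    stds = []
    for _ in range(depth):
        x = np.maximum(x @ init(width, width, scheme), 0.0)  # linear layer + ReLU
        stds.append(x.std())
    return stds

if __name__ == "__main__":
    np.random.seed(0)
    for scheme in ("naive", "xavier", "he"):
        stds = forward_stats(scheme)
        print(f"{scheme:>6}: layer 1 std = {stds[0]:.3e}, layer 50 std = {stds[-1]:.3e}")
```

With these assumptions, the naive scheme collapses toward zero within a few layers, Xavier decays slowly on a ReLU stack (each ReLU roughly halves the second moment, which Xavier does not account for), and He keeps the activation scale roughly constant with depth.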
follow-up
- Why does ReLU need a different initialization scale than tanh?
- How would you debug a model that produces NaNs in the first few hundred steps?
- How should initialization differ when fine-tuning a pretrained model with a new head?