mlprep
ML Breadth · hard · 12 min

Compare batch normalization, layer normalization, and group normalization. When would you use each, and why did transformers switch from batch norm to layer norm?

Formulate your own answer first, then —

tldr

  • BatchNorm: normalizes each feature across the batch dimension. Effective for CNNs with large batches; fails with small batches and variable-length sequences, and its running statistics cause a train–inference mismatch.
  • LayerNorm: normalizes across the features of each example. Batch-size-independent, identical behavior at train and inference — the default for transformers and NLP.
  • GroupNorm: normalizes across groups of channels within each example — for CV with small batches.
Rule of thumb: BatchNorm for large-batch CV, LayerNorm for transformers, GroupNorm for small-batch detection.
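The three variants differ only in which axes the mean and variance are computed over. A minimal NumPy sketch (shapes and group count are illustrative; the learnable scale/shift and BatchNorm's running statistics are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8, 8))  # toy activations: (N, C, H, W)
eps = 1e-5

def normalize(x, axes):
    """Zero-mean, unit-variance over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

bn = normalize(x, (0, 2, 3))  # BatchNorm: stats per channel, pooled across the batch
ln = normalize(x, (1, 2, 3))  # LayerNorm: stats per example, pooled across features

g = x.reshape(4, 3, 2, 8, 8)  # GroupNorm: split C=6 into 3 groups of 2 channels
gn = normalize(g, (2, 3, 4)).reshape(x.shape)
```

Note that only `bn` mixes statistics across examples — which is exactly why its quality degrades when the batch is small, while `ln` and `gn` are unaffected.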

follow-up

  • Pre-layer norm vs post-layer norm in transformers — what's the difference and why does it matter for training stability?
  • Why does batch norm act as a regularizer, and how does this interact with dropout?
  • In what scenarios would you use RMSNorm instead of LayerNorm, and what does it drop from LayerNorm?
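For the last follow-up, the core of the answer is that RMSNorm drops LayerNorm's mean subtraction and rescales by the root-mean-square alone. A minimal NumPy sketch (shapes illustrative, learnable gain omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=(2, 8))  # non-zero-mean activations
eps = 1e-6

def layer_norm(x):
    """LayerNorm: center by the mean, then scale by the standard deviation."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x):
    """RMSNorm: no centering; rescale by the root-mean-square only."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

ln, rms = layer_norm(x), rms_norm(x)
# LayerNorm output is exactly zero-mean; RMSNorm output generally is not,
# but RMSNorm saves the mean computation and subtraction per call.
```

Dropping the centering step makes RMSNorm slightly cheaper, which is one reason it appears in large-scale transformer stacks.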