Compare batch normalization, layer normalization, and group normalization. When would you use each, and why did transformers switch from batch norm to layer norm?
formulate your answer first, then read on.
tldr
BatchNorm: normalize each feature across the batch. Effective for CNNs with large batches; breaks down with small batches and variable-length sequences, and the running statistics used at inference create a train/inference mismatch. LayerNorm: normalize each example across its features. Batch-size-independent with identical train and inference behavior, which is why transformers and NLP models use it by default. GroupNorm: normalize each example across groups of channels; a BatchNorm substitute for vision models trained with small batches. Rule of thumb: BatchNorm for large-batch CV, LayerNorm for transformers, GroupNorm for small-batch detection and segmentation.
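The practical difference is just which axes the statistics are computed over. A minimal sketch, assuming PyTorch and an (N, C, H, W) activation tensor; the learnable scale/shift parameters are omitted, and the shapes, group count, and eps are illustrative:

```python
import torch

def batch_norm(x, eps=1e-5):
    # Per-channel statistics across batch and spatial dims (N, H, W).
    # At inference these are replaced by running averages -> train/inference mismatch.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Per-example statistics across all features (C, H, W); no dependence on batch size.
    mean = x.mean(dim=(1, 2, 3), keepdim=True)
    var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def group_norm(x, num_groups=8, eps=1e-5):
    # Per-example statistics within each group of channels.
    n, c, h, w = x.shape
    x = x.view(n, num_groups, c // num_groups, h, w)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.view(n, c, h, w)

x = torch.randn(4, 16, 8, 8)  # small batch: fine for layer/group norm, risky for batch norm
print(batch_norm(x).shape, layer_norm(x).shape, group_norm(x).shape)
```

Note that only batch_norm mixes information across examples; the other two stay within a single example, which is what makes them robust to batch size and to variable-length inputs.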
follow-up
- Pre-layer norm vs post-layer norm in transformers — what's the difference and why does it matter for training stability?
- Why does batch norm act as a regularizer, and how does this interact with dropout?
- In what scenarios would you use RMSNorm instead of LayerNorm, and what does it drop from LayerNorm? (A reference sketch follows.)
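For the last follow-up, a minimal sketch contrasting the two over the feature dimension, assuming PyTorch; `d_model`, `gain`, and `bias` are illustrative names, not from the question:

```python
import torch

d_model = 512                      # illustrative hidden size
gain, bias = torch.ones(d_model), torch.zeros(d_model)

def layer_norm_last(x, eps=1e-5):
    # LayerNorm over the last (feature) dim: center, rescale by std, then affine.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gain * (x - mean) / torch.sqrt(var + eps) + bias

def rms_norm_last(x, eps=1e-5):
    # RMSNorm: drops the mean subtraction and the bias, keeping only RMS rescaling and gain.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gain * x / rms

x = torch.randn(2, 10, d_model)    # (batch, sequence, features)
print(layer_norm_last(x).shape, rms_norm_last(x).shape)
```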