Explain Batch Normalization and Layer Normalization. What problem does each solve, what are the key differences, and why do transformers universally use LayerNorm instead of BatchNorm?
Formulate your own answer first, then compare —
tldr
BatchNorm normalizes each feature across the batch; LayerNorm normalizes each sample across its features. Transformers use LayerNorm because: batch statistics are ill-defined for variable-length (padded) sequences, LayerNorm works even at batch size 1, and it keeps no running statistics, so behavior is identical at train and inference time. Modern architectures place it before the sublayer (Pre-LN) for training stability.
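The axis distinction above can be sketched in a few lines of NumPy (a minimal illustration of the normalization step only; the learned γ/β scale-and-shift and BatchNorm's running statistics are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 16))  # (batch, features)
eps = 1e-5

# BatchNorm: per feature, statistics taken across the batch (axis 0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm: per sample, statistics taken across the features (axis -1)
ln = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

# Each feature column of bn and each sample row of ln now has mean ~0, var ~1.
assert np.allclose(bn.mean(axis=0), 0, atol=1e-5)
assert np.allclose(ln.mean(axis=-1), 0, atol=1e-5)

# LayerNorm is well-defined at batch size 1; BatchNorm degenerates
# (each per-feature variance over a single sample is zero).
x1 = x[:1]
ln1 = (x1 - x1.mean(axis=-1, keepdims=True)) / np.sqrt(x1.var(axis=-1, keepdims=True) + eps)
```

Note that `ln1` equals the first row of `ln`: LayerNorm's output for a sample never depends on what else is in the batch, which is exactly why its train- and inference-time behavior match.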
follow-ups
- What is RMSNorm and why might it be preferred over standard LayerNorm in LLM training?
- How does Group Normalization fit into this picture, and when would you use it over BatchNorm for computer vision?
- Explain the role of the learned γ and β parameters in normalization layers — what would happen if you removed them?