mlprep
Deep Learning · medium · 12 min

Explain Batch Normalization and Layer Normalization. What problem does each solve, what are the key differences, and why do transformers universally use LayerNorm instead of BatchNorm?


tldr

BatchNorm normalizes each feature across the batch; LayerNorm normalizes each sample across its features. Transformers use LayerNorm because batch statistics are unreliable for variable-length, padded sequences; LayerNorm works even at batch size 1; and its behavior is identical at train and inference time (no running statistics to track). Modern architectures place the norm before the sublayer (Pre-LN) for training stability.
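The axis difference above can be sketched in a few lines of NumPy (a minimal illustration of the statistics only, ignoring the learned γ/β scale and shift and BatchNorm's running averages):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(4, 8))  # (batch, features)
eps = 1e-5

# BatchNorm: per-feature mean/variance computed across the batch (axis 0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm: per-sample mean/variance computed across features (axis 1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

# After BatchNorm, every feature column has zero mean across the batch;
# after LayerNorm, every sample row has zero mean across its features.
```

Note that the LayerNorm computation touches only one row at a time, which is why it is well-defined at batch size 1 and needs no train/inference distinction.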

follow-up

  • What is RMSNorm and why might it be preferred over standard LayerNorm in LLM training?
  • How does Group Normalization fit into this picture, and when would you use it over BatchNorm for computer vision?
  • Explain the role of the learned γ and β parameters in normalization layers — what would happen if you removed them?
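As a starting point for the first follow-up, here is a hedged sketch of RMSNorm as commonly described for LLM stacks: it drops the mean subtraction and the β bias, scaling each sample by its root-mean-square over features (the `gamma` parameter and `eps` value are illustrative choices, not a specific library's API):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Scale each sample by its root-mean-square over the feature axis;
    # unlike LayerNorm, no mean is subtracted and no bias is added.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

x = np.array([[3.0, 4.0]])
out = rms_norm(x, gamma=np.ones(2))
# RMS of [3, 4] is sqrt((9 + 16) / 2) ≈ 3.536
```

Skipping the mean/bias terms saves a reduction and a parameter per feature, which is one reason it is often preferred in large-scale training.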