Explain Batch Normalization and Layer Normalization. What problem does each solve, what are the key differences, and why do transformers universally use LayerNorm instead of BatchNorm?
Formulate your own answer first, then compare —
tldr
BatchNorm normalizes each feature across the batch; LayerNorm normalizes each sample across its features. Transformers use LayerNorm because: batch statistics are ill-defined for variable-length (padded) sequences, LayerNorm works even at batch size 1, and it keeps no running statistics, so behavior is identical at train and inference time. Modern architectures place it before the sublayer (Pre-LN) for training stability.
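The axis distinction above can be sketched in a few lines of NumPy (a minimal illustration of the normalization step only; the learned γ/β scale-and-shift and BatchNorm's running statistics are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 16))  # (batch, features)
eps = 1e-5

# BatchNorm: per feature, statistics taken across the batch (axis 0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# LayerNorm: per sample, statistics taken across the features (axis -1)
ln = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

# Each feature column of bn and each sample row of ln now has mean ~0, var ~1.
assert np.allclose(bn.mean(axis=0), 0, atol=1e-5)
assert np.allclose(ln.mean(axis=-1), 0, atol=1e-5)

# LayerNorm is well-defined at batch size 1; BatchNorm degenerates
# (each per-feature variance over a single sample is zero).
x1 = x[:1]
ln1 = (x1 - x1.mean(axis=-1, keepdims=True)) / np.sqrt(x1.var(axis=-1, keepdims=True) + eps)
```

Note that `ln1` equals the first row of `ln`: LayerNorm's output for a sample never depends on what else is in the batch, which is exactly why its train- and inference-time behavior match.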
follow-ups
- What is RMSNorm and why might it be preferred over standard LayerNorm in LLM training?
- How does Group Normalization fit into this picture, and when would you use it over BatchNorm for computer vision?
- Explain the role of the learned γ and β parameters in normalization layers — what would happen if you removed them?