Walk me through how transformers work. Start from the architecture — what's the core idea and why did it replace RNNs?
You mentioned the attention formula — queries, keys, and values. Why that specific framing? And what does "scaling by √d_k" actually prevent?
One more: transformers need positional encoding. Why — and what are the tradeoffs between sinusoidal and learned embeddings?
tldr
Transformers replaced RNNs by computing attention across all token pairs simultaneously — no sequential bottleneck, no vanishing gradient over long distances. Self-attention is a learned soft lookup: queries find matching keys and retrieve a blend of the corresponding values. Scaling by √d_k prevents softmax saturation: dot products grow in magnitude with dimension, and unscaled scores push the softmax toward one-hot outputs with near-zero gradients. Positional encoding is necessary because attention itself is order-blind — permuting the input tokens permutes the output identically.
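The summary above can be sketched in a few lines of NumPy — a minimal illustration, not a production implementation. `scaled_dot_product_attention` shows the soft-lookup view (rows of the weight matrix sum to 1), and `sinusoidal_positions` shows the fixed sine/cosine encoding from the original Transformer paper; function names and shapes here are my own choices for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: Q, K, V are (seq_len, d_k) arrays."""
    d_k = Q.shape[-1]
    # Similarity of every query to every key; divide by sqrt(d_k)
    # so score magnitude doesn't grow with dimension and saturate softmax.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row is a soft lookup over keys
    return weights @ V, weights          # blended values + attention map

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encoding (even dims sin, odd dims cos)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))  # geometric frequency ladder
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Note the order-blindness in miniature: permuting the rows of Q, K, and V together just permutes the output rows, which is why the positional encoding gets added to the token embeddings before attention runs.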
follow-up
- How does masked self-attention in a decoder differ from encoder self-attention, and why is the mask necessary?
- What are the computational complexity tradeoffs of attention, and how do approaches like flash attention or sparse attention address them?
- How would you explain to a product team why a transformer fine-tuned on domain data often beats a larger general model?