Explain how the transformer attention mechanism works — self-attention, multi-head attention, and positional encoding. Why did it replace RNNs, and what are its computational tradeoffs?
formulate your answer first, then compare —
tldr
Self-attention lets every token directly attend to every other token — relevance is computed as scaled dot products of queries (Q) and keys (K), softmaxed into weights, then used to take a weighted sum of the values (V). Multi-head attention runs several of these in parallel so different heads can capture different relationship types. Because attention itself is permutation-invariant, positional encodings (sinusoidal or learned) are added to the input embeddings to inject token order. It replaced RNNs because it parallelizes over sequence length (full GPU utilization instead of step-by-step recurrence) and gives every token pair a direct gradient path, eliminating vanishing gradients over long distances. The cost: time and memory are O(n²) in sequence length.
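To make the tldr concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention plus sinusoidal positional encoding. Function names and the single-head, unbatched shapes are my simplifications, not any library's API:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n_tokens, d_k). Returns (output, attention_weights)."""
    d_k = Q.shape[-1]
    # Relevance of every key to every query, scaled by sqrt(d_k)
    # to keep softmax inputs in a reasonable range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis: each row sums to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted sum of the value vectors.
    return weights @ V, weights

def sinusoidal_positional_encoding(n_tokens, d_model):
    """Fixed sin/cos position signal added to embeddings to encode order."""
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe
```

Multi-head attention is this same computation repeated h times on learned projections of Q, K, V, with the h outputs concatenated and projected back to d_model.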
follow-up
- How does masked attention (causal masking) work in decoder-only models like GPT, and why is it necessary for autoregressive generation?
- Explain the architectural difference between an encoder-only transformer (BERT), a decoder-only model (GPT), and an encoder-decoder (T5). When would you use each?
- What is Flash Attention and how does it reduce the memory footprint of attention computation without changing the result?
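As a hint for the first follow-up, here is a minimal sketch of causal masking, assuming the same unbatched single-head shapes as above. Setting future-position scores to -inf before the softmax zeroes their weights, so token i can only attend to tokens ≤ i:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked (causal) attention: token i may only attend to tokens <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Upper triangle (strictly above the diagonal) marks future positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[future] = -np.inf  # exp(-inf) = 0, so no weight on the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

This is what makes autoregressive generation consistent: at training time each position predicts the next token using only its prefix, matching the information available at inference time.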