Explain how the transformer attention mechanism works — self-attention, multi-head attention, and positional encoding. Why did it replace RNNs, and what are its computational tradeoffs?
formulate your answer first, then compare —
tldr
Self-attention lets every token directly attend to every other token — relevance is computed as scaled dot products of queries (Q) and keys (K), softmaxed into weights, then used to take a weighted sum of the values (V). Multi-head attention runs several of these in parallel so different heads can capture different relationship types. Because attention itself is permutation-invariant, positional encodings (sinusoidal or learned) are added to the input embeddings to inject token order. It replaced RNNs because it parallelizes over sequence length (full GPU utilization instead of step-by-step recurrence) and gives every token pair a direct gradient path, eliminating vanishing gradients over long distances. The cost: time and memory are O(n²) in sequence length.
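To make the tldr concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention plus sinusoidal positional encoding. Function names and the single-head, unbatched shapes are my simplifications, not any library's API:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n_tokens, d_k). Returns (output, attention_weights)."""
    d_k = Q.shape[-1]
    # Relevance of every key to every query, scaled by sqrt(d_k)
    # to keep softmax inputs in a reasonable range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis: each row sums to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted sum of the value vectors.
    return weights @ V, weights

def sinusoidal_positional_encoding(n_tokens, d_model):
    """Fixed sin/cos position signal added to embeddings to encode order."""
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe
```

Multi-head attention is this same computation repeated h times on learned projections of Q, K, V, with the h outputs concatenated and projected back to d_model.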
follow-up
- How does masked attention (causal masking) work in decoder-only models like GPT, and why is it necessary for autoregressive generation?
- Explain the architectural difference between an encoder-only transformer (BERT), a decoder-only model (GPT), and an encoder-decoder (T5). When would you use each?
- What is Flash Attention and how does it reduce the memory footprint of attention computation without changing the result?
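As a hint for the first follow-up, here is a minimal sketch of causal masking, assuming the same unbatched single-head shapes as above. Setting future-position scores to -inf before the softmax zeroes their weights, so token i can only attend to tokens ≤ i:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked (causal) attention: token i may only attend to tokens <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Upper triangle (strictly above the diagonal) marks future positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[future] = -np.inf  # exp(-inf) = 0, so no weight on the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

This is what makes autoregressive generation consistent: at training time each position predicts the next token using only its prefix, matching the information available at inference time.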