mlprep

Explain how the transformer attention mechanism works — self-attention, multi-head attention, and positional encoding. Why did it replace RNNs, and what are its computational tradeoffs?

formulate your answer, then —

tldr

Self-attention lets every token attend directly to every other token: relevance scores are scaled dot products of queries (Q) and keys (K), softmaxed into weights, then used to take a weighted sum of values (V). Multi-head attention runs several such attentions in parallel over different learned subspaces, so separate heads can capture different relationship types. Because attention itself is order-agnostic, positional encodings are added to the input embeddings to inject sequence order. It replaced RNNs because it parallelizes across sequence length (full GPU utilization during training) and gives a constant-length gradient path between any two positions, eliminating vanishing gradients over long distances. The cost: attention is O(n²) in sequence length, in both time and memory.
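The tldr above can be sketched in a few lines. This is a minimal single-head illustration, assuming NumPy; the function and variable names are my own, not from any library:

```python
# Minimal sketch of scaled dot-product self-attention (single head).
# Shapes and names (self_attention, Wq/Wk/Wv) are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d_model); Wq/Wk/Wv: (d_model, d_k). Returns (n, d_k)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n): every token vs. every other
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted sum of values

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Multi-head attention would run h copies of this with d_k = d_model / h, concatenate the outputs, and apply one final projection; the O(n²) cost is visible in the (n, n) `scores` matrix.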

follow-up

  • How does masked attention (causal masking) work in decoder-only models like GPT, and why is it necessary for autoregressive generation?
  • Explain the architectural difference between an encoder-only transformer (BERT), a decoder-only (GPT), and an encoder-decoder (T5). When would you use each?
  • What is Flash Attention and how does it reduce the memory footprint of attention computation without changing the result?
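For the first follow-up: in a decoder-only model, position i may only attend to positions ≤ i, which is enforced by setting the scores for future positions to -inf before the softmax (so they get exactly zero weight). A minimal NumPy sketch, with names and shapes of my own choosing:

```python
# Sketch of causal (masked) attention: an upper-triangular mask blocks the future.
# Names (causal_attention) and shapes are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Q/K/V: (n, d_k). Returns (output, attention weights)."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    scores = np.where(future, -np.inf, scores)          # future positions -> zero weight
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = K = V = rng.normal(size=(5, 8))
out, weights = causal_attention(Q, K, V)
print(np.triu(weights, k=1).max())  # no mass on future positions
```

This is what makes autoregressive training work: all positions are trained in parallel, yet each prediction sees only its past, matching generation-time conditions.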