You mentioned transformers use self-attention. Walk me through the different types of attention — self vs cross attention, multi-head, and the scaling problem. What does "attention is O(n²)" actually mean in practice?
formulate your answer first, then compare it with the tldr below —
tldr
Attention = softmax(QK^T/√d_k)V. Three projection matrices because each token needs separate "what to look for" (Q), "what I offer" (K), and "what I contribute" (V) representations. Multi-head runs several such projections in parallel so each head can learn a different relationship type. Self-attention: Q, K, V all come from the same sequence. Cross-attention: Q from the decoder, K and V from the encoder. The O(n²) time and memory come from materializing the n×n QK^T score matrix, so doubling the context quadruples attention cost. Flash Attention avoids storing that matrix via tiling, without changing the math. Linear attention swaps the softmax for a kernel feature map so the product can be reordered as Q(K^TV), which is linear in sequence length.
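A minimal sketch of the formula in PyTorch, assuming single-head, unbatched tensors; the learned W_Q/W_K/W_V projections are omitted so the shapes stay visible, and `attention`, `dec`, `enc` are illustrative names, not any library's API:

```python
import torch

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (n_q, n_kv): the O(n^2) matrix
    weights = torch.softmax(scores, dim=-1)        # each query's distribution over keys
    return weights @ v                             # (n_q, d_v)

# Self-attention: Q, K, V are all projections of the same sequence.
x = torch.randn(128, 64)
self_out = attention(x, x, x)                      # (128, 64)

# Cross-attention: Q from the decoder, K and V from the encoder,
# so the score matrix is (n_dec, n_enc) rather than square.
dec = torch.randn(16, 64)
enc = torch.randn(128, 64)
cross_out = attention(dec, enc, enc)               # (16, 64)
```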
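Multi-head is the same computation run h times in parallel over d_model/h-dimensional slices. A sketch reusing `attention` from above, with the learned per-head and output projections again omitted:

```python
def multi_head(q, k, v, n_heads):
    # Split d_model into n_heads slices, attend per head, then re-concatenate.
    n, d_model = q.shape
    split = lambda t: t.view(n, n_heads, d_model // n_heads).transpose(0, 1)  # (h, n, d_h)
    q, k, v = split(q), split(k), split(v)
    out = attention(q, k, v)               # batched matmuls over the head dimension
    return out.transpose(0, 1).reshape(n, d_model)
```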
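To make the O(n²) concrete, a back-of-envelope for the score matrix alone, with illustrative numbers:

```python
# Memory for the n x n attention scores, per layer, if materialized in fp16.
n, heads, bytes_fp16 = 8192, 32, 2
scores = n * n * heads * bytes_fp16
print(f"{scores / 2**30:.0f} GiB per layer at n={n}")  # 4 GiB; at n=16384 it's 16 GiB
```

This per-layer blow-up is exactly what Flash Attention's tiling sidesteps: the softmax is computed in blocks that stay in on-chip SRAM, and the full n×n matrix is never written to HBM.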
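And a sketch of the linear-attention reordering. The exact softmax can't be factored this way, so this follows the kernelized formulation with φ = elu + 1 (Katharopoulos et al., 2020); it approximates softmax attention rather than reproducing it:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Kernelized attention: phi(Q) (phi(K)^T V), O(n * d^2) instead of O(n^2 * d)."""
    phi = lambda t: F.elu(t) + 1          # positive feature map standing in for softmax
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v          # (d_k, d_v): no n x n matrix is ever formed
    z = q @ k.sum(dim=0).unsqueeze(-1)    # (n, 1) normalizer: phi(q_i) . sum_j phi(k_j)
    return (q @ kv) / z

out = linear_attention(torch.randn(128, 64), torch.randn(128, 64), torch.randn(128, 64))
```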
follow-up
- How does Flash Attention achieve the same result as standard attention while using O(n) memory instead of O(n²)?
- What are rotary position embeddings (RoPE) and why have they largely replaced learned and sinusoidal position encodings?
- When would you use an encoder-only model (BERT) vs a decoder-only model (GPT) vs an encoder-decoder model (T5)?