You mentioned transformers use self-attention. Walk me through the different types of attention — self vs cross attention, multi-head, and the scaling problem. What does "attention is O(n²)" actually mean in practice?
formulate your answer first, then compare it with the tldr below —
tldr
Attention = softmax(QK^T/√d_k)V. Three projection matrices because each token needs separate "what to look for" (Q), "what I offer" (K), and "what I contribute" (V) representations. Multi-head runs several such projections in parallel so each head can learn a different relationship type. Self-attention: Q, K, V all come from the same sequence. Cross-attention: Q from the decoder, K and V from the encoder. The O(n²) time and memory come from materializing the n×n QK^T score matrix, so doubling the context quadruples attention cost. Flash Attention avoids storing that matrix via tiling, without changing the math. Linear attention swaps the softmax for a kernel feature map so the product can be reordered as Q(K^TV), which is linear in sequence length.
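A minimal sketch of the formula in PyTorch, assuming single-head, unbatched tensors; the learned W_Q/W_K/W_V projections are omitted so the shapes stay visible, and `attention`, `dec`, `enc` are illustrative names, not any library's API:

```python
import torch

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (n_q, n_kv): the O(n^2) matrix
    weights = torch.softmax(scores, dim=-1)        # each query's distribution over keys
    return weights @ v                             # (n_q, d_v)

# Self-attention: Q, K, V are all projections of the same sequence.
x = torch.randn(128, 64)
self_out = attention(x, x, x)                      # (128, 64)

# Cross-attention: Q from the decoder, K and V from the encoder,
# so the score matrix is (n_dec, n_enc) rather than square.
dec = torch.randn(16, 64)
enc = torch.randn(128, 64)
cross_out = attention(dec, enc, enc)               # (16, 64)
```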
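Multi-head is the same computation run h times in parallel over d_model/h-dimensional slices. A sketch reusing `attention` from above, with the learned per-head and output projections again omitted:

```python
def multi_head(q, k, v, n_heads):
    # Split d_model into n_heads slices, attend per head, then re-concatenate.
    n, d_model = q.shape
    split = lambda t: t.view(n, n_heads, d_model // n_heads).transpose(0, 1)  # (h, n, d_h)
    q, k, v = split(q), split(k), split(v)
    out = attention(q, k, v)               # batched matmuls over the head dimension
    return out.transpose(0, 1).reshape(n, d_model)
```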
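To make the O(n²) concrete, a back-of-envelope for the score matrix alone, with illustrative numbers:

```python
# Memory for the n x n attention scores, per layer, if materialized in fp16.
n, heads, bytes_fp16 = 8192, 32, 2
scores = n * n * heads * bytes_fp16
print(f"{scores / 2**30:.0f} GiB per layer at n={n}")  # 4 GiB; at n=16384 it's 16 GiB
```

This per-layer blow-up is exactly what Flash Attention's tiling sidesteps: the softmax is computed in blocks that stay in on-chip SRAM, and the full n×n matrix is never written to HBM.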
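And a sketch of the linear-attention reordering. The exact softmax can't be factored this way, so this follows the kernelized formulation with φ = elu + 1 (Katharopoulos et al., 2020); it approximates softmax attention rather than reproducing it:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Kernelized attention: phi(Q) (phi(K)^T V), O(n * d^2) instead of O(n^2 * d)."""
    phi = lambda t: F.elu(t) + 1          # positive feature map standing in for softmax
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v          # (d_k, d_v): no n x n matrix is ever formed
    z = q @ k.sum(dim=0).unsqueeze(-1)    # (n, 1) normalizer: phi(q_i) . sum_j phi(k_j)
    return (q @ kv) / z

out = linear_attention(torch.randn(128, 64), torch.randn(128, 64), torch.randn(128, 64))
```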
follow-up
- How does Flash Attention achieve the same result as standard attention while using O(n) memory instead of O(n²)?
- What are rotary position embeddings (RoPE) and why have they largely replaced learned and sinusoidal position encodings?
- When would you use an encoder-only model (BERT) vs a decoder-only model (GPT) vs an encoder-decoder model (T5)?