Walk me through how transformers work. Start from the architecture — what's the core idea and why did it replace RNNs?
You mentioned the attention formula — queries, keys, and values. Why that specific framing? And what does "scaling by √d_k" actually prevent?
One more: transformers need positional encoding. Why — and what are the tradeoffs between sinusoidal and learned embeddings?
tldr
Transformers replaced RNNs by computing attention across all token pairs simultaneously — no sequential bottleneck, no vanishing gradient over long distances. Self-attention is a learned soft lookup: queries find matching keys and retrieve a blend of the corresponding values. Scaling by √d_k prevents softmax saturation: dot products grow in magnitude with dimension, and unscaled scores push the softmax toward one-hot outputs with near-zero gradients. Positional encoding is necessary because attention itself is order-blind — permuting the input tokens permutes the output identically.
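The summary above can be sketched in a few lines of NumPy — a minimal illustration, not a production implementation. `scaled_dot_product_attention` shows the soft-lookup view (rows of the weight matrix sum to 1), and `sinusoidal_positions` shows the fixed sine/cosine encoding from the original Transformer paper; function names and shapes here are my own choices for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: Q, K, V are (seq_len, d_k) arrays."""
    d_k = Q.shape[-1]
    # Similarity of every query to every key; divide by sqrt(d_k)
    # so score magnitude doesn't grow with dimension and saturate softmax.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row is a soft lookup over keys
    return weights @ V, weights          # blended values + attention map

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encoding (even dims sin, odd dims cos)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))  # geometric frequency ladder
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

Note the order-blindness in miniature: permuting the rows of Q, K, and V together just permutes the output rows, which is why the positional encoding gets added to the token embeddings before attention runs.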
follow-up
- How does masked self-attention in a decoder differ from encoder self-attention, and why is the mask necessary?
- What are the computational complexity tradeoffs of attention, and how do approaches like flash attention or sparse attention address them?
- How would you explain to a product team why a transformer fine-tuned on domain data often beats a larger general model?