Implement scaled dot-product attention in NumPy. Then extend it to multi-head attention. Walk me through the shapes at each step.
Your implementation computes each projection with its own matrix multiply (X @ W_q, and likewise for K and V). In PyTorch's actual nn.MultiheadAttention, the Q, K, V projections are often fused into one matrix when the embedding dims match. How would you implement that, and why might it be faster?
tldr
Attention = softmax(QK^T / sqrt(d_k)) @ V. The sqrt(d_k) scaling keeps the logits in a range where softmax stays well-behaved: for roughly unit-variance inputs, a d_k-dimensional dot product has variance about d_k, so dividing by sqrt(d_k) normalizes it back toward unit variance and prevents softmax from saturating. Multi-head splits the representation into num_heads subspaces, runs attention independently in each, then concatenates, which lets different heads learn different relationship patterns. Shapes: after splitting, the stacked tensor is (num_heads, seq_len, d_k) with d_k = d_model / num_heads, so each head sees a (seq_len, d_k) slice. Fused QKV projection merges three matmuls into one larger one, trading three small kernel launches for a single big GEMM that uses the GPU better.
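A minimal NumPy sketch of both pieces, with the shapes annotated at each step (the weight and function names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (..., seq_len_q, d_k), K: (..., seq_len_k, d_k), V: (..., seq_len_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (..., seq_len_q, seq_len_k)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ V                                  # (..., seq_len_q, d_v)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); each W_*: (d_model, d_model)
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    heads = scaled_dot_product_attention(Q, K, V)       # (num_heads, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                 # (seq_len, d_model)

# Quick shape check with random weights:
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 64))                  # seq_len=10, d_model=64
W = [rng.standard_normal((64, 64)) for _ in range(4)]
out = multi_head_attention(X, *W, num_heads=8)     # -> (10, 64)
```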
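And a sketch of the fused projection: store the three weight matrices side by side, do one matmul, then slice. PyTorch does something similar, keeping a single in_proj_weight when the Q/K/V dims match; one large GEMM means one kernel launch instead of three and better hardware utilization:

```python
def fused_qkv(X, W_qkv):
    # W_qkv: (d_model, 3 * d_model), the three projection matrices
    # concatenated, so one big matmul replaces three smaller ones.
    QKV = X @ W_qkv                        # (seq_len, 3 * d_model)
    Q, K, V = np.split(QKV, 3, axis=-1)    # three (seq_len, d_model) slices
    return Q, K, V
```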
follow-up
- FlashAttention is described as "IO-aware" attention — what memory bottleneck is it solving, and how does tiling help? (A tiled online-softmax sketch follows this list.)
- How would you modify this implementation to support cross-attention (Q from one sequence, K and V from another), as used in encoder-decoder transformers? (See the cross-attention sketch below.)
- Implement positional encoding in NumPy — both the original sinusoidal version and RoPE (rotary position embeddings). (See the sketch at the end.)
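On the FlashAttention question: standard attention materializes the full (seq_len, seq_len) score matrix in slow HBM, and shuttling that matrix back and forth dominates the runtime. Tiling processes K/V a block at a time in fast on-chip memory, maintaining running softmax statistics so the full matrix never exists. A NumPy sketch of that online-softmax idea (2-D inputs only; the tile size and names are illustrative, and this shows the algorithm, not FlashAttention's actual kernel):

```python
def attention_tiled(Q, K, V, tile=64):
    # Never builds the (seq_len_q, seq_len_k) score matrix: for each K/V tile,
    # update a running row max, running denominator, and running weighted sum.
    d_k = Q.shape[-1]
    m = np.full(Q.shape[0], -np.inf)            # running row max of scores
    l = np.zeros(Q.shape[0])                    # running softmax denominator
    acc = np.zeros((Q.shape[0], V.shape[-1]))   # running weighted sum of V
    for s in range(0, K.shape[0], tile):
        S = Q @ K[s:s+tile].T / np.sqrt(d_k)    # (seq_len_q, tile)
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)               # rescale old stats to new max
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=-1)
        acc = acc * scale[:, None] + P @ V[s:s+tile]
        m = m_new
    return acc / l[:, None]
```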
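For cross-attention, one way to adapt the earlier sketch: take Q from one sequence and K, V from another, so the score matrix becomes (seq_len_q, seq_len_kv). A minimal sketch reusing scaled_dot_product_attention from above (names are illustrative):

```python
def cross_attention(X_q, X_kv, W_q, W_k, W_v):
    # X_q: (seq_len_q, d_model), e.g. decoder states supplying the queries.
    # X_kv: (seq_len_kv, d_model), e.g. encoder states supplying keys/values.
    Q = X_q @ W_q                    # (seq_len_q, d_k)
    K = X_kv @ W_k                   # (seq_len_kv, d_k)
    V = X_kv @ W_v                   # (seq_len_kv, d_v)
    return scaled_dot_product_attention(Q, K, V)   # (seq_len_q, d_v)
```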
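And a sketch of both positional encodings (d_model assumed even; function names are mine). The sinusoidal version follows the original transformer formula and is added to the embeddings; RoPE instead rotates each (even, odd) pair of Q and K dimensions by a position-dependent angle before the dot product, here using the interleaved-pair convention:

```python
def sinusoidal_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))            # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope(x, base=10000):
    # x: (seq_len, d) with d even — typically Q or K, not the raw embeddings.
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    theta = base ** (-np.arange(0, d, 2) / d)[None, :] # (1, d/2) frequencies
    angles = pos * theta                               # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # paired dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                 # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```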