Walk me through how convolutional neural networks work. What does a convolution actually compute, and why does the architecture make sense for images?
You mentioned weight sharing enforces translation equivariance — what's the difference between equivariance and invariance, and where does that distinction matter in practice?
tldr
CNNs exploit locality and translation structure in images. A convolution applies a learned filter at every spatial position (weight sharing), producing a feature map that tracks where patterns appear. Stacked layers build from low-level edges to high-level concepts. CNNs are translation-equivariant (features track position) and approximately translation-invariant after pooling. Standard CNNs aren't rotation or scale invariant; data augmentation is the practical fix.
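
The equivariance vs. invariance point can be checked numerically. This is an illustrative sketch, not code from the conversation: the `conv2d` helper and the edge-detecting kernel are my own choices. Shifting the input shifts the feature map by the same amount (equivariance), while a global max pool gives the same value either way (approximate invariance).

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation (what deep learning calls 'convolution')."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A hand-picked vertical-edge filter (Sobel-like); in a CNN this is learned.
kernel = np.array([[1., 0., -1.],
                   [2., 0., -2.],
                   [1., 0., -1.]])

# 8x8 image with a bright vertical bar at column 3, and a copy shifted right by 1.
img = np.zeros((8, 8))
img[:, 3] = 1.0
img_shifted = np.roll(img, 1, axis=1)

resp = conv2d(img, kernel)
resp_shifted = conv2d(img_shifted, kernel)

# Equivariance: the response to the shifted image is the shifted response
# (ignoring the border column the shift pushes out of frame).
assert np.allclose(resp_shifted[:, 1:], resp[:, :-1])

# Approximate invariance after pooling: a global max pool discards position,
# so both images produce the same pooled value.
assert resp.max() == resp_shifted.max()
```

Note the feature map still *knows* where the bar is (equivariance); it is only the pooling step that throws that position away (invariance). That is why detection/segmentation heads tap the feature maps before aggressive pooling.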
follow-up
- How does a 1×1 convolution differ from a standard convolution and when is it useful?
- What is batch normalization doing in a CNN and why does it usually go between the convolution and the activation?
- How would you adapt a classification CNN backbone for a dense prediction task like semantic segmentation?