Walk me through backpropagation. How does a neural network actually learn from a mistake?
formulate your answer, then read on
You mentioned the activation derivative — what happens to gradients in very deep networks, and how do residual connections address it?
formulate your answer, then read on
tldr
Backprop applies the chain rule backward through the network, computing ∂L/∂w at each layer from the cached forward-pass activations and the gradient arriving from the layer above. Vanishing gradients occur when small activation derivatives get multiplied across many layers; ReLU eased this because its derivative is 1 for positive inputs. Residual connections compute y = x + f(x), so the local Jacobian is I + ∂f/∂x: the identity term gives the gradient a path that is never scaled down by the activation derivatives, which is what makes very deep networks trainable.
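A minimal sketch of both points, assuming a toy 2-layer ReLU network with squared-error loss (the shapes, names, and constants below are illustrative, not part of the answer above): the backward pass reuses the cached forward activations at each chain-rule step, and the scalar tail shows why an identity term in each layer's derivative keeps the gradient product from collapsing.

```python
import numpy as np

# Hedged sketch, not code from the answer above: a manual backward pass for a
# tiny 2-layer ReLU network, y_hat = W2 @ relu(W1 @ x), with squared-error loss.
# All shapes, names, and toy values are illustrative assumptions.

rng = np.random.default_rng(0)
x  = rng.normal(size=(4, 1))          # input
y  = rng.normal(size=(2, 1))          # target
W1 = 0.5 * rng.normal(size=(3, 4))    # layer-1 weights
W2 = 0.5 * rng.normal(size=(2, 3))    # layer-2 weights

# Forward pass: cache every intermediate the backward pass will need.
z1 = W1 @ x                    # pre-activation
a1 = np.maximum(z1, 0.0)       # ReLU activation (cached)
y_hat = W2 @ a1
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: chain rule, one layer at a time, using the cached values.
d_yhat = y_hat - y             # dL/dy_hat
dW2 = d_yhat @ a1.T            # dL/dW2
d_a1 = W2.T @ d_yhat           # gradient flowing into layer 1's output
d_z1 = d_a1 * (z1 > 0)         # ReLU derivative: 1 where z1 > 0, else 0
dW1 = d_z1 @ x.T               # dL/dW1

# Sanity-check one entry of dW1 against a finite difference.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.maximum(W1p @ x, 0.0) - y) ** 2)
print("analytic:", dW1[0, 0], "numeric:", (loss_p - loss) / eps)

# Scalar picture of the vanishing-gradient point: a plain chain multiplies the
# per-layer derivative (0.25 here) 20 times; a residual block y = x + f(x) has
# local derivative 1 + f'(x), so the identity term keeps the product from
# collapsing toward zero.
plain    = 0.25 ** 20          # ~9e-13: gradient has effectively vanished
residual = 1.25 ** 20          # stays well away from zero
print("plain chain:", plain, "residual chain:", residual)
```

Note how the backward pass consumes nothing but the cached forward values (x, z1, a1) and the upstream gradient, which is exactly the "cached forward-pass activations" point in the tl;dr.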
follow-up
- How does batch normalization interact with backpropagation, and why does it help with training stability?
- What's the difference between gradient checkpointing and standard backprop, and when would you use it?
- How would you debug a network where training loss isn't decreasing at all from epoch 1?