Walk me through the main gradient descent variants — batch GD, SGD, mini-batch SGD, and Adam. What are the tradeoffs, when would you choose each, and how do learning rate schedules fit in?
Formulate your own answer first, then compare it with the summary below.
tldr
Full-batch GD computes exact gradients but is too slow and memory-hungry at scale; pure SGD (batch size 1) is cheap per step but noisy; mini-batch SGD balances gradient variance against hardware parallelism and is the practical foundation. Adam is the de facto default for deep learning because it adapts per-parameter learning rates automatically using running estimates of the gradient's first and second moments. Pair any optimizer with a learning rate schedule (especially warmup for transformers). When in doubt, start with AdamW + cosine decay with warmup.
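To make the "per-parameter adaptive learning rate" and "cosine decay with warmup" ideas concrete, here is a minimal dependency-free sketch of a single Adam update (for one scalar parameter) and a warmup + cosine schedule. This is illustrative pseudocode-made-runnable, not a production optimizer; the function names and default hyperparameters (`b1=0.9`, `b2=0.999`, `eps=1e-8`) follow common conventions but are assumptions, not a specific library's API.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad        # running mean of gradients (1st moment)
    v = b2 * v + (1 - b2) * grad ** 2   # running mean of squared gradients (2nd moment)
    m_hat = m / (1 - b1 ** t)           # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step size
    return theta, m, v

def cosine_warmup_lr(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Note how the effective step size `lr * m_hat / sqrt(v_hat)` shrinks for parameters with consistently large gradients and grows for rarely-updated ones; that is the "adapts per-parameter learning rates" property. AdamW differs only in applying weight decay directly to `theta` rather than folding it into `grad`.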
follow-up
- Why does the choice of batch size affect generalization, not just training speed?
- What is the "exploding/vanishing gradient" problem and how do modern architectures address it?
- How would you debug training instability — loss spikes, NaN gradients — in a production training job?