mlprep / ML Breadth · medium · 12 min

Walk me through gradient descent variants — SGD, momentum, Adam. How do you decide which to use?

formulate your answer, then —

You mentioned Adam's bias correction — what exactly does it fix, and are there situations where Adam is the wrong choice?

formulate your answer, then —

tldr

Adam combines momentum (a smoothed gradient direction, the first moment) with RMSProp-style per-parameter adaptive learning rates (the second moment) and bias correction (a fix for the zero initialization of both moment estimates). It's the default for transformers and NLP. SGD + momentum often generalizes better for vision because it tends to find flatter minima. When using weight decay, prefer AdamW over Adam: AdamW decouples the decay from the adaptive gradient update, so it acts as true weight decay rather than an L2 penalty rescaled by the adaptive step.
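The update above can be sketched in a few lines of NumPy. This is a minimal single-parameter illustration, not a production optimizer; the helper name `adam_step` is made up for this card, and the default hyperparameters (`lr`, `beta1`, `beta2`, `eps`) are the commonly used ones:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of gradients (the momentum part).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: EMA of squared gradients (the RMSProp-style adaptive scale).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: m and v start at 0, so early EMAs underestimate the true
    # moments; dividing by (1 - beta^t) rescales them (matters most at small t).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2 starting from theta = 5.0.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):          # t is 1-indexed so bias correction is defined
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
# theta ends up near the minimum at 0
```

Note what bias correction buys you: at `t = 1`, `m = 0.1 * grad`, but `m_hat = grad` exactly, so the first steps are not artificially tiny. Without it, you would effectively be starting with a much smaller learning rate, which is also why warmup and bias correction interact (one of the follow-ups below).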

follow-up

  • What is learning rate scheduling and how does warmup interact with Adam's bias correction?
  • How does weight decay in AdamW differ from L2 regularization in standard Adam?
  • What are the practical signs that your optimizer has converged to a sharp minimum, and how would you address that?