mlprep / ML Breadth · medium · 12 min

Compare SGD, Adam, and AdamW. What does each actually do, and when would you prefer SGD over Adam despite Adam being adaptive?

formulate your answer, then read on —

tldr

SGD: one global learning rate for all parameters (usually with momentum); often the best generalization in vision settings, but it needs careful tuning of the learning rate and schedule. Adam: per-parameter adaptive learning rates built from first- and second-moment estimates of the gradient; fast convergence and robust to sparse or noisy gradients. AdamW: Adam with decoupled weight decay — the decay is applied directly to the weights rather than folded into the gradient, where Adam's adaptive scaling would distort it; this is the preferred regularization behavior for modern networks and the default for transformers. Prefer SGD+momentum for CNN baselines where you can afford the tuning; prefer AdamW for transformers and fine-tuning. Be cautious with Adam plus L2-style regularization: the adaptive denominator rescales the decay term per parameter, so AdamW is usually the cleaner choice.
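The Adam-vs-AdamW distinction is easiest to see in the update rule itself. Below is a minimal single-step sketch (not any library's implementation): with `decoupled=False` the weight-decay term is added to the gradient and then passed through Adam's moment estimates and adaptive denominator; with `decoupled=True` it is subtracted from the weights directly, bypassing the adaptive scaling.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              wd=0.0, decoupled=False):
    """One optimizer step on parameter w with gradient g.

    decoupled=False -> Adam with L2 regularization: decay enters the
    gradient, so it gets rescaled by the adaptive denominator.
    decoupled=True  -> AdamW: decay is applied straight to the weights.
    """
    if not decoupled:
        g = g + wd * w                       # L2: decay folded into gradient
    m = b1 * m + (1 - b1) * g                # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g            # second moment (per-param scale)
    m_hat = m / (1 - b1 ** t)                # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w                  # AdamW: decay bypasses adaptation
    return w, m, v
```

With `wd=0` the two variants coincide; with `wd>0` they diverge because the L2 path divides the decay by `sqrt(v_hat)`, effectively shrinking regularization on parameters with large gradient variance — the behavior AdamW was designed to remove.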

follow-up

  • What is learning rate warmup, why do transformers need it, and what happens if you skip it?
  • Why does Adam find sharper minima than SGD, and how does the Lion optimizer try to address this?
  • What is gradient clipping, when should you use it, and how does it interact with adaptive optimizers?