Walk me through learning rate scheduling. What are the common strategies, and how do you choose between them? Why do transformers specifically need warmup?
formulate your answer, then —
tldr
- Warmup: start the LR small and ramp linearly to the target. Needed for Adam/AdamW because early moment estimates are noisy and produce large, unstable updates; transformers are especially sensitive since attention and LayerNorm gradients are poorly scaled at initialization.
- Cosine annealing: smooth decay along half a cosine curve from the base LR toward (near) zero. Best general-purpose schedule.
- Step decay: abrupt drops at fixed epochs. Legacy CNN schedule; cosine is usually preferred.
- Linear decay: simple ramp to zero, common for BERT-style fine-tuning.
- Use an LR range test to pick the base LR.
- Rule of thumb: transformers use warmup + cosine or linear decay; CNNs use cosine or step.
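A minimal sketch of linear warmup followed by cosine decay, written as a plain step-to-LR function. The names `lr_at_step`, `warmup_steps`, `total_steps`, and the example numbers are illustrative assumptions, not from the question above.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp: keeps the first updates small while Adam's
        # moment estimates are still unreliable.
        return base_lr * (step + 1) / warmup_steps
    # Progress through the decay phase, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative run: 10k warmup steps out of 100k total, base LR 3e-4.
if __name__ == "__main__":
    for s in (0, 5_000, 10_000, 50_000, 100_000):
        print(s, lr_at_step(s, base_lr=3e-4, warmup_steps=10_000, total_steps=100_000))
```

If you train with PyTorch, the same function (with `base_lr=1.0`) can serve as the multiplier passed to `torch.optim.lr_scheduler.LambdaLR`, which scales the optimizer's base LR by that factor each step.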
follow-up
- How do you set the warmup duration? Is there a principled approach or is it heuristic?
- What is the "1cycle policy" and why does cycling momentum inversely with LR help?
- In distributed training across many GPUs, the effective batch size scales up. How should you adjust the learning rate?