Walk me through learning rate scheduling. What are the common strategies, and how do you choose between them? Why do transformers specifically need warmup?
formulate your answer, then —
tldr
- Warmup: start the LR small and ramp linearly to the target. Needed for Adam/AdamW because early moment estimates are noisy and produce large, unstable updates; transformers are especially sensitive since attention and LayerNorm gradients are poorly scaled at initialization.
- Cosine annealing: smooth decay along half a cosine curve from the base LR toward (near) zero. Best general-purpose schedule.
- Step decay: abrupt drops at fixed epochs. Legacy CNN schedule; cosine is usually preferred.
- Linear decay: simple ramp to zero, common for BERT-style fine-tuning.
- Use an LR range test to pick the base LR.
- Rule of thumb: transformers use warmup + cosine or linear decay; CNNs use cosine or step.
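A minimal sketch of linear warmup followed by cosine decay, written as a plain step-to-LR function. The names `lr_at_step`, `warmup_steps`, `total_steps`, and the example numbers are illustrative assumptions, not from the question above.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp: keeps the first updates small while Adam's
        # moment estimates are still unreliable.
        return base_lr * (step + 1) / warmup_steps
    # Progress through the decay phase, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative run: 10k warmup steps out of 100k total, base LR 3e-4.
if __name__ == "__main__":
    for s in (0, 5_000, 10_000, 50_000, 100_000):
        print(s, lr_at_step(s, base_lr=3e-4, warmup_steps=10_000, total_steps=100_000))
```

If you train with PyTorch, the same function (with `base_lr=1.0`) can serve as the multiplier passed to `torch.optim.lr_scheduler.LambdaLR`, which scales the optimizer's base LR by that factor each step.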
follow-up
- How do you set the warmup duration? Is there a principled approach or is it heuristic?
- What is the "1cycle policy" and why does cycling momentum inversely with LR help?
- In distributed training across many GPUs, the effective batch size scales up. How should you adjust the learning rate?