Explain gradient clipping. What problem does it solve, and how can it hide deeper training issues?
Formulate your own answer first, then compare:
tldr
Gradient clipping caps the size of gradient updates, most commonly by rescaling gradients when their global norm exceeds a threshold. It stabilizes training for recurrent/sequence models and large networks, but if clipping triggers frequently it can signal a learning rate that is too high, outlier data, poor initialization, or mixed-precision/loss-scaling issues.
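
A minimal PyTorch sketch of global-norm clipping, with a hypothetical toy model and random data standing in for a real training loop. The real API here is `torch.nn.utils.clip_grad_norm_`, which rescales gradients in place and returns the pre-clip norm, so you can also track how often clipping fires:

```python
import torch

# Hypothetical toy setup; substitute your own model, optimizer, and data.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
max_norm = 1.0

for step in range(100):
    x, y = torch.randn(32, 16), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    # Rescales all grads so their global L2 norm is at most max_norm,
    # and returns the norm measured *before* clipping.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

    # If this fires on most steps, clipping may be hiding a deeper problem
    # (learning rate too high, bad init, outlier batches, loss-scaling issues).
    if total_norm > max_norm:
        print(f"step {step}: clipped (pre-clip norm {total_norm:.2f})")

    optimizer.step()
```

Logging the fraction of steps that get clipped is a cheap way to tell whether clipping is an occasional safety net or a constant crutch.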
follow-up
- Why is global norm clipping usually preferred over value clipping?
- Should clipping happen before or after gradient accumulation?
- What would you monitor to know whether clipping is masking a training problem?