Two models differ by 0.3% AUC or 1% NDCG offline. How do you decide whether that difference is real?
formulate your answer, then —
tldr
Use confidence intervals to separate real metric movement from noise. Bootstrap the paired per-unit difference between the two models, resampling at the correct independence unit: examples for classification, queries/users/sessions for ranking. A significant interval is not the whole story, so also check practical significance, per-segment behavior, and evaluation bias.
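A minimal sketch of the paired bootstrap described above, assuming you already have one metric value per resampling unit for each model (e.g. per-query NDCG arrays `ndcg_a`, `ndcg_b`; these names and the helper are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(per_unit_a, per_unit_b, n_boot=10_000, alpha=0.05):
    """CI for mean(metric_a - metric_b).

    Resamples whole units (queries/users/examples) with replacement,
    keeping each unit's A and B scores paired so model-to-model
    correlation is preserved.
    """
    diffs = np.asarray(per_unit_a) - np.asarray(per_unit_b)
    n = len(diffs)
    # Each bootstrap replicate draws n units with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (lo, hi)

# Hypothetical per-query NDCG for 500 queries; model A ~1 point better.
ndcg_b = rng.normal(0.62, 0.10, size=500).clip(0, 1)
ndcg_a = (ndcg_b + 0.01 + rng.normal(0, 0.005, size=500)).clip(0, 1)

mean_diff, (lo, hi) = paired_bootstrap_ci(ndcg_a, ndcg_b)
# If the interval excludes 0, the difference is unlikely to be noise.
```

Pairing matters: resampling each model's scores independently inflates the variance, because the two models are evaluated on the same queries and their errors are correlated.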
follow-up
- Why should NDCG bootstrap by query instead of item?
- What is the difference between statistical and practical significance?
- Why can a tight confidence interval still be misleading?