mlprep

Two models differ by 0.3% AUC or 1% NDCG offline. How do you decide whether that difference is real?

tldr

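Use confidence intervals to separate real metric movement from noise. Score both models on the same evaluation set and bootstrap the paired difference between them, rather than comparing two separately computed intervals; if the interval on the difference excludes zero, the gap is unlikely to be noise. Resample at the unit that is actually independent: examples for classification, queries/users/sessions for ranking. Also check practical significance (is the gain large enough to matter?), segment behavior (does the gain hold across important slices?), and evaluation bias (does the offline set reflect the traffic the model will see?).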
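A minimal sketch of the paired bootstrap for a ranking metric, assuming you already have one NDCG value per query for each model, aligned by query. The function name `paired_bootstrap_ci`, its parameters, and the synthetic data are illustrative assumptions, not part of any library.

```python
import numpy as np

def paired_bootstrap_ci(metric_a, metric_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(metric_b - metric_a).

    metric_a, metric_b: per-unit metric values (e.g. per-query NDCG),
    aligned so that index i refers to the same query for both models.
    Returns (observed mean difference, CI lower bound, CI upper bound).
    """
    a = np.asarray(metric_a, dtype=float)
    b = np.asarray(metric_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two aligned 1-D arrays of equal length")

    diffs = b - a  # pairing: same query scored by both models
    n = diffs.size
    rng = np.random.default_rng(seed)

    # Resample whole queries with replacement, n_boot times.
    # Builds an (n_boot, n) index matrix; shrink n_boot or loop if n is huge.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)

    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lo, hi


# Synthetic per-query NDCG values, for illustration only.
demo_rng = np.random.default_rng(1)
ndcg_a = demo_rng.uniform(0.3, 0.8, size=500)
ndcg_b = np.clip(ndcg_a + demo_rng.normal(0.01, 0.05, size=500), 0.0, 1.0)

delta, lo, hi = paired_bootstrap_ci(ndcg_a, ndcg_b)
print(f"mean NDCG difference = {delta:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

This version works because mean NDCG is an average of per-query values, so resampling queries is the same as resampling per-query differences. AUC is not an average of per-example scores, so for classification the same idea would instead resample example indices and recompute both models' AUC on each resample before taking the difference.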

follow-up

  • Why should NDCG bootstrap by query instead of item?
  • What is the difference between statistical and practical significance?
  • Why can a tight confidence interval still be misleading?