Two models differ by 0.3% AUC or 1% NDCG offline. How do you decide whether that difference is real?
formulate your answer, then —
tldr
Use confidence intervals to separate real metric movement from noise. Bootstrap the paired per-unit difference between the two models, resampling at the correct independence unit: examples for classification, queries/users/sessions for ranking. A significant interval is not the whole story, so also check practical significance, per-segment behavior, and evaluation bias.
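A minimal sketch of the paired bootstrap described above, assuming you already have one metric value per resampling unit for each model (e.g. per-query NDCG arrays `ndcg_a`, `ndcg_b`; these names and the helper are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(per_unit_a, per_unit_b, n_boot=10_000, alpha=0.05):
    """CI for mean(metric_a - metric_b).

    Resamples whole units (queries/users/examples) with replacement,
    keeping each unit's A and B scores paired so model-to-model
    correlation is preserved.
    """
    diffs = np.asarray(per_unit_a) - np.asarray(per_unit_b)
    n = len(diffs)
    # Each bootstrap replicate draws n units with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (lo, hi)

# Hypothetical per-query NDCG for 500 queries; model A ~1 point better.
ndcg_b = rng.normal(0.62, 0.10, size=500).clip(0, 1)
ndcg_a = (ndcg_b + 0.01 + rng.normal(0, 0.005, size=500)).clip(0, 1)

mean_diff, (lo, hi) = paired_bootstrap_ci(ndcg_a, ndcg_b)
# If the interval excludes 0, the difference is unlikely to be noise.
```

Pairing matters: resampling each model's scores independently inflates the variance, because the two models are evaluated on the same queries and their errors are correlated.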
follow-up
- Why should NDCG bootstrap by query instead of item?
- What is the difference between statistical and practical significance?
- Why can a tight confidence interval still be misleading?