mlprep

Walk me through the most common mistakes teams make when running A/B tests in production. What are peeking, the multiple comparisons problem, and underpowered tests — and what are the practical fixes for each?

formulate your answer, then —

tldr

The three big pitfalls and their practical fixes:

  • Peeking: repeatedly checking the p-value as data arrives inflates the false positive rate well above the nominal 5%. Fix: pre-commit to a sample size and test only once it's reached, or use a sequential testing method designed for continuous monitoring.
  • Multiple comparisons: testing many metrics or variants makes a spurious "win" likely by chance alone. Fix: pre-specify one primary decision metric, and apply an FDR correction (e.g. Benjamini-Hochberg) to secondary metrics.
  • Underpowered tests: a sample too small to reliably detect the effect you care about. Fix: run a power analysis before launching to calculate the required sample size for your minimum detectable effect.

Don't ship based on non-significant results from small tests: an underpowered test proves nothing either way.
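The peeking inflation is easy to demonstrate with a small simulation: a hypothetical A/A test (both arms draw from the same distribution, so every "significant" result is a false positive) where the experimenter runs a z-test at every checkpoint and stops on the first significance. All parameter values here are illustrative.

```python
# Sketch: Monte Carlo demo of peeking inflating the false positive rate.
# Both arms are identical (an A/A test), so any rejection is spurious.
import random
from statistics import NormalDist

def false_positive_rate(trials=500, n=1000, peek_every=50, peeking=True, seed=0):
    """Fraction of A/A experiments declared 'significant' at p < 0.05."""
    rng = random.Random(seed)
    norm = NormalDist()
    false_positives = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0  # running sums for each arm
        for i in range(1, n + 1):
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            # A peeker tests at every checkpoint; a disciplined
            # experimenter tests only once, at the final sample size.
            check = (i % peek_every == 0) if peeking else (i == n)
            if check:
                z = (sum_a - sum_b) / (2 * i) ** 0.5  # unit-variance z-test
                if 2 * (1 - norm.cdf(abs(z))) < 0.05:
                    false_positives += 1
                    break  # the peeker stops and "ships" here
    return false_positives / trials

print(false_positive_rate(peeking=False))  # close to the nominal 0.05
print(false_positive_rate(peeking=True))   # inflated well above 0.05
```

With 20 interim looks, the realized false positive rate is several times the nominal 5%, which is exactly why sequential methods with adjusted boundaries exist.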
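For the multiple-comparisons fix, here is a minimal sketch of the Benjamini-Hochberg step-up procedure; the p-values at the bottom are made up for illustration.

```python
# Sketch: Benjamini-Hochberg FDR correction for secondary metrics.
def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of hypotheses rejected at the given FDR level."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            k = rank
    # Step-up rule: reject every hypothesis up to rank k.
    return sorted(order[:k])

p = [0.001, 0.013, 0.021, 0.024, 0.20, 0.74]  # illustrative p-values
print(benjamini_hochberg(p))  # → [0, 1, 2, 3]
```

Note the step-up behavior: 0.024 alone would not survive a Bonferroni cut (0.05/6 ≈ 0.0083), but BH rejects it because enough smaller p-values clear their thresholds, which is why BH retains more power across many secondary metrics.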
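And for sizing the test up front, a back-of-envelope sample size calculation for a two-proportion z-test, using the standard normal-approximation formula and only the standard library; the baseline rate and lift below are illustrative.

```python
# Sketch: required sample size per arm for a two-proportion z-test.
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.8):
    """Approximate n per arm to detect an absolute lift of `mde`
    over conversion rate `baseline` with a two-sided test."""
    treated = baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = baseline * (1 - baseline) + treated * (1 - treated)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Roughly 15k users per arm to detect a 1-point absolute lift
# on a 10% baseline at alpha=0.05 and 80% power.
print(sample_size_per_arm(0.10, 0.01))
```

The quadratic dependence on the minimum detectable effect is the key talking point: halving the lift you want to detect roughly quadruples the required sample, which is why tiny expected effects make tests so expensive.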

follow-up

  • What is the difference between statistical significance and practical significance? How would you communicate this to a PM who wants to ship based on a 0.001% lift?
  • How do you handle experiment holdout groups — users who are permanently excluded from all experiments? What are the benefits and costs?
  • Explain the difference between a t-test and a Mann-Whitney U test for analyzing A/B test results. When would you choose the non-parametric option?