mlprep
MLOps · medium · 12 min

How would you run an A/B test to evaluate whether a new ML model is better than the current one? What makes ML A/B tests different from standard product experiments?

formulate your answer, then —

You mentioned interference effects when users interact — how do you handle experimentation in systems where the model's output for one user affects other users?

formulate your answer, then —

tldr

A/B tests measure business outcomes on live traffic — the authoritative evaluation that offline metrics can't replicate. Randomize at the user level with consistent hashing so each user always sees the same variant, pre-register your primary metric, and size the experiment for sufficient statistical power before launch. ML A/B tests additionally face interference (one user's treatment affecting another user's outcome), novelty effects (early lift that fades as users habituate), and delayed metrics — cluster-based randomization contains interference, but requires more traffic to reach the same power.
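A minimal sketch of the user-level assignment and power sizing described above. The function names and the SHA-256 bucketing scheme are illustrative choices, not any particular platform's API; the sample-size formula is the standard normal approximation for comparing two proportions.

```python
import hashlib
import math
from statistics import NormalDist

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministic user-level assignment via consistent hashing.
    Salting with the experiment name decorrelates assignments across experiments,
    and the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

def sample_size_per_arm(p_control: float, p_treatment: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per arm to detect a lift from p_control to p_treatment
    with a two-sided z-test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return math.ceil(n)
```

Detecting a 10% relative CTR lift (0.10 → 0.11) at the default α and power needs roughly 15k users per arm — which is why the power calculation comes before the launch, not after.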

follow-up

  • How would you handle a situation where your A/B test shows the new model wins on CTR but loses on a long-term engagement metric?
  • What is a holdout group and why might you maintain one permanently in a recommendation system?
  • How do you detect and correct for experiment contamination — users who were exposed to both variants?
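For the interference case raised above, a hedged sketch of cluster-level assignment: hash the cluster identifier (e.g. a geo region or social community) instead of the user id, so interacting users share a variant. The design-effect formula is the standard variance inflation 1 + (m − 1) · ICC; function names are illustrative.

```python
import hashlib

def assign_cluster(cluster_id: str, experiment: str) -> str:
    """Randomize whole clusters so users who interact see the same variant,
    containing interference at the cost of fewer independent units."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest[:8], 16) % 2 == 0 else "control"

def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Variance inflation from cluster randomization: DE = 1 + (m - 1) * ICC.
    Multiply the user-level sample size by this to keep the same power."""
    return 1 + (avg_cluster_size - 1) * icc
```

With clusters of 100 users and an intra-cluster correlation of 0.05, the design effect is 5.95 — you need roughly 6x the user-level sample size, which is the "more traffic to achieve power" trade-off in the tldr.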