How would you run an A/B test to evaluate whether a new ML model is better than the current one? What makes ML A/B tests different from standard product experiments?
You mentioned interference effects when users interact — how do you handle experimentation in systems where the model's output for one user affects other users?
tldr
A/B tests measure business outcomes on live traffic, the authoritative evaluation that offline metrics can't replicate. Randomize at the user level with consistent hashing, pre-register your primary metric, and size the experiment for sufficient statistical power before launch. ML A/B tests face interference effects (when users affect each other), novelty effects, and delayed metrics; cluster-based randomization contains interference but needs more traffic to reach the same power.
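
A minimal sketch of two of the steps above: deterministic user-level assignment via consistent hashing, and a back-of-the-envelope power calculation for a two-proportion test. The function names (`assign_variant`, `required_sample_size`) and the assumption that users carry a stable string `user_id` are illustrative, not from the original.

```python
import hashlib
import math

from scipy.stats import norm


def assign_variant(user_id: str, experiment_name: str,
                   treatment_fraction: float = 0.5) -> str:
    """Deterministically assign a user to a variant via consistent hashing.

    Hashing user_id together with the experiment name keeps assignments
    stable across sessions while staying independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) / 16 ** len(digest)  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"


def required_sample_size(p_control: float, min_relative_lift: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-arm sample size for a two-proportion z-test."""
    p_treatment = p_control * (1 + min_relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = norm.ppf(power)           # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_treatment - p_control) ** 2
    return math.ceil(n)


# The same user always lands in the same bucket for a given experiment.
print(assign_variant("user_123", "new_ranker_v2"))

# Example: baseline CTR of 5%, detect a 2% relative lift at 80% power.
print(required_sample_size(0.05, 0.02))
```

Hashing the experiment name into the key is one common way to keep bucket assignments uncorrelated across concurrent experiments; the sample-size formula is the standard normal approximation and should be treated as a planning estimate, not an exact requirement.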
follow-up
- How would you handle a situation where your A/B test shows the new model wins on CTR but loses on a long-term engagement metric?
- What is a holdout group and why might you maintain one permanently in a recommendation system?
- How do you detect and correct for experiment contamination — users who were exposed to both variants?