Walk me through how you'd design and interpret an A/B test for a new ranking model. What are the common failure modes?
formulate your answer, then —
tldr
A/B tests for ML models require user-level randomization, a pre-specified primary metric plus guardrail metrics, sufficient statistical power, and a fixed runtime that covers whole weeks so weekly seasonality averages out. A flat result is ambiguous: check that the treatment actually changed what users saw, that the experiment had enough power to detect the expected effect, and that no confounders were present. Always check for sample ratio mismatch before interpreting results.
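Two of the checks above reduce to short calculations. The following is a minimal sketch, not a definitive implementation: the function names, the 50/50 default split, and the defaults alpha=0.05 and power=0.8 are illustrative assumptions. It shows a chi-square test for sample ratio mismatch on two arms, and the usual normal-approximation sample size for comparing two conversion rates.

```python
import math
from statistics import NormalDist


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for the observed arm sizes.

    With two arms there is one degree of freedom, so the chi-square
    survival function equals erfc(sqrt(chi2 / 2)).
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    return math.erfc(math.sqrt(chi2 / 2))


def users_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users per arm to detect a relative lift in a rate.

    Two-sided test, equal arm sizes, normal approximation to the
    difference of two proportions (an assumption; other metrics need
    their own variance estimate).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical counts: a 50,000 vs 51,000 split under an intended 50/50.
    print("SRM p-value:", srm_p_value(50_000, 51_000))
    # Hypothetical target: 10% baseline rate, 5% relative lift.
    print("Users per arm:", users_per_arm(0.10, 0.05))
```

A very small SRM p-value (teams often use a strict threshold such as 0.001, which is a convention rather than a rule) suggests the assignment or logging pipeline is broken, so the metric comparison should not be trusted until the cause is found. The sample size, divided by expected daily traffic per arm and rounded up to whole weeks, gives the runtime to fix before launch.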
follow-up
- What is sample ratio mismatch and why does it invalidate an experiment?
- Your experiment ran for two weeks and shows p=0.08 on the primary metric. Do you ship?
- How do you handle experiments where the label takes 30 days to observe but you need a decision in two weeks?