mlprep

Your PM asks: "How long do we need to run this A/B test?" Walk me through how you'd calculate the required sample size. What inputs do you need, and what happens if you get the inputs wrong?


tldr

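Sample size depends on α (false positive tolerance), power 1−β (false negative tolerance), the MDE (the smallest effect worth detecting, which is a business decision), and baseline variance. For a two-sided α=0.05, 80% power, and a 10% baseline rate, detecting a 1pp absolute lift needs ~14.7k users per variant. To answer "how long", divide the total across variants by eligible daily traffic and round up to whole weeks to cover weekly cycles. Overestimating the MDE is the most common mistake: required sample scales with 1/MDE², so the test ends up underpowered and misses smaller real effects. CUPED can reduce the required sample by 40-70% using pre-experiment data, depending on how strongly the pre-period metric correlates with the in-test metric. Never stop early based on significance: pre-commit to the sample size or use sequential testing.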
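The arithmetic behind that number can be sketched in a few lines. This is a minimal sketch, assuming a two-sided two-proportion z-test with unpooled variance and equal allocation. The function names and the 5,000 users/day traffic figure are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Users per variant for a two-sided, two-proportion z-test.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / mde_abs^2
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs                   # rate under the smallest effect we care about
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)   # Bernoulli variance of each arm
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / mde_abs ** 2)


def duration_days(n_per_variant, daily_users, n_variants=2):
    """Days of eligible traffic needed, rounded up to whole weeks."""
    days = math.ceil(n_per_variant * n_variants / daily_users)
    return math.ceil(days / 7) * 7


n = sample_size_per_variant(baseline_rate=0.10, mde_abs=0.01)
print(n)                                       # roughly 14.7k per variant
print(sample_size_per_variant(0.10, 0.005))    # halving the MDE needs roughly 4x the users
print(duration_days(n, daily_users=5_000))     # hypothetical traffic of 5,000 eligible users/day
```

The second call shows why a wrong MDE hurts so much: if the true effect is half of what was assumed, the test was sized for about a quarter of the users it needed.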

follow-up

  • How does CUPED reduce variance, and why does it require pre-experiment data for the same users?
  • What is the difference between practical significance (MDE) and statistical significance, and why do both matter?
  • How would you calculate sample size for a metric like revenue per user, which has very high variance and a heavy tail?
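
As a starting point for the CUPED follow-up, here is a minimal sketch on simulated data. It assumes each user has a pre-experiment value x of the same metric and an in-test value y. The data-generating numbers and the helper name `cuped_adjust` are hypothetical. The adjustment subtracts the part of y that x already predicts, so variance drops by about ρ² (the squared correlation between x and y), and required sample size drops by the same fraction.

```python
import random
from statistics import fmean, variance

random.seed(0)

# Simulated users (assumed data): x is the pre-experiment metric,
# y is the same metric measured during the test.
x = [random.gauss(10, 3) for _ in range(10_000)]
y = [0.8 * xi + random.gauss(2, 2) for xi in x]


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x)."""
    mx, my = fmean(x), fmean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    # Centering x keeps the mean of the adjusted metric equal to the mean of y.
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


y_adj = cuped_adjust(y, x)
reduction = 1 - variance(y_adj) / variance(y)
print(f"variance reduction: {reduction:.0%}")
```

In a real experiment theta would typically be estimated on data pooled across both arms, and x must come from before assignment so the treatment cannot have influenced it. That is why the same users need pre-experiment history.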