Your team runs 20 A/B tests simultaneously, each at α=0.05. Why is this a problem? Walk me through the multiple testing issue and explain Bonferroni correction and FDR control. When would you use each?
formulate your answer, then —
tldr
With 20 independent tests at α=0.05 and every null hypothesis true, the chance of at least one false positive is 1 − 0.95^20 ≈ 64%. Bonferroni (reject only when p ≤ α/m) controls the FWER, the probability of any false positive. The Benjamini-Hochberg (BH) procedure controls the FDR, the expected fraction of rejections that are false. Use FWER control when a single false positive is costly; use FDR control when you are screening many hypotheses and can tolerate a small share of false discoveries. Peeking at results mid-experiment is a hidden multiple testing problem; use sequential testing methods to fix it.
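To make the arithmetic concrete, here is a minimal Python sketch of the 64% figure, the Bonferroni threshold, and the BH step-up rule. The function names and the 20 p-values are hypothetical, chosen only for illustration; the family-wise error formula assumes independent tests with every null hypothesis true.

```python
def fwer_independent(m, alpha=0.05):
    """P(at least one false positive) over m independent tests, all nulls true."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i only when p_i <= alpha / m (controls FWER)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: find the largest rank k with p_(k) <= k * q / m,
    then reject the k smallest p-values (controls FDR at level q)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values from 20 simultaneous A/B tests (illustrative only).
p_values = [
    0.0001, 0.0008, 0.0021, 0.0060, 0.0095,
    0.0110, 0.0190, 0.0300, 0.0410, 0.0600,
    0.1200, 0.1800, 0.2500, 0.3300, 0.4100,
    0.5200, 0.6400, 0.7500, 0.8600, 0.9700,
]

naive_count = sum(p <= 0.05 for p in p_values)
bonferroni_count = sum(bonferroni(p_values))
bh_count = sum(benjamini_hochberg(p_values))

print(f"FWER for 20 tests: {fwer_independent(20):.3f}")
print(f"Uncorrected rejections: {naive_count}")
print(f"Bonferroni rejections:  {bonferroni_count}")
print(f"BH rejections:          {bh_count}")
```

On this made-up data, BH rejects more hypotheses than Bonferroni and fewer than the uncorrected threshold, which is the usual ordering: BH's cutoff grows with rank, while Bonferroni applies the strictest cutoff to every test.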
follow-up
- Why is BH more powerful than Bonferroni when many of the effects you are testing are real (that is, many null hypotheses are false)?
- What is the peeking problem in A/B testing and how do sequential testing methods solve it?
- How would you handle multiple testing when running subgroup analyses on an A/B test result?