mlprep

Your team runs 20 A/B tests simultaneously, each at α=0.05. Why is this a problem? Walk me through the multiple testing issue and explain Bonferroni correction and FDR control. When would you use each?

formulate your answer, then —

tldr

Running 20 independent tests at α=0.05 gives a ~64% chance of at least one false positive when every null is true (1 − 0.95^20 ≈ 0.64). Bonferroni (reject only if p ≤ α/m) controls the FWER, the probability of any false positive. The Benjamini-Hochberg (BH) procedure controls the FDR, the expected fraction of rejections that are false. Use FWER control when any single false positive is costly; use FDR control when a small share of false positives is acceptable in exchange for more power. Peeking at results mid-experiment is a hidden multiple testing problem; use sequential testing methods to fix it.
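
To make the arithmetic and the two procedures concrete, here is a minimal sketch in plain Python. The function names and the eight p-values are hypothetical, chosen only for illustration; the FWER formula assumes independent tests with every null true, and the BH step-up shown is the standard textbook form (valid under independence or positive dependence), not any particular library's implementation.

```python
def fwer(m, alpha=0.05):
    """Chance of at least one false positive across m independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m


def bonferroni(pvalues, alpha=0.05):
    """Reject H_i only if p_i <= alpha / m (controls FWER)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """BH step-up: find the largest rank k with p_(k) <= (k / m) * alpha,
    then reject the k smallest p-values (controls FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank  # keep the largest rank that passes
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values from 8 tests, for illustration only.
pvals = [0.001, 0.004, 0.012, 0.02, 0.03, 0.2, 0.5, 0.8]

print(f"{fwer(20):.2f}")                # 0.64
print(sum(bonferroni(pvals)))           # 2 rejections
print(sum(benjamini_hochberg(pvals)))   # 5 rejections
```

On these made-up p-values Bonferroni rejects only the two below 0.05/8 = 0.00625, while BH rejects five, because its threshold grows with rank. That gap is the extra power BH buys by tolerating a controlled fraction of false discoveries.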

follow-up

  • Why is BH more powerful than Bonferroni when many of the effects you are testing are real (that is, many null hypotheses are false)?
  • What is the peeking problem in A/B testing and how do sequential testing methods solve it?
  • How would you handle multiple testing when running subgroup analyses on an A/B test result?