Question

Your team runs 20 A/B tests simultaneously, each at α=0.05. Why is this a problem? Walk me through the multiple testing issue and explain Bonferroni correction and FDR control. When would you use each?

Accepted Answer

The problem

At α=0.05, you accept a 5% chance of a false positive per test. With 20 independent tests, all under the null:

P(at least one false positive) = 1 - (1 - 0.05)^20 = 1 - 0.95^20 ≈ 0.64

A 64% chance of at least one spurious "significant" result. The more hypotheses you test, the more likely you are to find something significant purely by chance. This is the multiple testing problem.
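To see how quickly this grows with the number of tests, here is a minimal sketch of the formula above. The function name `fwer_independent` is made up for illustration, and the calculation assumes independent tests with every null hypothesis true.

```python
def fwer_independent(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent tests.

    Assumes every null hypothesis is true and each test is run at level alpha.
    """
    return 1.0 - (1.0 - alpha) ** m


# Print the family-wise error rate for a few family sizes.
for m in (1, 5, 20, 100):
    print(f"m={m:>3}: P(at least one false positive) = {fwer_independent(m):.3f}")
```

For m=20 this reproduces the ≈0.64 figure above.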

This shows up everywhere in ML:
- Running 20 A/B tests simultaneously
- Testing 1000 features for significance in feature selection
- Evaluating a model on 50 subgroups
- Comparing 10 hyperparameter configurations

Family-wise error rate (FWER)

FWER = P(at least one false positive) across all tests. Bonferroni controls FWER ≤ α.

Bonferroni correction

Divide α by the number of tests m. Reject H_i if p_i ≤ α/m.

For 20 tests at α=0.05: reject if p_i ≤ 0.05/20 = 0.0025.

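Why it works: by the union bound, P(any false positive) ≤ Σ P(false positive for test i) ≤ m × (α/m) = α. The bound needs no independence assumption, so Bonferroni is valid under any dependence between tests.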
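As a minimal sketch of the rule, assuming the p-values are already collected in a list (the values below are invented for illustration, not real experiment results):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True if H_i is rejected at p_i <= alpha/m."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values from 20 tests (illustrative only).
p_values = [0.001, 0.002, 0.003, 0.04] + [0.5] * 16

decisions = bonferroni_reject(p_values)
print("per-test threshold:", 0.05 / len(p_values))
print("number rejected:", sum(decisions))
```

With these made-up inputs only the p-values below 0.0025 survive; 0.003 and 0.04 would have been "significant" at the uncorrected 0.05 level but are not rejected.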

Cost: very conservative. The per-test threshold shrinks as m grows, so each test needs a much larger sample to keep the same power. If many of the tested hypotheses are real effects, Bonferroni will miss a large share of them.

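Holm-Bonferroni is a uniformly more powerful variant that still controls FWER ≤ α: sort the p-values in ascending order and compare p_(k) to α/(m-k+1), stopping at the first one that fails. The thresholds loosen as k grows, so it rejects everything Bonferroni rejects, and often more.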
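A minimal sketch of the Holm step-down procedure, using the standard thresholds α/m, α/(m-1), ..., α on the sorted p-values. The function name and the p-values are invented for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: return one boolean per test (True = reject)."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        # Thresholds loosen step by step: alpha/m, alpha/(m-1), ..., alpha.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # the first failure stops the procedure
    return reject


# Hypothetical p-values from 20 tests (illustrative only).
p_values = [0.001, 0.002, 0.0026, 0.04] + [0.5] * 16

holm = holm_reject(p_values)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
print("Bonferroni rejects:", sum(bonferroni))
print("Holm rejects:", sum(holm))
```

In this made-up example 0.0026 misses the Bonferroni cutoff of 0.0025 but passes Holm's third threshold of 0.05/18, so Holm rejects one more hypothesis at the same FWER level.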

False Discovery Rate (FDR)

FDR = E[false positives / total rejections], with the ratio taken as 0 when nothing is rejected. It is the expected fraction of your "significant" findings that are wrong.

FDR is a weaker (less stringent) criterion than FWER: any procedure that controls FWER at α also controls FDR at α, but not the other way around. You accept some false positives, but control their proportion.

Benjamini-Hochberg (BH) procedure controls FDR ≤ q:
1. Sort p-values: p_(1) ≤ p_(2) ≤ ... ≤ p_(m)
2. Find largest k such that p_(k) ≤ q·k/m
3. Reject all H_(1), ..., H_(k)
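The three steps above can be sketched as follows. This is a minimal illustration with a made-up function name and invented p-values, not a reference implementation.

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """BH step-up procedure: return one boolean per test (True = reject)."""
    m = len(p_values)
    # Step 1: order the tests from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 2: find the largest rank k with p_(k) <= q * k / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_max = rank
    # Step 3: reject the k_max hypotheses with the smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values from 20 tests (illustrative only).
p_values = [0.001, 0.006, 0.007, 0.009, 0.04] + [0.5] * 15

decisions = benjamini_hochberg_reject(p_values)
print("BH rejects:", sum(decisions))
```

Note the step-up behavior in this made-up example: 0.006 fails its own threshold (0.05·2/20 = 0.005), but it is still rejected because a larger p-value further down the list, 0.009 at rank 4, passes its threshold of 0.01.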

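At q=0.05, BH guarantees that, in expectation, at most 5% of your rejected hypotheses are false positives. The guarantee holds when the tests are independent or positively dependent; under arbitrary dependence, the more conservative Benjamini-Yekutieli variant applies.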

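Example: 100 tests, q=0.05. If BH rejects 20, roughly 1 or fewer of those 20 is expected to be a false positive. This is an average over repeated experiments, not a guarantee for any single batch.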

FWER vs FDR: when to use which

| Context | Method | Reason |
|---|---|---|
| Clinical trial (one treatment decision) | Bonferroni/FWER | One false positive = wrong treatment |
| Genome-wide association (GWAS, 1M SNPs) | Bonferroni | Very high multiple comparison burden; the conventional 5×10⁻⁸ threshold is 0.05/10⁶ |
| Exploratory feature selection | BH/FDR | False positives caught by downstream validation |
| Simultaneous A/B tests | BH/FDR | Some false positives acceptable; power matters |
| Subgroup analysis in a product experiment | Bonferroni | High-stakes decisions per subgroup |

Practical guideline

If a false positive leads directly to a costly decision (ship a product, treat a patient, drop a feature permanently), control FWER. If false positives will be filtered by a subsequent experiment or analysis, control FDR.

Within a single A/B test: the peeking problem

Multiple testing also appears within a single test: checking significance daily and stopping as soon as p < 0.05 inflates the false positive rate well beyond α, because every look is another chance to cross the threshold. Fix: use sequential testing methods (always-valid p-values, mSPRT), or pre-commit to a fixed sample size and test once at the end.
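A small simulation makes the inflation visible. The sketch below runs A/A tests (no true effect) and compares "reject if any daily look is significant" against "test once at the end". All parameters (20 days, 20 users per arm per day, unit-variance normal outcomes, a z-test with known variance) are assumptions chosen for illustration.

```python
import math
import random

Z_CRIT = 1.959964  # two-sided critical value for alpha = 0.05


def simulate_aa_test(rng, n_days=20, n_per_day=20):
    """Simulate one A/A test; return (rejected_at_any_look, rejected_at_end)."""
    sum_a = sum_b = 0.0
    n = 0
    rejected_at_any_look = False
    z = 0.0
    for _ in range(n_days):
        for _ in range(n_per_day):
            sum_a += rng.gauss(0.0, 1.0)  # arm A outcome, no true effect
            sum_b += rng.gauss(0.0, 1.0)  # arm B outcome, same distribution
        n += n_per_day
        # z-statistic for the difference in means, known variance 1 per arm.
        z = (sum_a / n - sum_b / n) / math.sqrt(2.0 / n)
        # A peeker would stop here; we keep going only to record the final look too.
        if abs(z) > Z_CRIT:
            rejected_at_any_look = True
    return rejected_at_any_look, abs(z) > Z_CRIT


def false_positive_rates(n_sims=1000, seed=0):
    """Estimate false positive rates for daily peeking vs. a single final test."""
    rng = random.Random(seed)
    peek = fixed = 0
    for _ in range(n_sims):
        any_look, final_look = simulate_aa_test(rng)
        peek += any_look
        fixed += final_look
    return peek / n_sims, fixed / n_sims


peek_rate, fixed_rate = false_positive_rates()
print(f"false positive rate with daily peeking: {peek_rate:.3f}")
print(f"false positive rate with one final test: {fixed_rate:.3f}")
```

Under these assumptions the single final test should land near the nominal 5%, while the peeking rule should come out several times higher.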