Question

Your PM asks: "How long do we need to run this A/B test?" Walk me through how you'd calculate the required sample size. What inputs do you need, and what happens if you get the inputs wrong?

Accepted Answer

Four parameters that determine sample size

1. α (significance level): acceptable false positive rate. Typically 0.05.
2. β (Type II error rate): acceptable false negative rate. Power = 1-β. Typically target 80% or 90% power.
3. MDE (Minimum Detectable Effect): smallest effect size you care about detecting. This is a business decision.
4. Baseline metric and variance: current conversion rate (or metric mean and variance).

Formula for two-proportion z-test

For comparing two proportions p_control and p_treatment = p_control + δ:

n = (z_α/2 + z_β)² · [p_c(1-p_c) + p_t(1-p_t)] / δ²

For α=0.05 (two-tailed): z_α/2 = 1.96
For 80% power (β=0.20): z_β = 0.84
For 90% power (β=0.10): z_β = 1.28
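
As a rough check, the formula can be evaluated in a few lines of Python. This is a minimal sketch using only the standard library; the function name sample_size_two_proportions, its parameters, and the example inputs are illustrative assumptions. It uses exact normal quantiles, so results differ slightly from hand calculations with the rounded 1.96 and 0.84.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p_control, delta, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion z-test (unpooled variance)."""
    p_treatment = p_control + delta
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)


# Hypothetical inputs: 5% baseline, 0.5 percentage point absolute lift.
n_per_variant = sample_size_two_proportions(p_control=0.05, delta=0.005)
print(n_per_variant)
```

The result is per variant; double it for a two-arm test.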

Worked example

Baseline conversion rate: p_c = 0.10. You want to detect a 1 percentage point lift (MDE = δ = 0.01, so p_t = 0.11). α=0.05, 80% power.

n = (1.96 + 0.84)² · [0.10·0.90 + 0.11·0.89] / 0.01²
  = 7.84 · [0.09 + 0.0979] / 0.0001
  = 7.84 · 0.1879 / 0.0001
  ≈ 14,700 per variant

About 29,500 total users needed. At 50k users/day, that is under one day of traffic (roughly 0.6 days), so sample size is not the binding constraint here.

Key levers: how to reduce sample size

Increase MDE: only detect larger effects. Costs you: you'll miss small but real improvements.

Increase α: accept more false positives. Uncommon but appropriate for low-stakes exploratory tests.

Reduce variance: use CUPED (Controlled-experiment Using Pre-Experiment Data). Regress the in-experiment metric on the same users' pre-experiment metric and analyze the adjusted values. Variance reductions of 50-70% can be reached when the two are strongly correlated, and the required sample size falls by the same proportion.

One-tailed test: if you only care about improvement (not degradation). Reduces n by ~20%. Use with caution.
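
To get a feel for how much each lever buys, the sketch below compares them through the (z_α/2 + z_β)² multiplier. The helper name z_multiplier is made up for illustration, and the MDE line assumes the variance term stays roughly fixed when the MDE changes.

```python
from statistics import NormalDist


def z_multiplier(alpha=0.05, power=0.80, two_sided=True):
    """(z_alpha + z_beta)^2, the part of the formula driven by alpha and power."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2


base = z_multiplier()
one_tailed_saving = 1 - z_multiplier(two_sided=False) / base  # about 0.21
looser_alpha_saving = 1 - z_multiplier(alpha=0.10) / base     # same critical value, same saving
double_mde_saving = 1 - 1 / 2 ** 2                            # n scales with 1/MDE^2

print(f"one-tailed: {one_tailed_saving:.0%}, alpha=0.10: {looser_alpha_saving:.0%}, "
      f"2x MDE: {double_mde_saving:.0%}")
```

A one-tailed test at α=0.05 and a two-tailed test at α=0.10 share the same critical value, which is why the two savings match. Doubling the MDE is by far the strongest lever.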

What goes wrong with bad inputs

Overestimating the effect (optimism bias): the most common mistake. If you size the test for a 10% lift but the true lift is 2%, the test is wildly underpowered: n scales with 1/δ², so an effect 5× smaller needs 25× the sample. You'll see p=0.3 and "fail to detect" an effect that's real.
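
The cost of optimism can be quantified by computing the power the test actually has under the true effect. The sketch below uses the normal approximation; the function name power_two_proportions and the scenario numbers (5% baseline, relative lifts of 10% planned versus 2% true) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def power_two_proportions(p_control, delta, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test for a true lift delta."""
    p_treatment = p_control + delta
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    se = math.sqrt(variance_sum / n_per_variant)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = delta / se
    # Probability the z statistic lands beyond either critical value.
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


# Hypothetical scenario: 5% baseline, test sized for a 10% relative lift
# (delta = 0.005), but the true lift is only 2% relative (delta = 0.001).
n = 31_250  # roughly the per-variant n giving 80% power for delta = 0.005
planned_power = power_two_proportions(0.05, 0.005, n)
actual_power = power_two_proportions(0.05, 0.001, n)
print(f"planned: {planned_power:.2f}, actual: {actual_power:.2f}")
```

Under these assumptions the planned power is about 80%, while the power against the true effect drops below 10%, so a null result says almost nothing about whether the change works.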

Underestimating variance: metrics like revenue have high variance and heavy tails. A variance estimate from a small sample can badly underestimate the true variance because it may miss the rare high-value users who drive it. Run a pilot or use historical data.
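
For a continuous metric such as revenue per user, a common counterpart to the proportion formula is n = 2·(z_α/2 + z_β)²·σ²/δ² per variant. That formula is a standard textbook result rather than something stated above, and the numbers in this sketch are invented. It shows how directly an underestimated σ turns into an undersized test.

```python
import math
from statistics import NormalDist


def sample_size_means(sigma, delta, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-sample z-test on means, equal variances."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * z_sum ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical revenue metric: we want to detect a 0.50 lift per user.
n_assumed = sample_size_means(sigma=20.0, delta=0.5)  # sigma guessed from a small sample
n_actual = sample_size_means(sigma=40.0, delta=0.5)   # sigma once heavy-tail users appear
print(n_assumed, n_actual, round(n_actual / n_assumed, 2))
```

Because n grows with σ², underestimating the standard deviation by 2× leaves the test with about a quarter of the users it needs.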

Ignoring dependence between users: if users' outcomes aren't independent (e.g., social network effects, household-level experiments), standard error formulas underestimate the true variance, which leads to false positives.

CUPED: variance reduction without more users

CUPED (Deng et al., 2013) was developed at Microsoft and is now standard at Airbnb, Netflix, Booking.com.

Y_cuped = Y - θ · (X - E[X])

Where X is the pre-experiment value of the metric (same users, before the test) and θ = Cov(Y, X) / Var(X).

Y_cuped has the same mean as Y but lower variance because X captures user-level baseline differences: Var(Y_cuped) = Var(Y) · (1 - ρ²), where ρ is the correlation between X and Y. With 60% variance reduction, you need 40% of the original sample size, a 2.5× speedup.
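
The adjustment itself is only a few lines. Below is a minimal standard-library sketch on synthetic data; the function name cuped_adjust and the data-generating process are assumptions for illustration, E[X] is estimated by the sample mean, and in a real experiment θ would normally be estimated once on data pooled across both variants.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return Y_cuped = Y - theta * (X - mean(X)), with theta = Cov(Y, X) / Var(X)."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov_yx = sum((yi - mean_y) * (xi - mean_x) for yi, xi in zip(y, x)) / (len(y) - 1)
    theta = cov_yx / variance(x)
    return [yi - theta * (xi - mean_x) for yi, xi in zip(y, x)]


# Synthetic data (an assumption for illustration): the in-experiment metric y
# is the user's pre-experiment value x plus independent noise.
random.seed(0)
x = [random.gauss(10, 2) for _ in range(5000)]
y = [xi + random.gauss(0, 1) for xi in x]

y_cuped = cuped_adjust(y, x)
reduction = 1 - variance(y_cuped) / variance(y)
print(f"mean before: {fmean(y):.3f}, after: {fmean(y_cuped):.3f}")
print(f"variance reduction: {reduction:.0%}")
```

In this toy setup X explains most of the variance in Y, so the reduction is large; with a weakly correlated pre-period metric the gain shrinks accordingly.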

Running time vs sample size

Sample size is a user count, not a time budget. Convert:
days = n_total / (daily_active_users · randomization_rate)
where randomization_rate is the fraction of daily users enrolled in the experiment.
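
A small helper makes the conversion explicit. This is a sketch with hypothetical numbers; the function name days_to_run is made up, and it assumes every day brings that many new, not-yet-enrolled users, which overstates enrollment when the same users return daily.

```python
import math


def days_to_run(n_total, daily_active_users, randomization_rate):
    """Whole days needed to enroll n_total users at the given traffic allocation."""
    users_per_day = daily_active_users * randomization_rate
    return math.ceil(n_total / users_per_day)


# Hypothetical numbers: 200k users needed, 50k users/day, half of traffic enrolled.
print(days_to_run(n_total=200_000, daily_active_users=50_000, randomization_rate=0.5))  # 8
```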

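But never stop early just because the result crossed significance: this is peeking, a form of multiple testing that pushes the false positive rate above α. Pre-commit to your sample size or use a sequential testing method designed for repeated looks.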
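
The damage from peeking is easy to see in simulation. The sketch below runs A/A tests (no true effect) and compares "stop at the first significant look" against a single test at the planned sample size. The function name, the number of looks, and the batch size are arbitrary assumptions, and the z-test uses a known unit variance to keep the code short.

```python
import math
import random
from statistics import NormalDist


def false_positive_rates(n_sims=500, looks=10, batch=100, alpha=0.05, seed=1):
    """Simulate A/A tests; compare stopping at the first significant look
    against testing once at the planned final sample size."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        sum_c = sum_t = 0.0
        n = 0
        crossed = False
        for _ in range(looks):
            # Both arms draw from the same N(0, 1): there is no true effect.
            sum_c += sum(rng.gauss(0, 1) for _ in range(batch))
            sum_t += sum(rng.gauss(0, 1) for _ in range(batch))
            n += batch
            z = (sum_t / n - sum_c / n) / math.sqrt(2 / n)  # known unit variance
            if abs(z) > z_crit:
                crossed = True
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit  # z from the final look only
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"peeking: {peeking_rate:.2f}, fixed horizon: {fixed_rate:.2f}")
```

The fixed-horizon rate should sit near the nominal 5%, while checking at every look and stopping on the first significant result produces a noticeably higher false positive rate.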

Novelty effects

For UI changes, initial engagement is often inflated by novelty. Run for at least 1-2 weeks to let novelty decay, even if you hit your sample size faster.