Explain the Central Limit Theorem in plain terms. Why does it matter for A/B testing, confidence intervals, and building ML systems? When does it break down?

Question

Accepted Answer

What CLT says

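Take i.i.d. samples of size n from any distribution with mean μ and finite variance σ², and compute the sample mean X̄. As n → ∞, the standardized mean √n · (X̄ − μ)/σ converges in distribution to N(0, 1). Equivalently, for large n the sample mean is approximately distributed as:

X̄ ~ N(μ, σ²/n)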

This holds regardless of the shape of the original distribution (uniform, exponential, bimodal, whatever), as long as its variance is finite. The sampling distribution of the mean becomes approximately normal, and the approximation improves as n grows.
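A quick simulation makes this concrete. The sketch below is illustrative only: the helper names (`sample_means`, `skewness`, `draw_exponential`) and the choice of an Exponential(1) population are assumptions for the demo, not part of the theorem. It draws many sample means from a strongly right-skewed distribution and checks that their spread shrinks like σ/√n while their skewness moves toward 0, the value for a normal distribution.

```python
import random
import statistics


def sample_means(draw, n, trials, seed=0):
    """Return `trials` sample means, each computed from n draws of draw(rng)."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]


def skewness(xs):
    """Sample skewness; 0 for a symmetric distribution such as the normal."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)


def draw_exponential(rng):
    """Exponential(rate=1): mean 1, variance 1, strongly right-skewed."""
    return rng.expovariate(1.0)


if __name__ == "__main__":
    for n in (1, 10, 100):
        means = sample_means(draw_exponential, n, trials=5000)
        print(
            f"n={n:>3}  mean={statistics.fmean(means):.3f}  "
            f"sd={statistics.pstdev(means):.3f}  skew={skewness(means):.3f}"
        )
```

The standard deviation of the means should track 1/√n, and the skewness should fall steadily as n increases.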

Why this is powerful

It decouples "what distribution is my data?" from "can I do inference?". You don't need to know the true data distribution. If your sample is large enough, the sample mean is approximately normally distributed, which means you can:
- Construct confidence intervals: X̄ ± 1.96 · σ/√n contains the true mean μ in about 95% of repeated samples (in practice σ is replaced by the sample standard deviation s)
- Run z-tests and t-tests
- Compute p-values using normal or t-distribution tables

This is why so much classical frequentist inference works on non-normal data: the CLT justifies treating the sample mean as approximately normal.
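One way to see what "95%" means for these intervals is to simulate coverage. The following is a minimal sketch under stated assumptions: the population is Exponential(1) (true mean 1), σ is estimated by the sample standard deviation, and `ci_coverage` is a hypothetical helper name. It counts how often the interval X̄ ± 1.96 · s/√n contains the true mean.

```python
import math
import random
import statistics


def ci_coverage(n, trials=2000, z=1.96, seed=1):
    """Fraction of intervals (xbar +/- z * s / sqrt(n)) that contain the true mean."""
    rng = random.Random(seed)
    true_mean = 1.0  # mean of Exponential(rate=1), the assumed population
    hits = 0
    for _ in range(trials):
        xs = [rng.expovariate(1.0) for _ in range(n)]
        xbar = statistics.fmean(xs)
        se = statistics.stdev(xs) / math.sqrt(n)
        if xbar - z * se <= true_mean <= xbar + z * se:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (5, 30, 200):
        print(f"n={n:>3}  coverage={ci_coverage(n):.3f}")
```

With skewed data, coverage should sit noticeably below 95% at small n and approach it as n grows.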

In A/B testing

When you compare conversion rates between control and treatment, you're comparing two sample means (or proportions — which are means of 0/1 variables). CLT lets you assume those means are normally distributed, enabling the z-test:

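z = (X̄_treatment - X̄_control) / SE_diff

where SE_diff = √(s²_treatment/n_treatment + s²_control/n_control) is the standard error of the difference between the two independent group means.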

Without the CLT, you'd need to know the exact distribution of your metric, or fall back on resampling methods such as permutation tests, to do hypothesis testing. With the CLT, you only need a large enough sample.
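For conversion rates, the z-test above can be sketched in a few lines. This is one common variant, not the only one: it assumes independent groups and uses the unpooled standard error, and the function name and the example counts are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided z-test for a difference in conversion rates (unpooled SE)."""
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    # Variance of a 0/1 variable is p(1 - p); the variances of the two
    # independent group means add.
    se_diff = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z = (p_t - p_c) / se_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical experiment: 10.0% vs 11.0% conversion, 10,000 users per arm.
    z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
    print(f"z={z:.3f}  p={p_value:.4f}")
```

Some teams use a pooled standard error under the null hypothesis instead; for large, balanced samples the two give very similar answers.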

Standard error = σ/√n

The CLT tells you the spread of the sampling distribution: σ/√n. This is the standard error: how much sample means vary from sample to sample. Key implication: quadrupling the sample size halves the standard error. Doubling it halves the variance of the mean but shrinks the SE only by a factor of √2 (about 29%). Precision grows slowly.
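The square-root scaling is easy to check with arithmetic. The two helpers below are hypothetical names used only for this illustration; they compute the standard error for a given n and invert the formula to get the sample size needed for a target standard error.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean for population standard deviation sigma."""
    return sigma / math.sqrt(n)


def n_for_standard_error(sigma, target_se):
    """Smallest n with sigma / sqrt(n) <= target_se."""
    return math.ceil((sigma / target_se) ** 2)


if __name__ == "__main__":
    sigma = 10.0
    for n in (100, 200, 400):
        print(f"n={n}  SE={standard_error(sigma, n):.3f}")
    # Halving the target SE requires four times the sample size.
    print(n_for_standard_error(sigma, 0.5), n_for_standard_error(sigma, 0.25))
```

This is why tightening an experiment's precision gets expensive quickly: each halving of the error bar costs a fourfold increase in traffic.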

When CLT breaks down

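Heavy-tailed distributions: if σ² is infinite (Pareto with α ≤ 2, Cauchy), the classical CLT doesn't apply and sample means don't converge to a normal distribution. Revenue metrics, session length, and file sizes are often heavy-tailed. Even when their variance is technically finite, a few extreme values can dominate the mean and make convergence very slow, which is why median-based tests or log transforms are used for these metrics.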
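The Cauchy case can be demonstrated directly. In the sketch below (helper names are assumptions for the demo), the interquartile range of the sample means is used as the measure of spread, because the standard deviation of Cauchy means is itself undefined. For normal data the spread shrinks with n; for Cauchy data averaging does not help at all.

```python
import math
import random
import statistics


def iqr_of_sample_means(draw, n, trials=2000, seed=2):
    """Interquartile range of `trials` sample means of size n."""
    rng = random.Random(seed)
    means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


def draw_normal(rng):
    return rng.gauss(0.0, 1.0)


def draw_cauchy(rng):
    """Standard Cauchy via the inverse-CDF transform of a uniform draw."""
    return math.tan(math.pi * (rng.random() - 0.5))


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(
            f"n={n:>3}  normal IQR={iqr_of_sample_means(draw_normal, n):.3f}  "
            f"cauchy IQR={iqr_of_sample_means(draw_cauchy, n):.3f}"
        )
```

The mean of n standard Cauchy draws is again standard Cauchy, so its IQR should stay near 2 no matter how large n gets.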

Small n: "n=30 is enough" is a myth. For symmetric, light-tailed distributions, n=30 is often fine. For skewed distributions (binary events with p=0.001), you may need n=10,000+ before the normal approximation is reasonable.
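For the rare-event case, the quality of the normal approximation can be computed exactly rather than simulated. The sketch below assumes the plain normal-approximation (Wald) interval p̂ ± 1.96 · √(p̂(1 − p̂)/n); `binom_pmf` and `wald_coverage` are hypothetical helper names. It sums the binomial probabilities of every outcome whose interval contains the true p.

```python
import math


def binom_pmf(k, n, p):
    """Binomial pmf computed in log space to avoid overflow for large n."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def wald_coverage(n, p, z=1.96):
    """Exact probability that p_hat +/- z * sqrt(p_hat (1 - p_hat) / n) contains p."""
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n
        half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half_width <= p <= p_hat + half_width:
            total += binom_pmf(k, n, p)
    return total


if __name__ == "__main__":
    for n in (30, 1000, 10_000, 100_000):
        print(f"n={n:>6}  coverage={wald_coverage(n, 0.001):.3f}")
```

At n=30 most samples contain zero events, the interval collapses to a single point, and coverage is far below the nominal 95%. It only gets close once the expected number of events n · p is well above a handful.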

Correlated samples: the classical CLT assumes i.i.d. samples. Time series data, spatially correlated data, and clustered data (for example, many sessions from the same user) violate this. With positive correlation, the naive σ/√n standard error is too small, which inflates false positive rates.
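The size of this effect can be illustrated with a simulated autocorrelated series. The sketch below uses an AR(1) process as a stand-in for correlated data; the model, the coefficient 0.8, and the helper names are assumptions chosen for the demo. It compares the naive standard error s/√n with the actual standard deviation of the sample mean across many simulated series.

```python
import math
import random
import statistics


def ar1_series(n, phi, rng):
    """AR(1) series x_t = phi * x_{t-1} + noise, started from its stationary distribution."""
    x = rng.gauss(0.0, 1.0 / math.sqrt(1.0 - phi ** 2))
    xs = []
    for _ in range(n):
        xs.append(x)
        x = phi * x + rng.gauss(0.0, 1.0)
    return xs


def naive_vs_actual_se(n=200, phi=0.8, trials=1000, seed=3):
    """Return (average naive SE, actual SD of the sample mean) over many series."""
    rng = random.Random(seed)
    means, naive_ses = [], []
    for _ in range(trials):
        xs = ar1_series(n, phi, rng)
        means.append(statistics.fmean(xs))
        naive_ses.append(statistics.stdev(xs) / math.sqrt(n))
    return statistics.fmean(naive_ses), statistics.stdev(means)


if __name__ == "__main__":
    for phi in (0.0, 0.8):
        naive, actual = naive_vs_actual_se(phi=phi)
        print(f"phi={phi}  naive SE={naive:.3f}  actual SE={actual:.3f}")
```

With no autocorrelation the two numbers should agree; with strong positive autocorrelation the naive figure should come out several times too small.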

ML connection

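- The mini-batch loss and gradient are averages over B examples, so their sampling noise is approximately normal by the CLT and their variance shrinks as 1/B. This is why stochastic gradient estimates have manageable variance, and why quadrupling the batch size only halves the gradient noise
- Ensemble methods (bagging, random forests) work partly through variance reduction: averaging n independent predictions, each with variance σ², gives variance σ²/n. This follows from the variance of a mean (the same fact behind the CLT's σ²/n), and the gain is smaller when the predictions are correlated, as bagged models usually are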
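- Evaluation metrics on a test set are estimates subject to CLT-driven standard errors: with 10,000 test examples and accuracy near 90%, the standard error is about 0.3 percentage points, so a 0.1% accuracy improvement reported without error bars is hard to distinguish from noise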
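Putting error bars on a test-set accuracy takes only a few lines. This is a minimal sketch that treats each test example as an independent 0/1 outcome and uses the normal approximation; the function name and the example counts are hypothetical.

```python
import math


def accuracy_interval(correct, n, z=1.96):
    """Accuracy, its standard error, and an approximate 95% interval."""
    acc = correct / n
    se = math.sqrt(acc * (1 - acc) / n)
    return acc, se, (acc - z * se, acc + z * se)


if __name__ == "__main__":
    # Hypothetical model: 9,000 correct out of 10,000 test examples.
    acc, se, (low, high) = accuracy_interval(9000, 10_000)
    print(f"accuracy={acc:.4f}  SE={se:.4f}  95% CI=({low:.4f}, {high:.4f})")
```

When two models are scored on the same test set their errors are correlated, so a paired comparison (for example, McNemar's test on the examples where they disagree) is usually a better fit than comparing two independent intervals.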