Question

Your A/B test returns p=0.001, which is highly significant. Your PM is excited. What questions do you ask before declaring victory? Explain the difference between statistical significance and practical significance, and when a highly significant result can still be useless.

Accepted Answer

What p-values don't tell you

A p-value answers: "If the null hypothesis were true (no effect), how likely is data at least this extreme?" It does not tell you:
- How large the effect is
- Whether the effect is meaningful in practice
- Whether you should ship the change

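With a large enough sample, even a 0.001% improvement in conversion rate will reach statistical significance (p < 0.05). The p-value mixes effect size with sample size, so it cannot tell you which of the two produced a small value.

z = effect_size × √n / σ

This is the one-sample form. With two equal arms of n users each, the standard error picks up an extra factor of √2, but z scales with √n in the same way.

Holding effect size fixed, doubling n multiplies z by √2, which makes p smaller. A small effect + huge n → small p. A large effect + small n → large p.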
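To make the scaling concrete, here is a minimal sketch that plugs a fixed effect into the formula z = effect_size × √n / σ and watches p fall as n grows. The function name `z_and_p` and the numbers (effect 0.001, σ 0.5) are made up for illustration; the p-value is two-sided under a normal approximation.

```python
import math

def z_and_p(effect, sigma, n):
    """Return (z, two-sided p) for z = effect * sqrt(n) / sigma."""
    z = effect * math.sqrt(n) / sigma
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical inputs: the effect never changes, only the sample size does.
for n in (10_000, 1_000_000, 100_000_000):
    z, p = z_and_p(effect=0.001, sigma=0.5, n=n)
    print(f"n={n:>11,}  z={z:6.2f}  p={p:.3g}")
```

The same 0.001 effect is nowhere near significant at n = 10,000, crosses p < 0.05 at n = 1,000,000, and is overwhelmingly significant at n = 100,000,000.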

Effect size measures

Cohen's d (for continuous metrics):
d = (μ_treatment - μ_control) / σ_pooled
- d=0.2: small effect
- d=0.5: medium effect  
- d=0.8: large effect

Relative lift: (μ_t - μ_c) / μ_c. The most interpretable for business metrics. A 2% lift in conversion vs a 0.002% lift are very different decisions even if both are significant.

Other measures: Cohen's h for a difference between two proportions, and Cohen's f² for regression effects (computed from R²).
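A rough sketch of three of these measures in plain Python. The function names and the example rates (10.0% vs 10.2% conversion) are illustrative assumptions; Cohen's d uses the usual pooled standard deviation, and Cohen's h uses the arcsine transform of each proportion.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def relative_lift(mean_t, mean_c):
    """Lift of treatment over control, as a fraction of control."""
    return (mean_t - mean_c) / mean_c

def cohens_h(p_t, p_c):
    """Effect size for two proportions via the arcsine transform."""
    return 2 * math.asin(math.sqrt(p_t)) - 2 * math.asin(math.sqrt(p_c))

# Hypothetical conversion rates: control 10.0%, treatment 10.2%.
print(f"relative lift: {relative_lift(0.102, 0.100):.1%}")
print(f"Cohen's h:     {cohens_h(0.102, 0.100):.4f}")
```

With these inputs the relative lift is 2%, while Cohen's h comes out far below the 0.2 "small" benchmark. That gap is a reminder that the benchmarks are conventions, and that relative lift is usually the easier number to reason about for business metrics.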

Concrete example: search ranking at Google scale

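Suppose you run an A/B test with 100M users. Treatment shows a 0.0001 percentage-point improvement in click-through rate (1 extra click per million queries), with p=0.0001: extremely significant. Treat the numbers as illustrative; at typical click-through rates, resolving a difference that small takes on the order of trillions of queries.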

Is this meaningful? That depends on:
- Cost of shipping: if it's a minor algorithm tweak with no maintenance cost, even tiny gains compound at 100M scale
- User experience impact: 0.0001% is imperceptible. If the change degraded latency by 1ms, that latency cost might outweigh the tiny CTR gain.
- Opportunity cost: could engineering time spent on this change have found a 2% improvement elsewhere?

Statistical vs practical significance framework

|  | Statistically significant | Not statistically significant |
|---|---|---|
| Practically significant | Ship it | Need more data / ran underpowered |
| Not practically significant | Don't ship: large sample amplified a tiny effect | No evidence of a meaningful effect |

The dangerous cell: statistically significant but not practically significant. This is how teams ship features that genuinely work (improve the metric) but have no impact on the product.
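The four cells can be written down as a small decision helper. This is only a sketch: `classify_result`, the `alpha` default, and the threshold values are hypothetical, and it assumes positive lift means improvement (significant regressions need their own handling).

```python
def classify_result(p_value, observed_lift, min_meaningful_lift, alpha=0.05):
    """Map a test result onto the four cells of the table.

    Assumes positive lift is an improvement; thresholds are illustrative.
    """
    statistically_significant = p_value < alpha
    practically_significant = observed_lift >= min_meaningful_lift

    if statistically_significant and practically_significant:
        return "ship it"
    if statistically_significant:
        return "don't ship: effect is real but too small to matter"
    if practically_significant:
        return "need more data: test was underpowered"
    return "no evidence of a meaningful effect"

# The dangerous cell: p = 0.001, but a 0.1% lift against a 1% threshold.
print(classify_result(p_value=0.001, observed_lift=0.001, min_meaningful_lift=0.01))
```

The point of encoding it this way is that the p-value alone never picks the cell; the threshold for a meaningful lift has to be supplied separately.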

Define the MDE (minimum detectable effect) before the test

The right approach: before running the test, define the minimum effect size you'd care to detect. This is a business decision:

"We only want to ship this change if it improves conversion by at least 1%." Run the test powered to detect 1%. If the result is significant but shows only 0.1% lift, the answer is: not worth shipping despite significance.

This prevents the post-hoc reasoning trap: "it's significant at p=0.03 and shows +0.1% lift, let's ship" — without having pre-defined whether 0.1% is meaningful.
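Once the MDE is fixed, the required sample size follows from a standard power calculation. The sketch below uses the normal approximation for comparing two proportions; the function name, the 10% baseline rate, and the reading of "1%" as a relative lift are all assumptions for illustration.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, mde_relative, alpha=0.05, power=0.8):
    """Approximate users per arm to detect a relative lift of mde_relative.

    Two-sided test, normal approximation for a difference of proportions.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical: 10% baseline conversion, MDE of 1% vs 0.1% relative lift.
print(sample_size_per_arm(0.10, 0.01))
print(sample_size_per_arm(0.10, 0.001))
```

Shrinking the MDE by 10x raises the required sample by roughly 100x, which is why the MDE should come from the business case and not from whatever the available traffic happens to detect.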

Confidence interval framing

Report the effect with a confidence interval, not just p-value:

"Treatment increased conversion by 0.12% (standard error 0.03%; 95% CI: 0.06% to 0.18%), p<0.001."

Now stakeholders can see: the effect is tiny, even if precisely measured. Compare to your MDE (say, 1%) and the decision is clear.
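A minimal sketch of producing that kind of report from raw counts, using a Wald (normal-approximation) interval for the difference of two proportions. The function name, the counts, and the MDE of one percentage point are hypothetical.

```python
import math
from statistics import NormalDist

def lift_with_ci(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Absolute difference in conversion rate with a Wald confidence interval."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z_crit * se, diff + z_crit * se

# Hypothetical counts: 1M users per arm, 100,000 vs 101,200 conversions.
diff, low, high = lift_with_ci(100_000, 1_000_000, 101_200, 1_000_000)
mde = 0.01  # assumed MDE: one percentage point, absolute

print(f"lift = {diff:.4%} (95% CI: {low:.4%} to {high:.4%})")
print("whole interval below MDE" if high < mde else "MDE not ruled out")
```

Comparing the upper bound of the interval to the MDE is the useful check here: if even the optimistic end falls short, more data will not change the decision.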

When significance without practical impact matters

- Safety/quality: even a tiny but statistically significant degradation in error rate matters if errors affect users badly
- Regression detection: flag any significant regressions regardless of size before shipping
- Long-term compounding: small effects compound over years in a large system (twelve successive 1% gains compound to 1.01^12 ≈ 1.127, about 12.7% cumulative)
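The compounding arithmetic is quick to check. In this sketch the 1% gain and the count of 12 changes come from the example above; the function name is made up.

```python
def cumulative_gain(gain_per_change, num_changes):
    """Cumulative relative gain when each change multiplies the metric."""
    return (1 + gain_per_change) ** num_changes - 1

additive = 0.01 * 12                   # naive sum of twelve 1% gains
compounded = cumulative_gain(0.01, 12)  # multiplicative compounding

print(f"additive:   {additive:.2%}")
print(f"compounded: {compounded:.2%}")
```

Compounding gives slightly more than the simple sum, and the gap widens as the number of changes grows.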