Explain Simpson's paradox with a concrete example. When does it appear in ML workflows? How do you detect it, and what does it tell you about the relationship between correlation and causation?

Question

Accepted Answer

The paradox

A trend that appears across subgroups can disappear or reverse when the subgroups are combined.

Classic example — kidney stone treatment:

| | Treatment A | Treatment B |
|---|---|---|
| Small stones | 93% (81/87) | 87% (234/270) |
| Large stones | 73% (192/263) | 69% (55/80) |
| Overall | 78% (273/350) | 83% (289/350) |

Treatment A is better in both subgroups, yet Treatment B looks better overall. How?

Because large stones (harder to treat) were overwhelmingly assigned to Treatment A (263 vs 80). Treatment A treated more of the hard cases. The size-of-stones variable is a confounder — it correlates with both the treatment assigned and the outcome.

The math

Weighted averages combine subgroup rates with different weights. If subgroup sizes differ across groups, the aggregate can be dominated by whichever group got more weight — not by the true treatment effect.

Overall_A = 0.73 × (263/350) + 0.93 × (87/350) ≈ 0.78
Overall_B = 0.69 × (80/350) + 0.87 × (270/350) ≈ 0.83

B's aggregate is higher because most of B's patients had small stones (easy cases).

Where ML practitioners encounter this

Model evaluation: overall accuracy looks fine, but the model performs poorly on minority subgroups. Reporting aggregate accuracy hides this. Fix: stratified evaluation by subgroup.

Feature importance: a feature may look positively correlated with the target in aggregate but negatively correlated within each user segment if segment size is confounded.

A/B testing: treatment and control groups aren't perfectly balanced on a confounding variable. If heavy users go to control (because of ramp-up order), aggregate results favor control even if treatment is genuinely better. Fix: randomization and pre-stratification.

Recommendation systems: items that are popular get shown more, accumulate more clicks, look more appealing — but popularity itself is the confounder. The item didn't get more clicks because it's better; it got more clicks because it was shown more.

Training data: a model trained on data where gender correlates with job level (due to historical hiring bias) will learn that correlation — causally wrong.

Detection

1. Identify potential confounders (variables that could affect both your predictor and outcome)
2. Stratify your analysis by those confounders and compare direction of effects
3. If direction reverses or magnitude dramatically changes after stratification, Simpson's paradox is at play

What to trust: aggregate or subgroup?

Depends on the causal question. If you're assigning treatment to a random patient, condition on the confounder (use subgroup rates). If you're measuring the overall population effect regardless of stone size, aggregate is correct. The paradox forces you to be explicit about what causal question you're answering.

The aggregate is never "wrong" — it's answering a different question. Know which question you're asking.