Your model predicts that users who receive a discount are 30% more likely to purchase. A colleague says "let's give everyone discounts." What's wrong with this reasoning? Walk me through causal inference — why observational data is tricky, and how techniques like propensity score matching and difference-in-differences help.

Question

Accepted Answer

The discount example: selection bias

Discounts aren't randomly assigned. They're typically given to users who are churning, already engaged, or high-value. Users who receive discounts are different from users who don't — they were selected for a reason correlated with purchase intent.

Observing "discount recipients purchase 30% more" mixes the discount's effect with the pre-existing difference between the recipient and non-recipient populations. If you gave everyone a discount, the average lift would likely be much smaller than 30%, because part of that gap comes from who was selected rather than from the discount itself, and the users who were never selected may respond differently.

The observed correlation ≠ the causal effect of the discount.
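To make the gap between correlation and causal effect concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the targeting policy (engaged users get discounts 80% of the time, others 20%), the baseline purchase rates, and a true discount effect of 5 percentage points. None of these numbers come from real data.

```python
import random

TRUE_EFFECT = 0.05  # assumed: the discount adds 5 points of purchase probability


def simulate_users(n=40_000, seed=0):
    """Return (engaged, discount, purchased) tuples under a made-up targeting policy."""
    rng = random.Random(seed)
    users = []
    for _ in range(n):
        engaged = rng.random() < 0.5
        # Targeting: engaged users are far more likely to be sent a discount.
        discount = rng.random() < (0.8 if engaged else 0.2)
        # Engagement raises purchase probability on its own; the discount adds TRUE_EFFECT.
        p_buy = (0.40 if engaged else 0.10) + (TRUE_EFFECT if discount else 0.0)
        purchased = rng.random() < p_buy
        users.append((engaged, discount, purchased))
    return users


def purchase_rate(users):
    return sum(purchased for _, _, purchased in users) / len(users)


users = simulate_users()
with_discount = [u for u in users if u[1]]
without_discount = [u for u in users if not u[1]]

# Naive comparison: in expectation 0.39 - 0.16 = 0.23, far above the true 0.05.
naive_gap = purchase_rate(with_discount) - purchase_rate(without_discount)

print(f"naive gap:   {naive_gap:.3f}")
print(f"true effect: {TRUE_EFFECT:.3f}")
```

Most of the naive gap here comes from the fact that 80% of discount recipients are engaged users, who buy more with or without a discount.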

The fundamental problem of causal inference

For each user, you want to know: what would happen with the treatment vs without the treatment? But you can only observe one outcome per user. The counterfactual is missing.

Randomized Controlled Trials (RCTs / A/B tests) solve this by randomly assigning treatment — making the treated and untreated groups identical in expectation on all confounders, observed and unobserved. The only difference is the treatment.
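The sketch below illustrates both points under assumed numbers: each simulated user has two potential outcomes, Y(0) and Y(1), but only one is ever observed, and a coin-flip assignment still recovers the average effect. The purchase probabilities and the 5-point discount effect are hypothetical; the true effect is only computable here because this is a simulation.

```python
import random
from statistics import mean


def simulate_potential_outcomes(n=40_000, seed=1):
    """Return (y0, y1) pairs: purchase without and with a discount, per user."""
    rng = random.Random(seed)
    users = []
    for _ in range(n):
        engaged = rng.random() < 0.5
        p0 = 0.40 if engaged else 0.10  # purchase probability without a discount
        p1 = p0 + 0.05                  # with a discount (assumed +5 points)
        u = rng.random()                # one shared draw, so y1 >= y0 for every user
        users.append((int(u < p0), int(u < p1)))
    return users


users = simulate_potential_outcomes()

# Only available in a simulation: both potential outcomes for every user.
true_ate = mean(y1 - y0 for y0, y1 in users)

# Randomized assignment: a coin flip decides which outcome we get to observe.
rng = random.Random(2)
treated_outcomes, control_outcomes = [], []
for y0, y1 in users:
    if rng.random() < 0.5:
        treated_outcomes.append(y1)  # Y(0) for this user stays unobserved
    else:
        control_outcomes.append(y0)  # Y(1) for this user stays unobserved

rct_estimate = mean(treated_outcomes) - mean(control_outcomes)

print(f"true ATE:     {true_ate:.3f}")
print(f"RCT estimate: {rct_estimate:.3f}")
```

Because assignment ignores engagement, both groups contain the same mix of engaged and non-engaged users in expectation, so the simple difference in means is an unbiased estimate.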

When RCTs aren't possible (for ethical or cost reasons, or because only historical data is available), you need observational methods.

Propensity score matching

Propensity score: p(treatment=1 | X), the probability of receiving treatment given observed covariates X. It is typically estimated with a classifier such as logistic regression on user features.

Matching: for each treated user, find an untreated user with a similar propensity score. Compare outcomes between matched pairs. The matched untreated user serves as the counterfactual.

Why propensity scores work: if treatment assignment is fully explained by observed covariates X (the ignorability assumption) and every kind of user has some chance of being treated and of being untreated (overlap), matching on those covariates (or on their propensity score) balances the groups as if treatment were randomly assigned.

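ATT = E[Y(1) - Y(0) | T=1] ≈ mean(Y_treated) - mean(Y_matched_control)

Matching each treated user to a control estimates the average effect on the treated (ATT). It equals the ATE = E[Y(1) - Y(0)] only if the effect is the same across users; otherwise, estimating the ATE also requires matching each untreated user to a treated one.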

Limitation: only controls for observed confounders. If unobserved confounders exist (users who received discounts are also more engaged in ways not captured in features), estimates are still biased.
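A minimal sketch of the matching procedure, using only the Python standard library. The data-generating process (a single observed `engagement` covariate, a constant treatment effect of 1.0) is an assumption chosen so that ignorability holds by construction. The logistic regression is a small hand-rolled gradient descent; in practice one would likely use a library. Matching each treated user to a control estimates the effect on the treated (ATT), which coincides with the ATE here only because the simulated effect is constant.

```python
import bisect
import math
import random
from statistics import mean

TRUE_EFFECT = 1.0  # assumed constant effect of the discount on spend


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n=3_000, seed=0):
    """Return (engagement, treated, outcome) rows; engagement is the only confounder."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        t = int(rng.random() < sigmoid(1.5 * x))  # engaged users get discounts more often
        y = 2.0 * x + TRUE_EFFECT * t + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


def fit_logistic(xs, ts, lr=0.5, steps=200):
    """One-feature logistic regression fitted by batch gradient descent."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, t in zip(xs, ts):
            err = sigmoid(w * x + b) - t
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def matched_att(rows, scores):
    """1-nearest-neighbour matching on the propensity score, with replacement."""
    controls = sorted((s, y) for (_, t, y), s in zip(rows, scores) if t == 0)
    control_scores = [s for s, _ in controls]
    diffs = []
    for (_, t, y), s in zip(rows, scores):
        if t == 0:
            continue
        i = bisect.bisect_left(control_scores, s)
        # The closest control is one of the two neighbours around the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(controls)]
        j = min(candidates, key=lambda k: abs(control_scores[k] - s))
        diffs.append(y - controls[j][1])
    return mean(diffs)


rows = simulate()
w, b = fit_logistic([x for x, _, _ in rows], [t for _, t, _ in rows])
scores = [sigmoid(w * x + b) for x, _, _ in rows]

naive_diff = mean(y for _, t, y in rows if t) - mean(y for _, t, y in rows if not t)
att_estimate = matched_att(rows, scores)

print(f"naive difference: {naive_diff:.2f}")
print(f"matched ATT:      {att_estimate:.2f}  (true effect {TRUE_EFFECT})")
```

If `engagement` were hidden from the model, the matched estimate would drift back toward the naive one, which is the unobserved-confounder limitation in action.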

Difference-in-differences (DiD)

Used when you have pre- and post-treatment measurements for both treated and control groups.

DiD = (Y_treated_post - Y_treated_pre) - (Y_control_post - Y_control_pre)

Parallel trends assumption: in the absence of treatment, treated and control groups would have followed the same trend over time. The control group provides the counterfactual trend.

Example: a city introduces a new policy. You compare outcome changes in the treated city vs a similar untreated city over the same period. Shared external trends (economy, season) cancel out.
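The city example can be sketched as a small simulation. All numbers are hypothetical: the two cities start from different baselines, both experience the same shared trend, and the policy adds an assumed effect of 1.5 in the treated city. Parallel trends holds here by construction, which real data does not guarantee.

```python
import random
from statistics import mean

TRUE_EFFECT = 1.5   # assumed effect of the policy
SHARED_TREND = 2.0  # change that hits both cities (economy, season)


def simulate_city(n, baseline, treated, rng):
    """Return (pre, post) outcome samples for one city."""
    pre = [baseline + rng.gauss(0, 1) for _ in range(n)]
    lift = SHARED_TREND + (TRUE_EFFECT if treated else 0.0)
    post = [baseline + lift + rng.gauss(0, 1) for _ in range(n)]
    return pre, post


rng = random.Random(0)
treated_pre, treated_post = simulate_city(5_000, baseline=10.0, treated=True, rng=rng)
control_pre, control_post = simulate_city(5_000, baseline=7.0, treated=False, rng=rng)

# Before/after in the treated city alone absorbs the shared trend (about 3.5).
before_after = mean(treated_post) - mean(treated_pre)
# Comparing cities after the policy absorbs the baseline gap (about 4.5).
post_gap = mean(treated_post) - mean(control_post)
# DiD subtracts the control city's change, leaving roughly the policy effect.
did_estimate = before_after - (mean(control_post) - mean(control_pre))

print(f"before/after (treated only): {before_after:.2f}")
print(f"post-period gap:             {post_gap:.2f}")
print(f"difference-in-differences:   {did_estimate:.2f}  (true effect {TRUE_EFFECT})")
```

Both single differences are contaminated, one by the shared trend and one by the baseline gap; only the double difference removes both.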

Instrumental variables (IV)

When you have a variable Z that:
1. Affects treatment assignment (Z → T)
2. Affects outcome only through treatment (Z → Y only via T)
3. Is independent of confounders

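Z is an "instrument." Example: proximity to a hospital affects whether you get a treatment, but is assumed not to affect health outcomes except through that treatment. Condition 2 cannot be verified from the data alone; it has to be argued from domain knowledge.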

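IV estimates the Local Average Treatment Effect (LATE): the effect among "compliers," the users whose treatment status is actually changed by the instrument. This also requires monotonicity (the instrument never pushes anyone away from treatment). LATE is narrower than the ATE, but it is a valid causal estimate when these conditions hold, even with unobserved confounders.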
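As a rough sketch, the simplest IV estimator for a binary instrument is the Wald ratio: the instrument's effect on the outcome divided by its effect on treatment. The setup below is hypothetical: `z` stands for a randomly sent reminder email, `u` for unobserved purchase intent, and the treatment effect is a constant 1.0. The three instrument conditions hold by construction, not because any real email campaign would satisfy them.

```python
import random
from statistics import mean

TRUE_EFFECT = 1.0  # assumed constant effect of treatment on the outcome


def simulate(n=40_000, seed=0):
    """Return (z, t, y) rows with an unobserved confounder driving both t and y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)          # unobserved confounder, never used in estimation
        z = int(rng.random() < 0.5)  # instrument: randomized, independent of u
        # Relevance and monotonicity: z can only push a user toward treatment.
        t = int(u + 1.5 * z + rng.gauss(0, 1) > 0.75)
        # Exclusion: z does not appear in the outcome equation.
        y = TRUE_EFFECT * t + 2.0 * u + rng.gauss(0, 1)
        rows.append((z, t, y))
    return rows


rows = simulate()

# Naive comparison of treated vs untreated is confounded by u.
naive_diff = mean(y for _, t, y in rows if t) - mean(y for _, t, y in rows if not t)

# Wald estimator: (effect of z on y) / (effect of z on t).
reduced_form = mean(y for z, _, y in rows if z) - mean(y for z, _, y in rows if not z)
first_stage = mean(t for z, t, _ in rows if z) - mean(t for z, t, _ in rows if not z)
wald_estimate = reduced_form / first_stage

print(f"naive difference: {naive_diff:.2f}")
print(f"first stage:      {first_stage:.2f}")
print(f"Wald IV estimate: {wald_estimate:.2f}  (true effect {TRUE_EFFECT})")
```

A weak first stage (a denominator close to zero) makes this ratio very noisy, which is one practical reason instruments need to move treatment substantially.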

When correlation-based ML is fine

Prediction tasks: if you want to predict who will churn, a correlation between features and churn is sufficient — you don't need causation.

Causal inference is needed when you want to answer "what happens if I intervene?" — setting a discount, changing a UI element, adjusting a policy.

Common traps in production ML

- Feature importance from a model ≠ causal importance. A feature can be highly predictive while having zero causal effect (e.g., a proxy for a confounder).
- Optimizing a model on logged data where actions were taken by a previous policy creates feedback loops — the model learns from a biased distribution.
- Counterfactual evaluation of rankers and recommendation systems requires causal thinking, not just offline metrics.
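The first trap above can be illustrated with a small simulation, under assumed numbers. A hypothetical `proxy` feature tracks an unobserved `confounder` that drives the outcome; the proxy itself has zero causal effect. In logged data it predicts the outcome well, but when it is set independently (an intervention), the relationship disappears.

```python
import random
from statistics import mean


def slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


rng = random.Random(0)
n = 20_000
confounder = [rng.gauss(0, 1) for _ in range(n)]  # e.g. underlying purchase intent

# Observational world: the proxy follows the confounder; the outcome ignores the proxy.
proxy = [c + rng.gauss(0, 0.5) for c in confounder]
outcome = [3.0 * c + rng.gauss(0, 1) for c in confounder]
observational_slope = slope(proxy, outcome)  # in expectation 3 / 1.25 = 2.4

# Interventional world: we set the proxy ourselves, independently of the confounder.
proxy_set = [rng.gauss(0, 1) for _ in range(n)]
outcome_after = [3.0 * c + rng.gauss(0, 1) for c in confounder]
interventional_slope = slope(proxy_set, outcome_after)

print(f"slope in logged data:     {observational_slope:.2f}")
print(f"slope under intervention: {interventional_slope:.2f}")
```

A predictive model would rightly rank `proxy` as an important feature, yet pushing it up or down would not move the outcome at all.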