You have a dataset collected from production logs to train a new ML model. How do you detect whether it has sampling bias, and what techniques can you use to correct for it?
You mentioned inverse propensity weighting. In practice, how do you estimate the propensity scores when the selection mechanism is a complex recommendation model with billions of parameters? And what goes wrong when propensity estimates are inaccurate?
tldr
Production data is always biased: selection bias (the serving model filters which observations get logged), survivorship bias, popularity bias, temporal bias. Detect it with Kolmogorov-Smirnov (KS) tests on feature distributions, Population Stability Index (PSI), propensity analysis, and subgroup evaluation. Correct it with inverse propensity weighting (upweight underrepresented examples), doubly-robust estimators (unbiased if either the propensity model or the outcome model is correct), stratified resampling, or collecting unbiased exploration data. Log propensities at serving time when possible; estimating them retroactively is error-prone.
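A minimal detection sketch, assuming you have a small unbiased reference sample (e.g., a random/exploration slice) to compare against the production logs. The DataFrame inputs, feature names, binning choice, and the PSI > 0.2 rule of thumb are illustrative assumptions, not part of the answer above.

```python
import numpy as np
from scipy import stats


def psi(expected, actual, n_bins=10):
    """Population Stability Index between two 1-D samples, binned on the reference sample."""
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    eps = 1e-6  # avoids log(0) / division by zero in empty bins
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps  # values outside the reference range are ignored here
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


def shift_report(reference_df, production_df, features):
    """Per-feature KS test and PSI; a large KS statistic or PSI above ~0.2 suggests sampling shift."""
    report = {}
    for col in features:
        ks_stat, p_value = stats.ks_2samp(reference_df[col], production_df[col])
        report[col] = {
            "ks": ks_stat,
            "p": p_value,
            "psi": psi(reference_df[col].to_numpy(), production_df[col].to_numpy()),
        }
    return report
```

And a sketch of inverse propensity weighting with clipping, assuming propensities P(example was logged) were recorded at serving time. Clipping and self-normalization are the usual guards against the exploding-variance failure mode when propensity estimates are inaccurate or near zero; the names below are illustrative.

```python
import numpy as np


def ipw_weights(propensities, clip=(0.01, 1.0), normalize=True):
    """Turn logged propensities into per-example training weights."""
    p = np.clip(np.asarray(propensities, dtype=float), clip[0], clip[1])
    w = 1.0 / p                        # rare (low-propensity) examples get upweighted
    if normalize:
        w = w * len(w) / w.sum()       # self-normalize so the mean weight stays ~1
    return w


# Usage with any weighted loss, e.g. a scikit-learn estimator that accepts sample_weight:
#   model.fit(X_train, y_train, sample_weight=ipw_weights(logged_propensities))
```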
follow-up
- How does Simpson's paradox relate to sampling bias, and can you give an ML example where ignoring a confounding variable reverses the apparent relationship?
- Your recommendation model has a feedback loop — it only recommends items it's confident about, so it never collects data on uncertain items. How do you break this loop without hurting user experience?
- When is reweighting insufficient and you fundamentally need new data collection to fix the bias?