What is data leakage in ML? Give me examples of how it appears, and how you'd catch and prevent it.
tldr
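Four leakage types: (1) target leakage: a feature encodes the label, or information only available after the outcome; audit each feature for what was actually known at prediction time. (2) Train-test contamination: preprocessing statistics (scaling, imputation, feature selection) are computed on the full dataset; fit them on the training split only, ideally inside a Pipeline. (3) Temporal leakage: future rows inform predictions about the past; don't shuffle time series, use walk-forward CV. (4) Group leakage: rows from the same entity land in both train and test; split by entity ID, not by row. Red flags: suspiciously high accuracy, a single dominant feature, immediate degradation in production. Leaky models look great in offline validation and fail in deployment.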
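A minimal sketch of fixes (2), (3) and (4) in scikit-learn, on synthetic data. The `groups` array is a hypothetical entity ID column, and the rows are assumed to be already sorted by time for the walk-forward split. Neither assumption comes from the card itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real feature matrix (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Hypothetical entity IDs: 60 entities with 10 rows each.
groups = np.repeat(np.arange(60), 10)

# (2) The scaler sits inside the pipeline, so cross-validation refits it
# on each fold's training rows only; test rows never touch its statistics.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# (4) GroupKFold keeps every row of an entity on one side of each split.
group_cv = GroupKFold(n_splits=5)
group_scores = cross_val_score(
    model, X, y, cv=group_cv, groups=groups, scoring="roc_auc"
)

# (3) TimeSeriesSplit is walk-forward: each training window ends before
# its test window starts (assumes rows are in chronological order).
time_cv = TimeSeriesSplit(n_splits=5)
time_scores = cross_val_score(model, X, y, cv=time_cv, scoring="roc_auc")

# Sanity checks on the splits themselves.
for train_idx, test_idx in group_cv.split(X, y, groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])
for train_idx, test_idx in time_cv.split(X):
    assert train_idx.max() < test_idx.min()
```

The split checks at the end are cheap to keep in a real project: they fail loudly if someone swaps in a splitter that shuffles rows or ignores entity IDs.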
follow-up
- How would you structure a sklearn Pipeline to guarantee no leakage during cross-validation?
- Your colleague ran CV and got 0.98 AUC on what should be a hard problem. What's your checklist for diagnosing leakage?
- In a recommender system, how do you prevent leakage when evaluating offline metrics?