How do you handle missing data in a production ML system? When is missingness itself predictive, and when does imputation introduce bias?
tldr
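Handle missing data according to why it is missing. A strong baseline is simple imputation plus a missing-indicator feature per column, with imputers fit only on the training split (or inside each CV fold) so that no validation statistics leak in. Missingness is predictive when the reason a value is absent correlates with the target. Imputation introduces bias when values are not missing completely at random and the fill value hides that pattern or distorts the feature's distribution. In production, monitor per-feature missing rates, because a sudden shift often signals a pipeline failure rather than real signal.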
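A minimal sketch of that baseline in plain Python, assuming rows are dictionaries and missing values arrive as `None` or NaN. The helper names (`fit_median_imputer`, `transform`, `drifted_features`), the sample data, and the 0.10 tolerance are all hypothetical choices for illustration, not a prescribed implementation.

```python
import math
from statistics import median


def is_missing(value):
    """Treat None and float NaN as missing (an assumption about the data format)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_median_imputer(train_rows, features):
    """Learn one fill value per feature from TRAINING rows only."""
    fill = {}
    for name in features:
        observed = [r[name] for r in train_rows if not is_missing(r.get(name))]
        # Falling back to 0.0 for an all-missing column is an arbitrary choice.
        fill[name] = median(observed) if observed else 0.0
    return fill


def transform(rows, fill):
    """Impute with the learned values and add a 0/1 missing indicator per feature."""
    out = []
    for row in rows:
        new_row = {}
        for name, fill_value in fill.items():
            missing = is_missing(row.get(name))
            new_row[name] = fill_value if missing else row[name]
            new_row[name + "_missing"] = int(missing)
        out.append(new_row)
    return out


def missing_rates(rows, features):
    """Fraction of rows in which each feature is missing."""
    return {n: sum(is_missing(r.get(n)) for r in rows) / len(rows) for n in features}


def drifted_features(train_rates, live_rates, tolerance=0.10):
    """Features whose live missing rate differs from training by more than tolerance."""
    return sorted(n for n in train_rates if abs(live_rates[n] - train_rates[n]) > tolerance)


features = ["income", "age"]
train = [
    {"income": 40.0, "age": 30},
    {"income": None, "age": 41},
    {"income": 60.0, "age": None},
    {"income": 50.0, "age": 35},
]
live = [
    {"income": None, "age": 29},
    {"income": None, "age": None},
    {"income": 45.0, "age": 33},
    {"income": None, "age": 47},
]

fill = fit_median_imputer(train, features)   # statistics come from train only
live_ready = transform(live, fill)           # same fill values reused at serving time
alerts = drifted_features(missing_rates(train, features), missing_rates(live, features))
```

Here `income` is missing in 25% of training rows but 75% of live rows, so it is the feature the drift check flags. The check cannot tell whether that reflects a real population change or a broken upstream job; it only says someone should look.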
follow-up
- What is the difference between MCAR, MAR, and MNAR?
- Why can dropping rows with missing values bias a model?
- How would you distinguish legitimate missingness from a broken feature pipeline?