mlprep / ML Breadth · hard · 12 min

How do you handle missing data in a production ML system? When is missingness itself predictive, and when does imputation introduce bias?

tldr

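Handle missing data according to why it is missing. A strong baseline is simple imputation plus missing-indicator features, with imputers fit on training data only (inside each CV fold) so that no validation or test statistics leak into the fill values. Monitor per-feature missing rates in production. Missingness can be a predictive signal, a source of bias, or a symptom of a pipeline failure.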
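
A minimal sketch of the baseline above, in standard-library Python so the mechanics are visible. The class `MedianImputerWithIndicator` and the toy `train`/`test` rows are hypothetical, made up for illustration; in practice a library imputer (for example scikit-learn's `SimpleImputer` with `add_indicator=True`, placed inside a pipeline) would likely play this role.

```python
import math
from statistics import median


class MedianImputerWithIndicator:
    """Hypothetical helper: median-impute each column and append a 0/1 missing flag."""

    def fit(self, rows):
        # Learn one median per column from the observed (non-NaN) values.
        # Assumes every column has at least one observed value.
        n_cols = len(rows[0])
        self.medians_ = []
        for j in range(n_cols):
            observed = [r[j] for r in rows if not math.isnan(r[j])]
            self.medians_.append(median(observed))
        return self

    def transform(self, rows):
        # Output layout: imputed values first, then one missing flag per column.
        out = []
        for r in rows:
            flags = [1.0 if math.isnan(v) else 0.0 for v in r]
            filled = [m if math.isnan(v) else v for v, m in zip(r, self.medians_)]
            out.append(filled + flags)
        return out


nan = float("nan")
train = [[1.0, 10.0], [3.0, nan], [nan, 30.0], [5.0, 20.0]]
test = [[nan, 40.0], [100.0, nan]]

imputer = MedianImputerWithIndicator().fit(train)  # statistics come from train only
train_out = imputer.transform(train)
test_out = imputer.transform(test)                 # reuses train medians, no refit
```

The point of the split between `fit` and `transform` is that the held-out rows never influence the fill values: the outlier `100.0` in `test` leaves the learned medians unchanged. The appended flags let the model use missingness itself as a feature when it carries signal.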

follow-up

  • What is the difference between MCAR, MAR, and MNAR?
  • Why can dropping rows with missing values bias a model?
  • How would you distinguish legitimate missingness from a broken feature pipeline?
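
One possible starting point for the last question is to compare per-feature missing rates in a production window against the rates seen at training time. The sketch below assumes hypothetical feature names (`income`, `age`), made-up baseline rates, and an arbitrary `abs_tol` threshold; none of these come from a real system.

```python
import math


def missing_rate(values):
    """Fraction of entries that are None or NaN."""
    n_missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return n_missing / len(values)


def check_missing_rates(baseline, batch, abs_tol=0.10):
    """Return features whose batch missing rate drifts from the training baseline.

    baseline: {feature: missing rate measured on training data}
    batch:    {feature: list of raw values from a production window}
    abs_tol:  illustrative threshold, not a recommended value
    """
    alerts = {}
    for name, expected in baseline.items():
        observed = missing_rate(batch[name])
        if abs(observed - expected) > abs_tol:
            alerts[name] = (expected, observed)
    return alerts


baseline = {"income": 0.20, "age": 0.01}
batch = {
    "income": [52000.0, None, 61000.0, 48000.0, None],  # 2 of 5 missing
    "age": [34.0, 41.0, 29.0, 55.0, 38.0],              # none missing
}
alerts = check_missing_rates(baseline, batch)
```

A sudden jump flagged this way does not by itself say whether the cause is a broken upstream job or a real shift in the population; it only narrows down which feature to investigate, for example by checking whether the jump lines up with a deploy or is concentrated in one data source.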