How do you handle missing data without introducing bias?

Question

How do you handle missing data in a production ML system? When is missingness itself predictive, and when does imputation introduce bias? Think about: MCAR, MAR, MNAR, imputation inside CV, missing indicators, train-serve skew, and whether missingness is caused by the product or the user.

Accepted Answer

**Types of missingness**

Missing data is not one problem.

- MCAR: missing completely at random. Example: random logging dropout.
- MAR: missingness depends on observed variables. Example: income missing more of