mlprep
ML Breadth · medium · 12 min

You're building a fraud detection model. Only 0.1% of transactions are fraud. How do you handle this class imbalance?

formulate your answer, then read on.

tldr

Class-imbalance approaches:

  • Class-weighted loss: simplest; no data modification, just upweight minority-class errors in the objective.
  • SMOTE oversampling: synthesize minority examples by interpolating between minority-class neighbors; apply only to training data (never validation or test), and note that it distorts predicted probabilities (breaks calibration).
  • Threshold tuning: train normally, then move the decision threshold post hoc to trade precision against recall.
  • Anomaly detection: reframe the problem entirely when the minority class is tiny and heterogeneous.

Evaluate with AUC-PR, not AUC-ROC. Accuracy is useless at 1:1000 ratios: a model that predicts "legit" for every transaction is 99.9% accurate and catches zero fraud.
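A minimal pure-Python sketch of the two cheapest points above: why accuracy fails at a 1:1000 ratio, and how threshold tuning trades precision for recall. The simulated labels and Gaussian `scores` are made-up stand-ins for real model outputs, not any particular model:

```python
import random

random.seed(0)

# Simulate 100,000 transactions at a 0.1% fraud rate (1 = fraud).
N = 100_000
labels = [1 if random.random() < 0.001 else 0 for _ in range(N)]

# A useless baseline: predict "legit" (0) for everything.
baseline_acc = sum(1 for y in labels if y == 0) / N
# baseline_acc is ~0.999, yet recall on fraud is exactly 0.

# Fake model scores: frauds tend to score higher, but distributions overlap.
scores = [random.gauss(2.0, 1.0) if y else random.gauss(0.0, 1.0) for y in labels]

def precision_recall(scores, labels, threshold):
    """Precision and recall for the positive (fraud) class at a threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Lowering the threshold raises recall at the cost of precision.
for t in (3.0, 2.0, 1.0):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.3f} recall={r:.3f}")
```

The sweep over thresholds is exactly the post-hoc knob from the tldr: nothing about the model changes, only where you cut its scores.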

follow-up

  • Why is AUC-ROC misleading for imbalanced classification, and why is AUC-PR better?
  • SMOTE creates interpolated examples between minority class neighbors. What can go wrong with this assumption?
  • How do you recalibrate a model's predicted probabilities after oversampling to reflect the true class prior?
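The last follow-up has a closed-form answer under a common assumption (resampling shifts the class prior but leaves the class-conditional score distributions alone): rescale the predicted odds by the ratio of true-prior odds to training-prior odds. A sketch, with `recalibrate` as an illustrative helper name:

```python
def recalibrate(p_pred, train_prior, true_prior):
    """Map a probability predicted under a resampled training prior
    back to the true prior by rescaling the odds."""
    odds = p_pred / (1.0 - p_pred)
    # Ratio of true-class odds to training-class odds.
    ratio = (true_prior / (1.0 - true_prior)) / (train_prior / (1.0 - train_prior))
    odds *= ratio
    return odds / (1.0 + odds)

# Trained on 50/50 resampled data, deployed at a 0.1% fraud rate:
# a maximally uncertain prediction of 0.5 maps back to the base rate.
print(recalibrate(0.5, train_prior=0.5, true_prior=0.001))  # ≈ 0.001
```

Sanity checks: when the training and true priors match, probabilities pass through unchanged; when a model trained on balanced data outputs 0.5 (pure uncertainty), the correction returns the true base rate.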