You're building a fraud detection model. Only 0.1% of transactions are fraud. How do you handle this class imbalance?
formulate your answer, then —
tldr
Class imbalance approaches: (1) class-weighted loss: simplest, no data modification, just upweight minority-class errors in the objective; (2) SMOTE oversampling: synthesizes minority examples by interpolating between nearest neighbors; apply it only to the training split, and note it distorts probability calibration; (3) threshold tuning: train normally, then move the decision threshold post hoc to trade precision against recall; (4) anomaly detection when the minority class is tiny or its patterns drift. Evaluate with AUC-PR, not AUC-ROC. Accuracy is useless at 1:1000 ratios: a model that predicts "not fraud" for every transaction already scores 99.9%.
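Approaches (1) and (3) can be sketched together in plain numpy. This is a toy illustration, not a fraud pipeline: the data, the 1% prior (milder than the 0.1% in the question, to keep the demo stable), and the `fit_logreg` helper are all invented here. Class weighting is implemented as per-sample weights inside the log-loss gradient, which is how libraries like scikit-learn's `class_weight` realize it.

```python
import numpy as np

# Toy imbalanced data: ~1% positives, one informative feature.
rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.01).astype(float)
X = rng.normal(loc=2.0 * y, scale=1.0, size=n).reshape(-1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, class_weight=None, lr=0.3, steps=4000):
    """Logistic regression by gradient descent; per-sample weights
    implement the class-weighted log loss (approach 1)."""
    w, b = np.zeros(X.shape[1]), 0.0
    if class_weight is None:
        sw = np.ones_like(y)
    else:
        sw = np.where(y == 1, class_weight[1], class_weight[0])
    sw = sw / sw.mean()
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        g = sw * (p - y)                 # weighted gradient of log loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# "balanced" heuristic: n / (n_classes * count_per_class)
n_pos = y.sum()
cw = {0: n / (2 * (n - n_pos)), 1: n / (2 * n_pos)}

w_plain, b_plain = fit_logreg(X, y)
w_bal, b_bal = fit_logreg(X, y, class_weight=cw)
p_plain = sigmoid(X @ w_plain + b_plain)
p_bal = sigmoid(X @ w_bal + b_bal)

recall = lambda p, t: ((p >= t) & (y == 1)).sum() / n_pos

# The unweighted model at the default 0.5 threshold flags almost nothing;
# the weighted model recovers recall at the same threshold.
print(recall(p_plain, 0.5), recall(p_bal, 0.5))

# Approach (3): keep the unweighted model, lower the threshold post hoc
# (here: flag the top 2% of scores instead of using 0.5).
t = np.quantile(p_plain, 0.98)
print(recall(p_plain, t))
```

Note that (1) and (3) are closely related: reweighting shifts the learned intercept, while threshold tuning shifts the cutoff on an unchanged model; with logistic regression the two largely coincide.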
follow-up
- Why is AUC-ROC misleading for imbalanced classification, and why is AUC-PR better?
- SMOTE creates interpolated examples between minority class neighbors. What can go wrong with this assumption?
- How do you recalibrate a model's predicted probabilities after oversampling to reflect the true class prior?
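The first follow-up can be made concrete with a toy experiment (the data and helper functions below are invented for illustration). ROC AUC equals the probability that a random positive outscores a random negative, so the class ratio cancels out of it entirely; average precision does not have that property, because precision competes against every extra negative.

```python
import numpy as np

rng = np.random.default_rng(1)

def auc_roc(scores_pos, scores_neg):
    """ROC AUC via its Mann-Whitney interpretation:
    P(random positive scores above random negative).
    The class ratio cancels out completely."""
    return (scores_pos[:, None] > scores_neg[None, :]).mean()

def average_precision(scores_pos, scores_neg):
    """Area under the precision-recall curve, sweeping the threshold
    down the sorted scores; each positive adds 1/P to recall."""
    y = np.concatenate([np.ones(len(scores_pos)), np.zeros(len(scores_neg))])
    s = np.concatenate([scores_pos, scores_neg])
    y = y[np.argsort(-s)]
    precision = np.cumsum(y) / np.arange(1, len(y) + 1)
    recall_steps = y / y.sum()
    return (precision * recall_steps).sum()

pos = rng.normal(1.0, 1.0, 100)            # same score distribution
neg_small = rng.normal(0.0, 1.0, 100)      # balanced: 1:1
neg_large = rng.normal(0.0, 1.0, 100_000)  # imbalanced: 1:1000

# Identical score distributions, so ROC AUC barely moves...
print(auc_roc(pos, neg_small), auc_roc(pos, neg_large))
# ...while average precision collapses under 1000x more negatives.
print(average_precision(pos, neg_small), average_precision(pos, neg_large))
```

This is the core of the answer: a classifier can look excellent on ROC AUC while flagging mostly false positives at any usable operating point, and AUC-PR exposes that.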
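For the last follow-up, one standard answer is prior-shift correction (often attributed to Elkan, 2001): if oversampling changed only the class prior, from the true prior pi_t to the training prior pi_s, and left the class-conditional densities alone, Bayes' rule maps a training-time probability p_s back to p_t = (p_s * pi_t/pi_s) / (p_s * pi_t/pi_s + (1 - p_s) * (1 - pi_t)/(1 - pi_s)). A minimal sketch under that assumption (function name is mine):

```python
import numpy as np

def recalibrate(p_s, pi_s, pi_t):
    """Map probabilities calibrated under the resampled training prior
    pi_s back to the deployment prior pi_t (prior-shift correction).
    Assumes resampling changed only the prior, not the class-conditional
    feature distributions."""
    num = p_s * pi_t / pi_s
    den = num + (1.0 - p_s) * (1.0 - pi_t) / (1.0 - pi_s)
    return num / den

# After oversampling fraud to 50/50, a model output of 0.5 corresponds
# to a true probability of only 0.001 when fraud is really 0.1%:
p = recalibrate(np.array([0.5, 0.9, 0.99]), pi_s=0.5, pi_t=0.001)
print(p)
```

Note this rescales probabilities monotonically, so it changes calibration but not ranking; if SMOTE also distorted the score distribution's shape, a learned recalibration (Platt scaling or isotonic regression on an untouched validation split) is the fallback.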