Imbalanced Data
Collect More Data
1. A larger dataset might have a more balanced class distribution
2. More examples of the minority classes may be useful when resampling
Change Performance Metric
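Accuracy can be misleading on an imbalanced dataset: a model that always predicts the majority class scores high accuracy while never detecting the minority class
Metrics: confusion matrix, precision, recall, F1 score, Cohen's kappa, ROC curves, etc.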
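To make the point about accuracy concrete, here is a minimal sketch in plain Python. The function name `binary_metrics` and the 95/5 toy labels are illustrative assumptions, not part of any library. Precision, recall, and F1 are set to 0 when their denominator is zero, which is one common convention.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts and derived metrics for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Toy data (assumed): 95 majority (0) and 5 minority (1) instances.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this toy data the accuracy is 0.95 while recall, F1, and kappa are all 0, which is why the other metrics are worth reporting alongside accuracy.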
Resampling Datasets
1. Over-sampling: bootstrap (resample with replacement) instances of the under-represented class
2. Under-sampling: delete instances from the over-represented class
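Both resampling strategies can be sketched with the standard library alone. The helper names `random_oversample` and `random_undersample`, the fixed seed, and the 90/10 toy data are assumptions made for illustration. The sketch balances every class to the largest (or smallest) class size, which is only one possible target ratio.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Grow every class to the size of the largest class by sampling
    its instances with replacement (bootstrap)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        extra = rng.choices(pool, k=target - count)  # with replacement
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Shrink every class to the size of the smallest class by keeping
    a random subset of its instances (without replacement)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        kept = rng.sample(pool, k=target)  # without replacement
        X_out.extend(kept)
        y_out.extend([label] * len(kept))
    return X_out, y_out


# Toy data (assumed): 90 majority and 10 minority instances.
X = list(range(100))
y = [0] * 90 + [1] * 10

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class distribution.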
Generate Synthetic Samples
Randomly sample the attributes from instances in the minority class
1. Generative Models, Naive Bayes, KDE
2. Synthetic Minority Over-sampling Technique (SMOTE)
- Selects two or more similar instances and perturbs one attribute at a time by a random amount within the difference to the neighboring instances
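The SMOTE idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the name `smote_sketch`, the default `k=3`, Euclidean distance, and the toy points are all hypothetical choices. It draws a separate random amount per attribute, following the description above; other variants use a single amount for all attributes, which places the new point on the line segment between the two instances.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by moving a base instance
    part of the way toward one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base, excluding the base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Perturb each attribute by a random fraction of the difference
        # between the base and the chosen neighbour.
        new_point = [b + rng.random() * (n - b)
                     for b, n in zip(base, neighbour)]
        synthetic.append(new_point)
    return synthetic


# Toy minority class (assumed): four points with two numeric attributes.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Because each attribute is interpolated between two existing instances, every synthetic value stays within the range already observed in the minority class. The sketch handles numeric attributes only.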
Try Different Algorithms
Tree models often perform well on imbalanced datasets
Try Penalized Models
Impose an additional cost on the model for misclassifying minority-class instances during training, which biases the model toward paying more attention to the minority class
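One way to picture the penalty is as a per-class weight on the training loss. The sketch below is an assumption-laden illustration: `balanced_class_weights` uses a common inverse-frequency heuristic (n_samples / (n_classes * class_count)), `weighted_log_loss` is a hypothetical helper for binary labels 0/1, and the toy predictions are made up.

```python
import math
from collections import Counter


def balanced_class_weights(y):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def weighted_log_loss(y_true, p_positive, weights, eps=1e-12):
    """Binary log loss in which each instance is scaled by the weight
    of its true class, normalised by the total weight."""
    total, weight_sum = 0.0, 0.0
    for label, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        loss = -math.log(p) if label == 1 else -math.log(1 - p)
        total += weights[label] * loss
        weight_sum += weights[label]
    return total / weight_sum


# Toy data (assumed): 90 majority (0) and 10 minority (1) instances.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)

# A model that gives every instance a 10% chance of being positive.
p_positive = [0.1] * 100
plain = weighted_log_loss(y, p_positive, {0: 1.0, 1: 1.0})
penalized = weighted_log_loss(y, p_positive, weights)
print(weights, plain, penalized)
```

With these weights a mistake on a minority instance counts nine times as much as a mistake on a majority instance, so the same predictions receive a higher penalized loss than plain loss.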
Try a Different Perspective
1. Anomaly detection: treat instances of the minority class as outliers
2. Change detection: detect anomalous sequences or states in a time series
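As a rough illustration of the anomaly-detection perspective, the sketch below models only the majority ("normal") class and flags anything far from it. The z-score rule, the threshold of 3 standard deviations, the single numeric feature, and the name `fit_zscore_detector` are all assumptions chosen for brevity; real detectors are usually multivariate.

```python
import statistics


def fit_zscore_detector(normal_values, threshold=3.0):
    """Fit mean and standard deviation on majority-class values and
    return a function that flags values far from that distribution."""
    mu = statistics.fmean(normal_values)
    sigma = statistics.stdev(normal_values)

    def is_anomaly(x):
        return abs(x - mu) / sigma > threshold

    return is_anomaly


# Toy majority-class measurements (assumed).
normal = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
is_anomaly = fit_zscore_detector(normal)

print(is_anomaly(10.1))  # close to the majority distribution
print(is_anomaly(15.0))  # far from the majority distribution
```

The detector never sees minority examples during fitting, which is the appeal of this framing when minority instances are extremely rare.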
Try Getting Creative
Decompose the larger class into a number of smaller classes
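A minimal sketch of this decomposition, assuming the majority class has some meaningful grouping to split on. The helpers `decompose_majority` and `collapse`, the `channel` field, and the "normal"/"fraud" labels are hypothetical; in practice the sub-classes might come from domain knowledge or clustering.

```python
from collections import Counter


def decompose_majority(X, y, majority_label, group_of):
    """Relabel each majority instance as (majority_label, group) so the
    large class becomes several smaller sub-classes."""
    return [(label, group_of(x)) if label == majority_label else label
            for x, label in zip(X, y)]


def collapse(label, majority_label):
    """Map a predicted sub-class back to the original class label."""
    if isinstance(label, tuple) and label[0] == majority_label:
        return majority_label
    return label


# Toy data (assumed): 90 "normal" records across three channels
# and 10 "fraud" records.
X = ([{"channel": "web"}] * 30 + [{"channel": "store"}] * 30
     + [{"channel": "phone"}] * 30 + [{"channel": "web"}] * 10)
y = ["normal"] * 90 + ["fraud"] * 10

y_sub = decompose_majority(X, y, "normal", lambda x: x["channel"])
print(Counter(y_sub))

# After prediction, sub-class labels are collapsed to the original ones.
y_back = [collapse(label, "normal") for label in y_sub]
```

In this toy case the class ratio drops from 9:1 to 3:1 per sub-class, and `collapse` recovers the original labels for evaluation.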
Reference