Data Science

Imbalanced Data

Collect More Data

1. A larger dataset may turn out to be more balanced across the classes

2. More examples of the minority classes are useful later when resampling

Change Performance Metric

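Accuracy can be misleading on an imbalanced dataset: a model that always predicts the majority class scores high accuracy while never detecting the minority class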

Metrics: confusion matrix, precision, recall, F1 score, Cohen's kappa, ROC curves, etc.
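
To make the point about accuracy concrete, here is a minimal sketch in plain Python that derives several of these metrics from the confusion matrix. The function name binary_metrics and the 95/5 toy labels are illustrative assumptions, not part of the original notes.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts and derived metrics for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    n = len(pairs)
    tn = n - tp - fp - fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Cohen's kappa: agreement beyond what the class frequencies alone predict.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0

    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Toy data (assumed): 95 negatives, 5 positives, and a "model" that
# always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
majority_baseline = binary_metrics(y_true, y_pred)
```

On this toy data the accuracy is high, while recall, F1 and kappa are all zero, which is the failure mode that accuracy alone hides.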

Resampling Datasets

1. Over-sampling: bootstrap (resample with replacement) instances of the under-represented class

2. Under-sampling: delete instances from the over-represented class
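
Both resampling strategies can be sketched with the standard library alone. The helper names and the 90/10 toy split below are assumptions made for illustration. The over-sampler keeps the original minority instances and adds copies drawn with replacement until the classes match in size.

```python
import random


def random_oversample(majority, minority, rng):
    """Add minority copies, drawn with replacement, up to the majority size."""
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority, minority + extra


def random_undersample(majority, minority, rng):
    """Keep a random majority subset (no replacement) of the minority size."""
    return rng.sample(majority, len(minority)), minority


rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("neg", i) for i in range(90)]
minority = [("pos", i) for i in range(10)]

over_major, over_minor = random_oversample(majority, minority, rng)
under_major, under_minor = random_undersample(majority, minority, rng)
```

Over-sampling duplicates information and can encourage overfitting to the repeated instances, while under-sampling throws data away, so it is worth trying both and comparing on a held-out set that keeps the original class ratio.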

Generate Synthetic Samples

Randomly sample the attributes from instances in the minority class

1. Generative models, e.g. Naive Bayes or kernel density estimation (KDE)

2. Synthetic Minority Over-sampling Technique (SMOTE)

SMOTE selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighboring instances
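
The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration in plain Python, not a reference implementation: the function name smote_sketch, the value k=3 and the toy points are assumptions. It draws a separate random factor for each attribute, as described above; many implementations instead draw a single factor per synthetic point.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (Euclidean distance).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Move each attribute a random fraction of the way towards the neighbour.
        point = tuple(b + rng.random() * (n - b)
                      for b, n in zip(base, neighbour))
        synthetic.append(point)
    return synthetic


# Toy minority class with two numeric attributes (assumed data).
minority_points = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1), (1.4, 2.2)]
synthetic_points = smote_sketch(minority_points, n_new=20, k=3,
                                rng=random.Random(42))
```

Every synthetic point lies inside the box spanned by a minority instance and one of its neighbours, so the new samples stay close to the region the minority class already occupies.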

Try Different Algorithms

Tree models often perform well on imbalanced datasets

Try Penalized Models

Impose an additional cost (penalty) on the model for misclassifying the minority class during training, so that the model pays more attention to that class
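
One common way to apply such a penalty is to weight each example's loss by its class. The sketch below uses the heuristic n_samples / (n_classes * class_count), the same formula scikit-learn documents for class_weight="balanced". The function names and the 90/10 toy labels are assumptions.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


def weighted_log_loss(y_true, p_pos, weights, eps=1e-12):
    """Binary cross-entropy with each example scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
    return total / sum(weights[y] for y in y_true)


labels = [0] * 90 + [1] * 10  # toy 90/10 imbalance (assumed)
class_weights = balanced_class_weights(labels)
```

With these weights a mistake on a minority example costs nine times as much as the same mistake on a majority example, so each class contributes equally to the total loss.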

Try a Different Perspective

1. Anomaly detection: treat the instances of the minority class as outliers

2. Change detection: detect anomalous sequences or states in a time series
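
As a rough illustration of the anomaly-detection view (item 1 only), the sketch below fits a per-feature mean and standard deviation on the majority class alone and flags anything far from that profile. The z-score rule, the threshold of 3 and the toy data are assumptions chosen for simplicity; real anomaly detectors are usually more sophisticated.

```python
import statistics


def fit_normal_profile(rows):
    """Per-feature (mean, standard deviation) of the majority ("normal") class."""
    columns = list(zip(*rows))
    return [(statistics.fmean(col), statistics.stdev(col)) for col in columns]


def is_anomaly(row, profile, threshold=3.0):
    """True if any feature is more than `threshold` std devs from its mean."""
    return any(abs(x - mean) > threshold * std
               for x, (mean, std) in zip(row, profile))


# Toy majority-class data with two features (assumed).
normal_rows = [(10.0 + 0.1 * i, 5.0 - 0.05 * i) for i in range(-10, 11)]
profile = fit_normal_profile(normal_rows)

typical_flagged = is_anomaly((10.2, 5.1), profile)
outlier_flagged = is_anomaly((14.0, 5.0), profile)
```

The model never sees minority-class labels during fitting, which is what makes this framing attractive when minority examples are extremely rare.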

Try Getting Creative

Decompose the larger class into a number of smaller classes
