Imbalanced Data
Collect More Data
  • 1. A larger dataset might have a more balanced class distribution
  • 2. More examples of the minority classes may be useful in resampling
Change Performance Metric
  • Accuracy can mislead on an imbalanced dataset: a model that always predicts the majority class still scores high
  • Metrics: confusion matrix, precision, recall, F1 score, Cohen's kappa, ROC curves, etc.
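To illustrate why accuracy misleads, here is a minimal sketch in plain Python that scores a hypothetical classifier which always predicts the majority class. The 95/5 class split and the `binary_metrics` helper are assumptions made for illustration only.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return confusion-matrix counts and derived metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical data: 5 positives, 95 negatives, and a model that always says "negative".
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
scores = binary_metrics(y_true, y_pred)
print(scores)
```

On this toy data accuracy is 0.95, while recall, F1 and kappa are all 0, which exposes that the model never finds the minority class.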
Resampling Datasets
  • 1. Over-sampling: bootstrap (resample with replacement) instances of the under-represented class
  • 2. Under-sampling: delete instances from the over-represented class
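Both resampling strategies can be sketched with the standard library alone. The helper names (`random_oversample`, `random_undersample`) and the 90/10 toy dataset are assumptions for illustration; this is one simple way to do it, not a reference implementation.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(rows, labels):
    """Collect the rows belonging to each class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Bootstrap (sample with replacement) each class up to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        extra = rng.choices(members, k=target - len(members))  # with replacement
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Randomly drop instances so every class shrinks to the smallest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        out_rows.extend(rng.sample(members, target))  # without replacement
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical dataset: 90 majority rows and 10 minority rows.
rows = [(i,) for i in range(100)]
labels = ["major"] * 90 + ["minor"] * 10
_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print(Counter(over_labels), Counter(under_labels))
```

Resampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.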
Generate Synthetic Samples
  • Randomly sample the attributes from instances in the minority class
  • 1. Generative models, e.g. Naive Bayes or kernel density estimation (KDE)
  • 2. Synthetic Minority Over-sampling Technique (SMOTE)
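The two ideas above can be sketched as follows. `sample_attributes` draws each attribute independently from values seen in the minority class, and `smote_sketch` is a simplified take on the SMOTE idea (interpolating between a minority point and one of its nearest minority neighbours). Both function names, the parameter `k`, and the toy points are assumptions for illustration; a maintained implementation such as the one in the imbalanced-learn package is likely a better choice in practice.

```python
import math
import random


def sample_attributes(minority, n_new, seed=0):
    """Naive generator: draw every attribute independently from the
    values observed for that attribute in the minority class."""
    rng = random.Random(seed)
    columns = list(zip(*minority))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n_new)]


def smote_sketch(minority, n_new, k=3, seed=0):
    """Simplified SMOTE-style generator: interpolate between a random
    minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points with two numeric attributes.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
naive_points = sample_attributes(minority, n_new=6)
smote_points = smote_sketch(minority, n_new=6, k=2)
print(naive_points)
print(smote_points)
```

Independent attribute sampling ignores correlations between attributes, whereas interpolation keeps new points on line segments between existing minority instances.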
Try Different Algorithms
  • Tree-based models often perform well on imbalanced datasets
Try Penalized Models
  • Impose an additional cost on the model for misclassifying the minority class during training
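One common way to impose such a cost is to weight each example's loss by the inverse frequency of its class. The sketch below assumes binary labels (1 = minority) and a log-loss objective; `balanced_class_weights` and `weighted_log_loss` are hypothetical helpers, and the weighting formula follows the inverse-frequency heuristic that scikit-learn documents for `class_weight="balanced"`.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weighted_log_loss(y_true, p_positive, weights):
    """Weighted average log loss; p_positive[i] is the predicted P(y = 1)."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_positive):
        w = weights[y]
        total += -w * math.log(p if y == 1 else 1.0 - p)
        weight_sum += w
    return total / weight_sum


# Hypothetical data: 90 negatives, 10 positives, and a model that
# assigns a low positive probability (0.1) to every example.
y_true = [0] * 90 + [1] * 10
p_positive = [0.1] * 100

weights = balanced_class_weights(y_true)
plain = weighted_log_loss(y_true, p_positive, {0: 1.0, 1: 1.0})
penalized = weighted_log_loss(y_true, p_positive, weights)
print(weights, plain, penalized)
```

With these weights each class contributes the same total weight, so a model that ignores the minority class is penalized more heavily than under the unweighted loss.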
Try a Different Perspective
  • 1. Anomaly detection: treat the instances of the minority class as outliers
  • 2. Change detection: detect anomalous sequences or states in a time series
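Both perspectives can be sketched with very simple statistics. The anomaly detector below models only the majority class and flags rows with a large z-score; the change detector is a basic one-sided CUSUM. The function names, the threshold of 3 standard deviations, the `slack` and `threshold` parameters, and the toy data are all assumptions for illustration, and real problems usually call for more robust methods.

```python
import random
import statistics


def fit_profile(majority_rows):
    """Per-feature mean and standard deviation of the majority ("normal") class."""
    return [(statistics.fmean(col), statistics.stdev(col)) for col in zip(*majority_rows)]


def anomaly_score(row, profile):
    """Largest absolute z-score across features (assumes no feature is constant)."""
    return max(abs(x - mean) / std for x, (mean, std) in zip(row, profile))


def cusum_change_point(series, baseline_mean, slack=0.5, threshold=5.0):
    """One-sided CUSUM: first index where an upward mean shift is flagged, else None."""
    cumulative = 0.0
    for i, x in enumerate(series):
        cumulative = max(0.0, cumulative + (x - baseline_mean - slack))
        if cumulative > threshold:
            return i
    return None


# Anomaly detection: learn what "normal" looks like from the majority class only.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
profile = fit_profile(majority)
is_outlier = anomaly_score((6.0, 0.0), profile) > 3.0   # far from the majority
is_typical = anomaly_score((0.1, -0.2), profile) > 3.0  # close to the majority

# Change detection: the mean of this series jumps from 0 to 2 at index 20.
series = [0.0] * 20 + [2.0] * 20
change_at = cusum_change_point(series, baseline_mean=0.0)
print(is_outlier, is_typical, change_at)
```

On this toy series the CUSUM statistic crosses the threshold at index 23, three samples after the shift at index 20; a smaller `threshold` reacts faster at the cost of more false alarms.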
Try Getting Creative
  • Decompose the larger class into a number of smaller classes
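One possible way to decompose the larger class is to cluster it and treat each cluster as its own class, then map predictions back afterwards. The sketch below uses a small hand-written k-means for that; clustering is an assumption here (the notes do not say how to split the class), and `kmeans_assign`, `decompose_majority`, `collapse` and the toy data are hypothetical.

```python
import math
import random


def kmeans_assign(points, k, n_iter=20, seed=0):
    """Tiny k-means: return a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(n_iter):
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the previous center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment


def decompose_majority(rows, labels, majority, k, seed=0):
    """Replace the majority label with (majority, cluster_index) sub-labels."""
    idx = [i for i, y in enumerate(labels) if y == majority]
    clusters = kmeans_assign([rows[i] for i in idx], k, seed=seed)
    new_labels = list(labels)
    for i, cluster in zip(idx, clusters):
        new_labels[i] = (majority, cluster)
    return new_labels


def collapse(label):
    """Map a sub-label back to its original class after prediction."""
    return label[0] if isinstance(label, tuple) else label


# Hypothetical data: the majority class "neg" forms two blobs, "pos" is rare.
rng = random.Random(1)
rows = (
    [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(40)]
    + [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(40)]
    + [(rng.gauss(2.5, 0.5), rng.gauss(2.5, 0.5)) for _ in range(10)]
)
labels = ["neg"] * 80 + ["pos"] * 10
new_labels = decompose_majority(rows, labels, majority="neg", k=2)
restored = [collapse(label) for label in new_labels]
print(sorted(set(new_labels), key=str))
```

A classifier would then be trained on `new_labels`, and its predictions passed through `collapse` to recover the original two classes.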
References
  • Classification when 80% of my training set is of one class at Reddit
  • A Brief Introduction to Change Point Detection using Python
  • Cohen's kappa
  • 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset