SMOTE: Synthetic Minority Over-sampling Technique. An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world datasets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
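The abstract says only that synthetic minority class examples are created. As commonly described, SMOTE builds each synthetic example by interpolating between a minority example and one of its k nearest minority-class neighbours. The following is a minimal sketch of that idea combined with random under-sampling of the majority class. It is not the authors' reference implementation: the function names `smote` and `undersample`, the parameters `n_synthetic`, `k` and `seed`, and the choice to pick the base example at random (rather than looping over every minority example) are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority neighbours.

    Simplified sketch: the base example is drawn at random each time,
    which is an assumption of this illustration.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the base (Euclidean distance),
        # excluding the base example itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def undersample(majority, n_keep, seed=0):
    """Randomly keep at most n_keep majority examples, without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, min(n_keep, len(majority)))


# Hypothetical toy data: 4 minority points and 20 majority points in 2-D.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(x), float(y)) for x in range(5) for y in range(4)]

balanced_minority = minority + smote(minority, n_synthetic=6, k=2)
balanced_majority = undersample(majority, n_keep=10)
```

Because every synthetic point lies on a segment between two existing minority examples, the new points stay inside the region the minority class already occupies, instead of duplicating existing examples as plain over-sampling with replacement does.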
References in zbMATH (referenced in 152 articles, 1 standard article)
Showing results 141 to 152 of 152.
- Mazurowski, Maciej A.; Habas, Piotr A.; Zurada, Jacek M.; Lo, Joseph Y.; Baker, Jay A.; Tourassi, Georgia D.: Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance (2008)
- Peng, Xiang; King, Irwin: Robust BMPM training based on second-order cone programming and its application in medical diagnosis (2008)
- Robinson, Mark; Castellano, Cristina González; Rezwan, Faisal; Adams, Rod; Davey, Neil; Rust, Alastair; Sun, Yi: Combining experts in order to identify binding sites in yeast and mouse genomic data (2008)
- Zhao, Huimin: Instance weighting versus threshold adjusting for cost-sensitive classification (2008)
- Sun, Yanmin; Kamel, Mohamed S.; Wong, Andrew K. C.; Wang, Yang: Cost-sensitive boosting for classification of imbalanced data (2007)
- Xie, Jigang; Qiu, Zhengding: The effect of imbalanced data sets on LDA: a theoretical and empirical analysis (2007)
- Yoon, Kihoon; Kwek, Stephen: A data reduction approach for resolving the imbalanced data issue in functional genomics (2007)
- Bandyopadhyay, Sanghamitra; Giannella, Chris; Maulik, Ujjwal; Kargupta, Hillol; Liu, Kun; Datta, Souptik: Clustering distributed data streams in peer-to-peer environments (2006)
- Crone, Sven F.; Lessmann, Stefan; Stahlbock, Robert: The impact of preprocessing on data mining: an evaluation of classifier sensitivity in direct marketing (2006)
- Huang, Yueh-Min; Hung, Chun-Min; Jiau, Hewijin Christine: Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem (2006)
- Buckinx, Wouter; Van den Poel, Dirk: Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting (2005)
- Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P.: SMOTE: Synthetic minority over-sampling technique (2002)