SMOTE

SMOTE: Synthetic Minority Over-sampling Technique. An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world datasets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
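
The abstract states only that synthetic minority class examples are created; it does not spell out the procedure. The following is a minimal sketch, assuming the commonly described scheme in which each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. The function name `smote_sketch`, its parameters, and the toy data are hypothetical and chosen for illustration only. As a simplification, base samples are drawn at random rather than iterated over with a fixed over-sampling percentage.

```python
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style over-sampling (not a reference implementation).

    minority    -- list of equal-length numeric tuples (minority class only)
    n_synthetic -- number of synthetic points to generate
    k           -- number of nearest minority neighbours to consider
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance; enough for ranking neighbours.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # For each minority sample, store the indices of its k nearest neighbours.
    neighbours = []
    for i, p in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sq_dist(p, minority[j]))
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # assumption: random base sample
        j = rng.choice(neighbours[i])      # one of its k nearest neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        base, nb = minority[i], minority[j]
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic


# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=6, k=2)
```

Because every synthetic point is an interpolation between two existing minority samples, it stays inside the region already spanned by the minority class. In a pipeline like the one the abstract describes, such points would be added to the training set, optionally alongside under-sampling of the majority class, before fitting the classifier.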


References in zbMATH (referenced in 152 articles, 1 standard article)

Showing results 141 to 152 of 152.
Sorted by year.


  1. Mazurowski, Maciej A.; Habas, Piotr A.; Zurada, Jacek M.; Lo, Joseph Y.; Baker, Jay A.; Tourassi, Georgia D.: Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance (2008)
  2. Peng, Xiang; King, Irwin: Robust BMPM training based on second-order cone programming and its application in medical diagnosis (2008)
  3. Robinson, Mark; Castellano, Cristina González; Rezwan, Faisal; Adams, Rod; Davey, Neil; Rust, Alastair; Sun, Yi: Combining experts in order to identify binding sites in yeast and mouse genomic data (2008)
  4. Zhao, Huimin: Instance weighting versus threshold adjusting for cost-sensitive classification (2008)
  5. Sun, Yanmin; Kamel, Mohamed S.; Wong, Andrew K. C.; Wang, Yang: Cost-sensitive boosting for classification of imbalanced data (2007)
  6. Xie, Jigang; Qiu, Zhengding: The effect of imbalanced data sets on LDA: a theoretical and empirical analysis (2007)
  7. Yoon, Kihoon; Kwek, Stephen: A data reduction approach for resolving the imbalanced data issue in functional genomics (2007)
  8. Bandyopadhyay, Sanghamitra; Giannella, Chris; Maulik, Ujjwal; Kargupta, Hillol; Liu, Kun; Datta, Souptik: Clustering distributed data streams in peer-to-peer environments (2006)
  9. Crone, Sven F.; Lessmann, Stefan; Stahlbock, Robert: The impact of preprocessing on data mining: an evaluation of classifier sensitivity in direct marketing (2006)
  10. Huang, Yueh-Min; Hung, Chun-Min; Jiau, Hewijin Christine: Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem (2006)
  11. Buckinx, Wouter; Van den Poel, Dirk: Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting (2005)
  12. Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P.: SMOTE: Synthetic minority over-sampling technique (2002)
