SMOTE
SMOTE: Synthetic Minority Over-sampling Technique. An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Real-world datasets are often predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. The cost of misclassifying an abnormal (interesting) example as a normal example is also often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. It also shows that the same combination can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or the class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
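The abstract does not spell out how the synthetic minority examples are built. As a rough illustration, SMOTE is commonly described as interpolating between a minority sample and one of its k nearest minority-class neighbours. The sketch below follows that description under simplifying assumptions: the function name `smote`, its parameters, and the choice to pick base samples at random (rather than looping over every minority sample a fixed number of times) are illustrative and not taken from the source.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    Hypothetical helper: `minority` is a list of equal-length numeric tuples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Assumption: base samples are drawn at random with replacement.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


if __name__ == "__main__":
    minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote(minority_points, n_synthetic=3, k=2):
        print(point)
```

Every synthetic point lies on a line segment between two existing minority samples, so the minority region is filled in rather than duplicated. Combining this with random removal of majority-class examples gives the over-sampling plus under-sampling scheme the abstract evaluates.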
References in zbMATH (referenced in 152 articles, 1 standard article)
Showing results 61 to 80 of 152.
- Feng, Shou; Fu, Ping; Zheng, Wenbin: A hierarchical multi-label classification algorithm for gene function prediction (2017)
- Gong, Joonho; Kim, Hyunjoong: RHSBoost: improving classification performance in imbalance data (2017)
- Hara, Kota; Chellappa, Rama: Growing regression tree forests by classification for continuous object pose estimation (2017)
- Koziarski, Michał; Woźniak, Michał: CCR: a combined cleaning and resampling algorithm for imbalanced data classification (2017)
- Krautenbacher, Norbert; Theis, Fabian J.; Fuchs, Christiane: Correcting classifiers for sample selection bias in two-phase case-control studies (2017)
- Li, Qian; Li, Gang; Niu, Wenjia; Cao, Yanan; Chang, Liang; Tan, Jianlong; Guo, Li: Boosting imbalanced data learning with Wiener process oversampling (2017)
- Maldonado, Sebastián; Pérez, Juan; Bravo, Cristián: Cost-based feature selection for support vector machines: an application in credit scoring (2017)
- Núñez, Haydemar; Gonzalez-Abril, Luis; Angulo, Cecilio: Improving SVM classification on imbalanced datasets by introducing a new bias (2017)
- Roy, Asis; Bhattacharya, Sourangshu; Guin, Kalyan: Prediction of esophageal cancer using demographic, lifestyle, patient history, and basic clinical tests (2017)
- Wojciechowski, Szymon; Wilk, Szymon: Difficulty factors and preprocessing in imbalanced data sets: an experimental study on artificial data (2017)
- Chen, Yan-Cheng; Su, Chao-Ton: Distance-based margin support vector machine for classification (2016)
- Chmielnicki, Wiesław; Stąpor, Katarzyna: Using the one-versus-rest strategy with samples balancing to improve pairwise coupling classification (2016)
- Dong, Aimei; Chung, Fu-lai; Wang, Shitong: Semi-supervised classification method through oversampling and common hidden space (2016)
- Gámez, Juan Carlos; García, David; González, Antonio; Pérez, Raúl: Ordinal classification based on the sequential covering strategy (2016)
- Gong, Chunlin; Gu, Liangxian: A novel SMOTE-based classification approach to online data imbalance problem (2016)
- Cheng, Fan; Yang, Kang; Zhang, Lei: A structural SVM based approach for binary classification under class imbalance (2015)
- Datta, Shounak; Das, Swagatam: Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs (2015)
- Fernandez-Lozano, Carlos; Cuiñas, Rubén F.; Seoane, José A.; Fernández-Blanco, Enrique; Dorado, Julian; Munteanu, Cristian R.: Classification of signaling proteins based on molecular star graph descriptors using machine learning models (2015)
- Krempl, Georg; Kottke, Daniel; Lemaire, Vincent: Optimised probabilistic active learning (OPAL) (2015)
- Lee, J.; Wu, Y.; Kim, H.: Unbalanced data classification using support vector machines with active learning on scleroderma lung disease patterns (2015)