SMOTE
SMOTE: Synthetic Minority Over-sampling Technique. An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Real-world datasets are often composed predominantly of "normal" examples, with only a small percentage of "abnormal" or "interesting" examples. The cost of misclassifying an abnormal (interesting) example as a normal example is also often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. It also shows that this combination can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or the class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
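The abstract says only that the method creates synthetic minority class examples. As a rough illustration, the sketch below assumes the commonly described scheme: each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. The function name `smote_sketch`, its parameters, and the random choice of base samples are assumptions made for this example, not the authors' reference implementation.

```python
import numpy as np

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Hypothetical helper: interpolate between minority samples and
    their k nearest minority-class neighbours (Euclidean distance)."""
    X = np.asarray(minority, dtype=float)
    n = len(X)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # cannot have more neighbours than other samples
    rng = np.random.default_rng(seed)

    # Pairwise squared distances between all minority samples.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]

    # Assumption: base samples are drawn uniformly at random, a
    # simplification of generating a fixed number per minority sample.
    base = rng.integers(0, n, size=n_synthetic)
    pick = rng.integers(0, k, size=n_synthetic)
    chosen = neighbours[base, pick]

    # Random position in [0, 1) along the segment base -> neighbour.
    gap = rng.random((n_synthetic, 1))
    return X[base] + gap * (X[chosen] - X[base])

# Usage: six synthetic points from four minority samples in the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sketch(minority, n_synthetic=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region the minority class already occupies. For practical work, a maintained implementation such as the one in the imbalanced-learn package is likely a better choice than this sketch.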
References in zbMATH (referenced in 152 articles, 1 standard article)
Showing results 21 to 40 of 152, sorted by year.
- Steininger, Michael; Kobs, Konstantin; Davidson, Padraig; Krause, Anna; Hotho, Andreas: Density-based weighting for imbalanced regression (2021)
- Vargaftik, Shay; Keslassy, Isaac; Orda, Ariel; Ben-Itzhak, Yaniv: RADE: resource-efficient supervised anomaly detection using decision tree-based ensemble methods (2021)
- Abdallah, Zahraa S.; Gaber, Mohamed Medhat: Co-eye: a multi-resolution ensemble classifier for symbolically approximated time series (2020)
- Chaabane, Ikram; Guermazi, Radhouane; Hammami, Mohamed: Enhancing techniques for learning decision trees from imbalanced data (2020)
- Gubela, Robin M.; Lessmann, Stefan; Jaroszewicz, Szymon: Response transformation and profit decomposition for revenue uplift modeling (2020)
- Halbersberg, Dan; Wienreb, Maydan; Lerner, Boaz: Joint maximization of accuracy and information for learning the structure of a Bayesian network classifier (2020)
- Lázaro, Marcelino; Herrera, Francisco; Figueiras-Vidal, Aníbal R.: Ensembles of cost-diverse Bayesian neural learners for imbalanced binary classification (2020)
- Mahajan, Pravar Dilip; Maurya, Abhinav; Megahed, Aly; Elwany, Alaa; Strong, Ray; Blomberg, Jeanette: Optimizing predictive precision in imbalanced datasets for actionable revenue change prediction (2020)
- Ruehle, Fabian: Data science applications to string theory (2020)
- Sun, Hongwei; Cui, Yuehua; Gao, Qian; Wang, Tong: Trimmed LASSO regression estimator for binary response data (2020)
- Tao, Xinmin; Li, Qing; Guo, Wenjie; Ren, Chao; He, Qing; Liu, Rui; Zou, JunRong: Adaptive weighted over-sampling for imbalanced datasets based on density peaks clustering with heuristic filtering (2020)
- Tsukuda, Koji; Mano, Shuhei; Yamamoto, Toshimichi: Bayesian approach to discriminant problems for count data with application to multilocus short tandem repeat dataset (2020)
- Wu, Di; Zhang, Jiangjiang; Geng, Shaojin; Cai, Xingjuan; Zhang, Guoyou: A multi-objective bat algorithm for software defect prediction (2020)
- Xie, Jinhan; Hao, Meiling; Liu, Wenxin; Lin, Yuanyuan: Fused variable screening for massive imbalanced data (2020)
- Ahmad, Jamal; Hayat, Maqsood: MFSC: multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou’s PseAAC components (2019)
- Jia, Jianhua; Li, Xiaoyan; Qiu, Wangren; Xiao, Xuan; Chou, Kuo-Chen: iPPI-PseAAC(CGR): identify protein-protein interactions by incorporating chaos game representation into PseAAC (2019)
- Kocheturov, Anton; Pardalos, Panos M.; Karakitsiou, Athanasia: Massive datasets and machine learning for computational biomedicine: trends and challenges (2019)
- Lai, Chun Sing; Tao, Yingshan; Xu, Fangyuan; Ng, Wing W. Y.; Jia, Youwei; Yuan, Haoliang; Huang, Chao; Lai, Loi Lei; Xu, Zhao; Locatelli, Giorgio: A robust correlation analysis framework for imbalanced and dichotomous data with uncertainty (2019)
- Park, Soyoung; Carriquiry, Alicia: Learning algorithms to evaluate forensic glass evidence (2019)
- Poterie, A.; Dupuy, J.-F.; Monbet, V.; Rouvière, L.: Classification tree algorithm for grouped variables (2019)