Computer Science ›› 2018, Vol. 45 ›› Issue (7): 31-37.doi: 10.11896/j.issn.1002-137X.2018.07.005

• CCF Big Data 2017 • Previous Articles     Next Articles

Ensemble Learning Method for Imbalanced Data Based on Sample Weight Updating

CHEN Sheng-ling ,SHEN Si-qi, LI Dong-sheng   

  1. National Laboratory for Parallel and Distributed Processing,National University of Defense Technology,Changsha 410073,China
  • Received:2017-07-30 Online:2018-07-30 Published:2018-07-30

Abstract: The problem of imbalanced data is prevalent in various applications of big data and machine learning,like medical diagnosis and abnormal detection.Researchers have proposed or used a number of methods for imbalanced learning,including data sampling(e.g.SMOTE) and ensemble learning(e.g.EasyEnsemble) methods.The oversamp-ling methods in data sampling may have problems such as over-fitting or low classification accuracy of boundary samples,while the under-sampling methods may lead to under-fitting.The Rotation SMOTE algorithm was proposed in this paper incorporating the basic idea of SMOTE,Bagging,Boosting and other algorithms,and SMOTE was used to indirectly increase the weight of minority samples based on the prediction result of the base classifier in the Boosting process.According to the basic idea of Focal Loss,this paper proposed FocalBoost algorithm that directly optimizes the sample weight updating strategy of AdaBoost based on the prediction results of the base classifier.Based on the experiment with multiple evaluation metrics on 11 unbalanced data sets in different application fields,Rotation SMOTE can obtain the highest recall score on all datasets compared with other imbalanced data learning algorithms (including SMOTEBoost and EasyEnsemble),and achieves the best or the second best G-means and F1Score on most datasets,while FocalBoost achieves better performance on 9 of these unbalanced datasets compared to the original AdaBoost.

Key words: Boosting, Ensemble learning, Imbalanced data, SMOTE

CLC Number: 

  • TP181
[1]HE H,GARCIA E A.Learning from Imbalanced Data[J].IEEE Transactions on Knowledge & Data Engineering,2009,21(9):1263-1284.
[2]周志华.机器学习[M].北京:清华大学出版社,2016.
[3]CHAWLA N V,BOWYER K W,HALL L O,et al.SMOTE:synthetic minority over-sampling technique[J].Journal of Artificial Intelligence Research,2002,16(1):321-357.
[4]CHAWLA N,LAZAREVIC A,HALL L,et al.SMOTEBoost:Improving prediction of the minority class in boosting[C]∥European Conference on Knowledge Discovery in Databased:PKDD.2003:107-119.
[5]HE H,BAI Y,GARCIA E A,et al.ADASYN:Adaptive syn-thetic sampling approach for imbalanced learning[C]∥IEEE International Joint Conference on Neural Networks.IEEE,2008:1322-1328.
[6]JIA A L,SHEN S,CHEN S,et al.An Analysis on a YouTube-like UGC site with Enhanced Social Features[C]∥Proceedings of the 26th International Conference on World Wide Web Companion.2017:1477-1483.
[7]HAN H,WANG W Y,MAO B H.Borderline-SMOTE:a newover-sampling method in imbalanced data sets learning[C]∥International Conference on Intelligent Computing.Berlin,Springer,Heidelberg,2005:878-887.
[8]CIESLAK D A,CHAWLA N V,STRIEGEL A.Combating imbalance in network intrusion datasets[C]∥IEEE International Conference on Granular Computing.IEEE,2006:732-737.
[9]LI M,FAN S.CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests[J].Bmc Bioinformatics,2017,18(1):169.
[10]LI J,FONG S,SUNG Y,et al.Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification[J].Biodata Mining,2016,9(1):37.
[11]LIU X Y,WU J,ZHOU Z H.Exploratory Undersampling for Class-Imbalance Learning[J].IEEE Transactions on Systems Man & Cybernetics Part B Cybernetics A Publication of the IEEE Systems Man & Cybernetics Society,2009,39(2):539-550.
[12]SEIFFERT C,KHOSHGOFTAAR T M,VAN HULSE J,et al.RUSBoost:A hybrid approach to alleviating class imbalance[J].IEEE Transactions on Systems,Man,and Cybernetics-Part A:Systems and Humans,2010,40(1):185-197.
[13]RODRGUEZ J J,KUNCHEVA L I,ALONSO C J.Rotation forest:A new classifier ensemble method[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2006,28(10):1619-1630.
[14]LIN T Y,GOYAL P,GIRSHICK R,et al.Focal Loss for Dense Object Detection[OL].http://www.researchgate.net/publication/322059369-Focal-Loss-for-Dense_Object-Detection.
[15]GOODFELLOW I J,POUGET-ABADIE J,MIRZA M,et al.Generative adversarial nets[C]∥International Conference on Neural Information Processing Systems.MIT Press,2014:2672-2680.
[16]ARTHUR A,DAVID N.The UCI Machine Learning Repository.http://archive.ics.uci.edu/ml/datasets.html.
[17]CHEN S,HE H,GARCIA E A.RAMOBoost:Ranked Minority Oversampling in Boosting[J].IEEE Transactions on Neural Networks,2010,21(10):1624-1642.
[1] LIN Xi, CHEN Zi-zhuo, WANG Zhong-qing. Aspect-level Sentiment Classification Based on Imbalanced Data and Ensemble Learning [J]. Computer Science, 2022, 49(6A): 144-149.
[2] KANG Yan, WU Zhi-wei, KOU Yong-qi, ZHANG Lan, XIE Si-yu, LI Hao. Deep Integrated Learning Software Requirement Classification Fusing Bert and Graph Convolution [J]. Computer Science, 2022, 49(6A): 150-158.
[3] ZHOU Zhi-hao, CHEN Lei, WU Xiang, QIU Dong-liang, LIANG Guang-sheng, ZENG Fan-qiao. SMOTE-SDSAE-SVM Based Vehicle CAN Bus Intrusion Detection Algorithm [J]. Computer Science, 2022, 49(6A): 562-570.
[4] WANG Yu-fei, CHEN Wen. Tri-training Algorithm Based on DECORATE Ensemble Learning and Credibility Assessment [J]. Computer Science, 2022, 49(6): 127-133.
[5] HAN Hong-qi, RAN Ya-xin, ZHANG Yun-liang, GUI Jie, GAO Xiong, YI Meng-lin. Study on Cross-media Information Retrieval Based on Common Subspace Classification Learning [J]. Computer Science, 2022, 49(5): 33-42.
[6] DONG Qi-da, WANG Zhe, WU Song-yang. Feature Fusion Framework Combining Attention Mechanism and Geometric Information [J]. Computer Science, 2022, 49(5): 129-134.
[7] REN Shou-peng, LI Jin, WANG Jing-ru, YUE Kun. Ensemble Regression Decision Trees-based lncRNA-disease Association Prediction [J]. Computer Science, 2022, 49(2): 265-271.
[8] CHEN Wei, LI Hang, LI Wei-hua. Ensemble Learning Method for Nucleosome Localization Prediction [J]. Computer Science, 2022, 49(2): 285-291.
[9] JIANG Hao-chen, WEI Zi-qi, LIU Lin, CHEN Jun. Imbalanced Data Classification:A Survey and Experiments in Medical Domain [J]. Computer Science, 2022, 49(1): 80-88.
[10] LIU Zhen-yu, SONG Xiao-ying. Multivariate Regression Forest for Categorical Attribute Data [J]. Computer Science, 2022, 49(1): 108-114.
[11] CHEN Le, GAO Ling, REN Jie, DANG Xin, WANG Yi-hao, CAO Rui, ZHENG Jie, WANG Hai. Adaptive Bitrate Streaming for Energy-Efficiency Mobile Augmented Reality [J]. Computer Science, 2022, 49(1): 194-203.
[12] ZHOU Xin-min, HU Yi-gui, LIU Wen-jie, SUN Rong-jun. Research on Urban Function Recognition Based on Multi-modal and Multi-level Data Fusion Method [J]. Computer Science, 2021, 48(9): 50-58.
[13] CHEN Jing-jie, WANG Kun. Interval Prediction Method for Imbalanced Fuel Consumption Data [J]. Computer Science, 2021, 48(7): 178-183.
[14] ZHOU Gang, GUO Fu-liang. Research on Ensemble Learning Method Based on Feature Selection for High-dimensional Data [J]. Computer Science, 2021, 48(6A): 250-254.
[15] DAI Zong-ming, HU Kai, XIE Jie, GUO Ya. Ensemble Learning Algorithm Based on Intuitionistic Fuzzy Sets [J]. Computer Science, 2021, 48(6A): 270-274.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!