Computer Science ›› 2018, Vol. 45 ›› Issue (7): 31-37. doi: 10.11896/j.issn.1002-137X.2018.07.005

• The 5th CCF Big Data Conference •

  • About the authors: CHEN Sheng-ling (born 1993), male, M.S., his main research interests include big data and machine learning, E-mail: waitsl@126.com; SHEN Si-qi (born 1985), male, Ph.D., assistant research fellow, his main research interests include big data and machine learning, E-mail: shensiqi@nudt.edu.cn (corresponding author); LI Dong-sheng (born 1978), male, Ph.D., research fellow, CCF member, his main research interests include big data and machine learning, E-mail: dsli@nudt.edu.cn.
  • Supported by:
    This work was supported by the National Key Basic Research and Development Program of China (0800067314001) and the National Natural Science Foundation of China (61602500,61502500).

Ensemble Learning Method for Imbalanced Data Based on Sample Weight Updating

CHEN Sheng-ling, SHEN Si-qi, LI Dong-sheng

  1. National Laboratory for Parallel and Distributed Processing,National University of Defense Technology,Changsha 410073,China
  • Received:2017-07-30 Online:2018-07-30 Published:2018-07-30



Abstract: The problem of imbalanced data is prevalent in many applications of big data and machine learning, such as medical diagnosis and anomaly detection. Researchers have proposed or adopted a variety of methods for imbalanced learning, including data sampling (e.g., SMOTE) and ensemble learning (e.g., EasyEnsemble). Oversampling methods may suffer from over-fitting or low classification accuracy on boundary samples, while undersampling methods may lead to under-fitting. This paper proposed the Rotation SMOTE algorithm, which fuses the basic ideas of SMOTE, Bagging and Boosting: during the Boosting process, SMOTE is applied to minority-class samples according to the prediction results of the base classifier, thereby indirectly increasing the weight of the minority class. Drawing on the basic idea of Focal Loss, this paper further proposed the FocalBoost algorithm, which directly optimizes the sample weight updating strategy of AdaBoost based on the prediction results of the base classifier. Experiments with multiple evaluation metrics on 11 imbalanced datasets from different application fields show that, compared with other imbalanced learning algorithms (including SMOTEBoost and EasyEnsemble), Rotation SMOTE obtains the highest recall on all datasets and achieves the best or second-best G-mean and F1-Score on most datasets, while FocalBoost outperforms the original AdaBoost on 9 of the imbalanced datasets.
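The page gives only the high-level ideas of the two algorithms, so the short Python sketch below is purely illustrative: it shows a naive SMOTE-style interpolation step (the mechanism Rotation SMOTE applies to minority samples during boosting) and an AdaBoost-style round whose weight update is modulated by a Focal-Loss-like factor (the mechanism behind FocalBoost). The function names, the use of decision stumps as base learners, and the exact form of the modulating factor (1 - p)^gamma are assumptions made for illustration, not the authors' implementation.

# Illustrative sketch only, see the caveats above; assumes binary labels y in {-1, +1}.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

def smote_like_synthesis(X_min, k=5, n_new=100, seed=0):
    # Interpolate between a randomly picked minority sample and one of its
    # k nearest minority neighbours, as in the original SMOTE idea.
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                    # idx[i][0] is the sample itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

def focal_weighted_boosting(X, y, n_rounds=50, gamma=2.0):
    # AdaBoost-style loop whose standard exponential weight update is additionally
    # modulated by a Focal-Loss-inspired factor (1 - p_true)**gamma, so samples the
    # base classifier is least confident about receive a larger extra boost.
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err == 0.0 or err >= 0.5:                 # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1.0 - err) / err)
        proba = stump.predict_proba(X)               # columns follow stump.classes_
        p_true = proba[np.arange(n), (y == stump.classes_[1]).astype(int)]
        focal = (1.0 - p_true) ** gamma              # assumed Focal-Loss-style modulation
        w = w * np.exp(-alpha * y * pred) * (1.0 + focal)
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def ensemble_predict(learners, alphas, X):
    # Weighted vote of the base learners, as in standard AdaBoost.
    score = sum(a * clf.predict(X) for clf, a in zip(learners, alphas))
    return np.sign(score)

In the paper's Rotation SMOTE, synthetic minority samples would instead be generated inside each boosting round, driven by which minority samples the current base classifier misclassifies; the sketch above only supplies the two building blocks.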

Key words: Boosting, Ensemble learning, Imbalanced data, SMOTE

CLC Number: 

  • TP181
[1]HE H,GARCIA E A.Learning from Imbalanced Data[J].IEEE Transactions on Knowledge & Data Engineering,2009,21(9):1263-1284.
[2]ZHOU Z H.Machine Learning[M].Beijing:Tsinghua University Press,2016.(in Chinese)
[3]CHAWLA N V,BOWYER K W,HALL L O,et al.SMOTE:synthetic minority over-sampling technique[J].Journal of Artificial Intelligence Research,2002,16(1):321-357.
[4]CHAWLA N,LAZAREVIC A,HALL L,et al.SMOTEBoost:Improving prediction of the minority class in boosting[C]∥European Conference on Knowledge Discovery in Databases:PKDD.2003:107-119.
[5]HE H,BAI Y,GARCIA E A,et al.ADASYN:Adaptive synthetic sampling approach for imbalanced learning[C]∥IEEE International Joint Conference on Neural Networks.IEEE,2008:1322-1328.
[6]JIA A L,SHEN S,CHEN S,et al.An Analysis on a YouTube-like UGC site with Enhanced Social Features[C]∥Proceedings of the 26th International Conference on World Wide Web Companion.2017:1477-1483.
[7]HAN H,WANG W Y,MAO B H.Borderline-SMOTE:a new over-sampling method in imbalanced data sets learning[C]∥International Conference on Intelligent Computing.Berlin,Heidelberg:Springer,2005:878-887.
[8]CIESLAK D A,CHAWLA N V,STRIEGEL A.Combating imbalance in network intrusion datasets[C]∥IEEE International Conference on Granular Computing.IEEE,2006:732-737.
[9]LI M,FAN S.CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests[J].BMC Bioinformatics,2017,18(1):169.
[10]LI J,FONG S,SUNG Y,et al.Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification[J].BioData Mining,2016,9(1):37.
[11]LIU X Y,WU J,ZHOU Z H.Exploratory Undersampling for Class-Imbalance Learning[J].IEEE Transactions on Systems,Man,and Cybernetics,Part B (Cybernetics),2009,39(2):539-550.
[12]SEIFFERT C,KHOSHGOFTAAR T M,VAN HULSE J,et al.RUSBoost:A hybrid approach to alleviating class imbalance[J].IEEE Transactions on Systems,Man,and Cybernetics-Part A:Systems and Humans,2010,40(1):185-197.
[13]RODRGUEZ J J,KUNCHEVA L I,ALONSO C J.Rotation forest:A new classifier ensemble method[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2006,28(10):1619-1630.
[14]LIN T Y,GOYAL P,GIRSHICK R,et al.Focal Loss for Dense Object Detection[OL].http://www.researchgate.net/publication/322059369-Focal-Loss-for-Dense_Object-Detection.
[15]GOODFELLOW I J,POUGET-ABADIE J,MIRZA M,et al.Generative adversarial nets[C]∥International Conference on Neural Information Processing Systems.MIT Press,2014:2672-2680.
[16]ASUNCION A,NEWMAN D.The UCI Machine Learning Repository[OL].http://archive.ics.uci.edu/ml/datasets.html.
[17]CHEN S,HE H,GARCIA E A.RAMOBoost:Ranked Minority Oversampling in Boosting[J].IEEE Transactions on Neural Networks,2010,21(10):1624-1642.