计算机科学 (Computer Science), 2018, Vol. 45, Issue (7): 31-37. doi: 10.11896/j.issn.1002-137X.2018.07.005
陈圣灵,沈思淇,李东升
CHEN Sheng-ling, SHEN Si-qi, LI Dong-sheng
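Abstract: Imbalanced data is a pervasive problem across the application areas of big data and machine learning, such as medical diagnosis and anomaly detection. Researchers have proposed or adopted many methods for learning from imbalanced data, including data sampling (e.g., SMOTE) and ensemble learning (e.g., EasyEnsemble). Among sampling methods, over-sampling may cause overfitting or low classification accuracy on boundary samples, while under-sampling may cause underfitting. This paper combines the basic ideas of SMOTE, Bagging, Boosting and related algorithms and proposes the Rotation SMOTE algorithm. During boosting, the algorithm applies SMOTE to minority-class samples according to the predictions of the base classifiers, which indirectly increases the weight of the minority class. Drawing on the basic idea of Focal Loss, the paper also proposes the FocalBoost algorithm, which directly optimizes the AdaBoost weight-update strategy according to the predictions of the base classifiers. Experiments on several evaluation metrics over 11 imbalanced datasets from different application domains show that, compared with other imbalanced-data algorithms (including SMOTEBoost and EasyEnsemble), Rotation SMOTE achieves the highest recall on all datasets and the best or second-best G-mean and F1 score on most of them. Compared with the original AdaBoost, FocalBoost achieves better performance metrics on 9 of the imbalanced datasets.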
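The abstract reports recall, G-mean and F1 score without defining them. The following is a minimal sketch of the standard definitions for a binary problem, assuming the minority class is labeled `1`. The function name `imbalance_metrics` and the toy labels are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return recall, G-mean and F1 for the minority (positive) class.

    Hypothetical helper for illustration; uses the standard textbook
    definitions based on the binary confusion matrix.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Recall: fraction of minority samples that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Specificity: fraction of majority samples correctly rejected.
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Precision: fraction of predicted minority samples that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0

    # G-mean balances performance on both classes.
    g_mean = math.sqrt(recall * specificity)
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "g_mean": g_mean, "f1": f1}


# Toy example: 2 minority samples out of 10 (made-up labels).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = imbalance_metrics(y_true, y_pred)
print(metrics)
```

In this toy example the plain accuracy is 0.8 even though only half of the minority samples are recovered, which is why imbalanced-learning studies tend to report recall, G-mean and F1 instead of accuracy alone.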