Computer Science ›› 2020, Vol. 47 ›› Issue (6): 98-103. doi: 10.11896/jsjkx.191200138
SONG Ling-ling1, WANG Shi-hui1,2, YANG Chao1,2,3, SHENG Xiao1
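Abstract: When handling imbalanced data, traditional classifiers tend to preserve accuracy on the majority class at the expense of the minority class, which leads to a high misclassification rate for the minority class. To address this problem, an improved XGBoost (eXtreme Gradient Boosting) method for binary classification of imbalanced data is proposed. The main idea is to adapt the method to the characteristics of imbalanced data at three levels: data, features, and algorithm. First, at the data level, a conditional generative adversarial network (Conditional Generative Adversarial Nets, CGAN) learns the distribution of the minority-class samples, and the trained generator produces supplementary minority-class samples to reduce the imbalance of the data. Second, at the feature level, XGBoost is used to combine features into new ones, and the minimal Redundancy-Maximal Relevance (mRMR) algorithm then selects a feature subset better suited to imbalanced classification. Finally, at the algorithm level, the focal loss function (Focal Loss), which targets imbalanced classification problems, is introduced to improve XGBoost, and the improved XGBoost is trained on the new dataset to obtain the final model. In the experiments, G-mean and AUC are used as evaluation metrics. Results on six KEEL datasets verify the feasibility of the proposed method, and a comparison with four existing imbalanced classification models shows that the proposed method achieves relatively good classification performance.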
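The algorithm-level step can be illustrated with a minimal sketch of a binary focal loss and its derivatives with respect to the raw score, which is what a custom boosting objective needs. The abstract does not give the paper's exact formulation, so the form used here (class weight `alpha`, focusing parameter `gamma`), the default values, and all function names are assumptions. The second derivative is approximated by a finite difference rather than derived in closed form.

```python
import math


def sigmoid(z):
    """Map a raw score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(z, y, alpha=0.25, gamma=2.0):
    """Binary focal loss -alpha_t * (1 - p_t)**gamma * log(p_t) for one sample."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad(z, y, alpha=0.25, gamma=2.0):
    """Analytic derivative of focal_loss with respect to the raw score z."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    sign = 1.0 if y == 1 else -1.0  # d p_t / d z = sign * p_t * (1 - p_t)
    return sign * alpha_t * (
        gamma * p_t * (1.0 - p_t) ** gamma * math.log(p_t)
        - (1.0 - p_t) ** (gamma + 1.0)
    )


def focal_hess(z, y, alpha=0.25, gamma=2.0, eps=1e-4):
    """Second derivative, approximated by a central difference of focal_grad."""
    g_hi = focal_grad(z + eps, y, alpha, gamma)
    g_lo = focal_grad(z - eps, y, alpha, gamma)
    return (g_hi - g_lo) / (2.0 * eps)


def focal_objective(scores, labels, alpha=0.25, gamma=2.0):
    """Per-sample gradient and Hessian lists for a boosting-style objective."""
    grads = [focal_grad(z, y, alpha, gamma) for z, y in zip(scores, labels)]
    hess = [focal_hess(z, y, alpha, gamma) for z, y in zip(scores, labels)]
    return grads, hess
```

With `gamma = 0` and `alpha = 0.5` the gradient reduces to half of the ordinary logistic-loss gradient `p - y`, which is a quick sanity check. Larger `gamma` shrinks the contribution of samples that are already classified well. Because the focal loss is not convex in the raw score, the second derivative can be negative for some inputs, so a practical objective may need to clip it.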