Computer Science ›› 2020, Vol. 47 ›› Issue (6): 98-103. doi: 10.11896/jsjkx.191200138

• Database & Big Data & Data Science •

Application Research of Improved XGBoost in Imbalanced Data Processing

SONG Ling-ling1, WANG Shi-hui1,2, YANG Chao1,2,3, SHENG Xiao1

  1. School of Computer and Information Engineering, Hubei University, Wuhan 430062, China
  2. Hubei Provincial Education Information Engineering Technology Research Center, Wuhan 430062, China
  3. Hubei Key Laboratory of Applied Mathematics, School of Mathematics and Statistics, Hubei University, Wuhan 430062, China
  • Received: 2019-12-23 Online: 2020-06-15 Published: 2020-06-10
  • Corresponding author: YANG Chao (stevenyc@hubu.edu.cn)
  • About author: SONG Ling-ling (ling2_song@stu.hubu.edu.cn), born in 1994, postgraduate. Her main research interests include machine learning.
    YANG Chao, born in 1982, Ph.D., associate professor, postgraduate supervisor, is a member of China Computer Federation. His main research interests include information security and computer immunology.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61977021) and the Open Fund Project of Hubei Key Laboratory of Applied Mathematics (HBAM201902).

Abstract: When handling imbalanced data, traditional classifiers tend to preserve accuracy on the majority class at the expense of the minority class, which leads to a high misclassification rate for minority samples. To address this problem, an improved XGBoost (eXtreme Gradient Boosting) method for binary imbalanced data is proposed. The main idea is to adapt the method to the characteristics of imbalanced data at three levels: data, features, and algorithm. First, at the data level, a Conditional Generative Adversarial Net (CGAN) learns the distribution of the minority class samples, and its trained generator produces supplementary minority samples to reduce the imbalance of the data. Second, at the feature level, XGBoost is used for feature combination to generate new features, and the minimal Redundancy-Maximal Relevance (mRMR) algorithm then selects a feature subset better suited to imbalanced data classification. Finally, at the algorithm level, a Focal Loss function for imbalanced data classification is introduced to improve XGBoost, and the improved XGBoost is trained on the new dataset to obtain the final model. In the experiments, G-mean and AUC are used as evaluation metrics. Results on six KEEL datasets verify the feasibility of the proposed method, and a comparison with four existing imbalanced classification models shows that the proposed method achieves better classification performance.

Keywords: CGAN, Feature combination, Focal Loss, Imbalanced data, mRMR, XGBoost
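The algorithm-level step in the abstract replaces the usual log loss with Focal Loss. Gradient-boosting libraries such as XGBoost accept a user-defined objective that returns the per-sample gradient and second derivative with respect to the raw score, so a minimal sketch of such an objective may help. It assumes the standard binary focal loss, FL = -α_t (1 - p_t)^γ log(p_t), and is not taken from the paper itself: the function names, the defaults `gamma=2.0` and `alpha=0.25`, and the finite-difference second derivative are all illustrative assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _pt_and_sign(z, y):
    # p_t is the predicted probability of the true class.
    # s is +1 for the positive class and -1 for the negative class.
    s = 1.0 if y == 1 else -1.0
    pt = sigmoid(s * z)
    # Clamp away from 0 and 1 so that log() stays finite.
    pt = min(max(pt, 1e-12), 1.0 - 1e-12)
    return pt, s


def focal_loss(z, y, gamma=2.0, alpha=0.25):
    """Focal loss of one sample, given raw score z and label y in {0, 1}."""
    pt, _ = _pt_and_sign(z, y)
    a = alpha if y == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * math.log(pt)


def focal_grad(z, y, gamma=2.0, alpha=0.25):
    """Analytic derivative of focal_loss with respect to the raw score z."""
    pt, s = _pt_and_sign(z, y)
    a = alpha if y == 1 else 1.0 - alpha
    g = gamma * pt * (1.0 - pt) ** gamma * math.log(pt) - (1.0 - pt) ** (gamma + 1.0)
    return a * s * g


def focal_hess(z, y, gamma=2.0, alpha=0.25, eps=1e-4):
    """Second derivative, approximated by a central difference of focal_grad."""
    hi = focal_grad(z + eps, y, gamma, alpha)
    lo = focal_grad(z - eps, y, gamma, alpha)
    return (hi - lo) / (2.0 * eps)


def focal_objective(margins, labels, gamma=2.0, alpha=0.25):
    """Return (gradients, second derivatives), one value per sample."""
    grads = [focal_grad(z, y, gamma, alpha) for z, y in zip(margins, labels)]
    hess = [focal_hess(z, y, gamma, alpha) for z, y in zip(margins, labels)]
    return grads, hess
```

With `gamma=0` the gradient reduces to a class-weighted version of the ordinary log-loss gradient `p - y`. Larger `gamma` values shrink the contribution of samples that are already classified well, which is the property that makes the loss attractive for imbalanced data. A production objective would likely use a closed-form second derivative and vectorized arrays instead of Python lists.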

CLC Number:

  • TP181
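The abstract evaluates models with G-mean and AUC rather than plain accuracy. As a rough illustration of why, the sketch below computes both metrics in plain Python. It assumes the common definitions: G-mean as the geometric mean of sensitivity and specificity, and AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The function names and the 0/1 label encoding (1 for the minority class) are assumptions made for this example.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# A classifier that always predicts the majority class (label 0).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_majority = [0] * 10
accuracy = sum(1 for t, p in zip(labels, always_majority) if t == p) / len(labels)
majority_g_mean = g_mean(labels, always_majority)
```

In this toy case the accuracy is 0.9 while the G-mean is 0.0, because the single minority sample is missed. The pairwise AUC loop is quadratic in the number of samples, so it is only suitable for small illustrations.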