Computer Science ›› 2022, Vol. 49 ›› Issue (5): 135-143. doi: 10.11896/jsjkx.210400064

• Database & Big Data & Data Science •

XGBoost for Imbalanced Data Based on Cost-sensitive Activation Function

LI Jing-tai, WANG Xiao-dan

  1. Air and Missile Defense College, Air Force Engineering University, Xi’an 710051, China
  • Received: 2021-04-07 Revised: 2021-09-07 Online: 2022-05-15 Published: 2022-05-06
  • Corresponding author: WANG Xiao-dan (afeu_wang@163.com)
  • About author: LI Jing-tai (afeulijingtai@163.com), born in 1998, postgraduate. His main research interests include machine learning and steganalysis.
    WANG Xiao-dan, born in 1966, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. Her main research interests include machine learning and intelligent information processing.

Abstract: When the XGBoost framework is applied to binary classification on imbalanced data, its ability to recognize minority-class samples degrades. To address this, an XGBoost algorithm based on a cost-sensitive activation function (CSAF-XGBoost) is proposed. When XGBoost builds its decision trees, data imbalance affects the choice of split points, which leads to minority-class samples being misclassified. The cost-sensitive activation function (CSAF) changes how the gradient of the loss function varies for samples under different prediction outcomes, which addresses the problem that a misclassified minority-class sample has too small a gradient variation to be classified correctly in later XGBoost iterations. Experiments analyze the relation between the parameters of the activation function and the imbalance ratio (IR) of the data, and compare CSAF-XGBoost with SMOTE-XGBoost, ADASYN-XGBoost, Focal loss-XGBoost and Weight-XGBoost on UCI public datasets. With F1-score and AUC equal or improved, the recall of CSAF-XGBoost on minority-class samples exceeds that of the best competing method by 6.75% on average and by up to 15%. These results indicate that CSAF-XGBoost recognizes minority-class samples better and is widely applicable.
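The abstract does not state the form of the CSAF, so the sketch below is not the authors' function. It only illustrates, under stated assumptions, how a cost term changes the gradient statistics that XGBoost aggregates in each leaf. It compares the plain logistic loss with a class-weighted variant, and uses the standard XGBoost leaf weight $w^* = -G/(H+\lambda)$, where $G$ and $H$ are the sums of first and second derivatives in the leaf. The cost parameter `alpha`, the value `reg_lambda=1.0`, and the 9:1 toy leaf are hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Logistic activation: maps a raw score z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(z: float, y: int) -> Tuple[float, float]:
    """First and second derivative of the plain logistic loss w.r.t. z."""
    p = sigmoid(z)
    return p - y, p * (1.0 - p)


def weighted_logistic_grad_hess(z: float, y: int, alpha: float) -> Tuple[float, float]:
    """Derivatives of -(alpha*y*log(p) + (1-y)*log(1-p)) w.r.t. z.

    alpha is a hypothetical cost for errors on the minority class (y = 1);
    it is an assumption of this sketch, not a parameter from the paper.
    """
    p = sigmoid(z)
    w = alpha if y == 1 else 1.0
    return w * (p - y), w * p * (1.0 - p)


def leaf_weight(grad_hess: List[Tuple[float, float]], reg_lambda: float = 1.0) -> float:
    """Standard XGBoost leaf weight: -G / (H + lambda)."""
    g_sum = sum(g for g, _ in grad_hess)
    h_sum = sum(h for _, h in grad_hess)
    return -g_sum / (h_sum + reg_lambda)


# Toy leaf: 9 majority samples (y = 0) and 1 minority sample (y = 1),
# all starting from the raw score z = 0 (probability 0.5).
labels = [0] * 9 + [1]
plain = [logistic_grad_hess(0.0, y) for y in labels]
weighted = [weighted_logistic_grad_hess(0.0, y, alpha=9.0) for y in labels]

print("plain leaf weight:   ", leaf_weight(plain))
print("weighted leaf weight:", leaf_weight(weighted))
```

With the plain loss, the nine majority gradients outweigh the single minority gradient and the leaf is pushed toward the majority class. Setting `alpha` to the imbalance ratio of the leaf cancels that pull in this toy case. A cost-sensitive activation function pursues a related goal by reshaping the gradient itself rather than rescaling it by a constant.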

Key words: Activation function, Cost-sensitive, Imbalanced data classification, Logistic regression, XGBoost

CLC Number:

  • TP391.4