Computer Science, 2022, Vol. 49, Issue (5): 135-143. doi: 10.11896/jsjkx.210400064
李京泰, 王晓丹
LI Jing-tai, WANG Xiao-dan
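Abstract: When the XGBoost framework is applied to binary classification on imbalanced data, its ability to recognize minority-class samples degrades. To address this, an XGBoost algorithm based on a cost-sensitive activation function (Cost-sensitive Activation Function XGBoost, CSAF-XGBoost) is proposed. When XGBoost builds its decision trees, data imbalance affects the choice of split points, which causes minority-class samples to be misclassified. A cost-sensitive activation function is introduced to change how the gradient of the loss function varies for samples under different prediction outcomes. This addresses the problem that misclassified minority-class samples cannot be classified effectively during XGBoost iterations because their gradient changes are small. Experiments analyze the relationship between the parameters of the activation function and the degree of data imbalance, and compare the classification performance of CSAF-XGBoost with that of the SMOTE-XGBoost, ADASYN-XGBoost, Focal loss-XGBoost, and Weight-XGBoost optimization algorithms on UCI public datasets. The results show that, with F1 and AUC values equal or improved, CSAF-XGBoost raises the detection rate of minority-class samples by 6.75% on average, and by up to 15%, relative to the best of the compared algorithms. This demonstrates that CSAF-XGBoost has a stronger ability to recognize minority-class samples and is widely applicable.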
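The abstract does not give the form of the cost-sensitive activation function, so the sketch below does not reproduce CSAF-XGBoost. It illustrates only the general mechanism the abstract relies on: XGBoost fits each new tree to the per-sample gradient and Hessian of the loss, so reshaping those quantities changes how strongly a misclassified minority sample influences later trees. The technique shown is plain class-weighted logistic loss, and the function name and the `minority_cost` parameter are hypothetical names chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_logloss_grad_hess(margins, labels, minority_cost=1.0):
    """Gradient and Hessian of class-weighted logistic loss w.r.t. the raw margin.

    Loss per sample: -w * [y*log(p) + (1-y)*log(1-p)], with p = sigmoid(margin).
    `minority_cost` (hypothetical) is the weight w applied to label 1,
    assumed here to be the minority class; label 0 keeps weight 1.
    """
    grads, hessians = [], []
    for z, y in zip(margins, labels):
        p = sigmoid(z)
        w = minority_cost if y == 1 else 1.0
        grads.append(w * (p - y))            # first derivative
        hessians.append(w * p * (1.0 - p))   # second derivative
    return grads, hessians


# Two samples with the same margin, both currently predicted as majority.
margins = [-2.0, -2.0]
labels = [1, 0]  # the first sample is a misclassified minority sample

g_plain, h_plain = weighted_logloss_grad_hess(margins, labels)
g_cost, h_cost = weighted_logloss_grad_hess(margins, labels, minority_cost=5.0)

print("plain gradients:   ", g_plain)
print("weighted gradients:", g_cost)
```

With `minority_cost=5.0`, the gradient and Hessian of the minority sample are scaled by five while those of the majority sample are unchanged. A pair of lists like this is what a custom objective passed to XGBoost is expected to return, one value per training sample.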