Computer Science ›› 2022, Vol. 49 ›› Issue (5): 135-143.doi: 10.11896/jsjkx.210400064

• Database & Big Data & Data Science •

XGBoost for Imbalanced Data Based on Cost-sensitive Activation Function

LI Jing-tai, WANG Xiao-dan   

  1. Air and Missile Defense College,Air Force Engineering University,Xi’an 710051,China
  • Received:2021-04-07 Revised:2021-09-07 Online:2022-05-15 Published:2022-05-06
  • About author:LI Jing-tai,born in 1998,postgraduate.His main research interests include machine learning and steganalysis.
    WANG Xiao-dan,born in 1966,Ph.D,professor,Ph.D supervisor,is a member of China Computer Federation.Her main research interests include machine learning and intelligent information processing.

Abstract: For binary classification with class imbalance, a cost-sensitive activation function XGBoost algorithm (CSAF-XGBoost) is proposed to improve the recognition of minority-class samples. When the XGBoost algorithm constructs decision trees, imbalanced data distort split-point selection, which leads to misclassification of the minority class. By constructing a cost-sensitive activation function (CSAF), samples with different prediction results receive different gradient variations, which addresses the problem that the gradient variation of a misclassified minority-class sample is too small for the sample to be recognized correctly in later iterations. Experiments analyze the relation between the imbalance ratio (IR) and the algorithm's parameters, and compare performance with SMOTE-XGBoost, ADASYN-XGBoost, Focal loss-XGBoost and Weight-XGBoost on UCI datasets. In terms of minority-class recall, CSAF-XGBoost exceeds the best of the compared methods by 6.75% on average and by up to 15%, while keeping the F1-score and AUC at the same level. The results show that CSAF-XGBoost recognizes minority-class samples better and has wider applicability.
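The exact form of CSAF is not given in the abstract, so the following is only a minimal sketch of the general mechanism it describes: injecting a cost-sensitive term into the gradient of the logistic loss through XGBoost's custom-objective interface, so that misclassified minority-class samples receive larger updates. The helper name make_cost_sensitive_obj, the parameter cost_fn, and the error-dependent scaling rule are illustrative assumptions, not the authors' implementation.

import numpy as np
import xgboost as xgb

def make_cost_sensitive_obj(cost_fn=5.0):
    """Custom objective that enlarges the gradient of misclassified
    minority-class (label 1) samples; cost_fn is an assumed cost factor."""
    def obj(preds, dtrain):
        y = dtrain.get_label()            # 0 = majority, 1 = minority
        p = 1.0 / (1.0 + np.exp(-preds))  # sigmoid of the raw margin
        grad = p - y                      # standard logistic gradient
        hess = p * (1.0 - p)              # standard logistic hessian
        # Amplify the update for minority samples in proportion to their
        # current error (p close to 0 means the sample is badly misclassified).
        scale = np.where(y == 1, 1.0 + (cost_fn - 1.0) * (1.0 - p), 1.0)
        return grad * scale, hess * scale
    return obj

# Usage (hypothetical data): train with the custom objective instead of
# the built-in "binary:logistic" loss.
# dtrain = xgb.DMatrix(X, label=y)
# booster = xgb.train({"max_depth": 6, "eta": 0.1}, dtrain,
#                     num_boost_round=200, obj=make_cost_sensitive_obj(5.0))

Scaling the Hessian together with the gradient behaves like per-sample reweighting of the logistic loss, which is also the general mechanism that the Weight-XGBoost baseline mentioned above relies on.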

Key words: Activation function, Cost-sensitive, Imbalanced data classification, Logistic regression, XGBoost

CLC Number: TP391.4
[1]DENG M Y,GUO Y S,LIU T.Research on Imbalanced Data Sampling Method Based on Stratification and Recombination[J].Journal of Chongqing University of Technology(Natural Science),2021,35(8):122-128.
[2]DOUZAS G,BACAO F.Effective data generation for imbalanced learning using conditional generative adversarial networks[J].Expert Systems with Applications,2018,91:464-471.
[3]ZHANG H,HUANG L,WU C Q,et al.An Effective Convolutional Neural Network Based on SMOTE and Gaussian Mixture Model for Intrusion Detection in Imbalanced Dataset[J/OL].Computer Networks,2020,177.https://www.sciencedirect.com/science/article/abs/pii/S1389128620300712.
[4]YI H K,JIANG Q C,YAN X F,et al.Imbalanced Classification Based on Minority Clustering Synthetic Minority Oversampling Technique with Wind Turbine Fault Detection Application[J].IEEE Transactions on Industrial Informatics,2021,17(9):5867-5875.
[5]TAO X M,LI Q,REN C,et al.Real-value negative selection oversampling for imbalanced data set learning[J].Expert Systems with Applications,2019,129:118-134.
[6]LI Y,LIU Z D,ZHANG H J.Review on ensemble algorithms for imbalanced data classification[J].Application Research of Computers,2014,5:13-17.
[7]TURNEY P.Types of Cost in Inductive Concept Learning[J].arXiv:cs/0212034,2002.
[8]LI Y X,CHAI Y,HU Y Q,et al.Review of imbalanced data classification methods[J].Control and Decision,2019,34(4):4-19.
[9]BADRAN M F,SAHAR N M,SARI S,et al.Intrusion-Detection System Based on Hybrid Models:Review Paper[C]//IOP Conference Series:Materials Science and Engineering.2020.
[10]PING R,ZHOU S S,LI D.Cost sensitive random forest classification algorithm for highly unbalanced data[J].Pattern Recognition and Artificial Intelligence,2020,201(3):62-70.
[11]JING X Y,ZHANG X Y,ZHU X K,et al.Multiset Feature Learning for Highly Imbalanced Data Classification[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2021,43(1):139-156.
[12]TAO X M,LI Q,GUO W,et al.Self-adaptive cost weights-based support vector machine cost-sensitive ensemble for imbalanced data classification[J].Information Sciences,2019,487:31-56.
[13]GALAR M,FERNANDEZ A,BARRENECHEA E,et al.A Review on Ensembles for the Class Imbalance Problem:Bagging-,Boosting-,and Hybrid-Based Approaches[J].IEEE Transactions on Systems,Man,and Cybernetics,Part C (Applications and Reviews),2012,42(4):463-484.
[14]GARCIA S,ZHANG Z L,ALTALHI A,et al.Dynamic ensemble selection for multi-class imbalanced datasets[J].Information Sciences,2018,445:22-37.
[15]CHEN Q W,WANG W,MA D,et al.Class-imbalance credit scoring using Ext-GBDT ensemble[J].Application Research of Computers,2018,35(2):421-427.
[16]TAO X M,CHEN W,LI X,et al.The ensemble of density-sensitive SVDD classifier based on maximum soft margin for imbalanced datasets[J/OL].Knowledge-Based Systems,2021,219(7).https://www.sciencedirect.com/science/article/abs/pii/S095070512100160X.
[17]ZHANG Z,QIU J X,DAI W.A New Improved Boosting for Imbalanced Data Classification[C]//IOP Conference Series:Materials Science and Engineering.2019.
[18]SHI H T,WANG H R,HUANG Y X,et al.A hierarchical method based on weighted extreme gradient boosting in ECG heartbeat classification[J].Computer Methods and Programs in Biomedicine,2019,171:1-10.
[19]DING H,LIU K,CHEN X Z,et al.Optimized Segmentation Based on the Weighted Aggregation Method for Loess Bank Gully Mapping[J].Remote Sensing,2020,12(5):793-813.
[20]THABTAH F,HAMMOUD S,KAMALOV F,et al.Data imbalance in classification:Experimental evaluation[J].Information Sciences,2020,513:429-441.
[21]ABAD Z S H,MASLOVE D M,LEE J.Predicting Discharge Destination of Critically Ill Patients Using Machine Learning[J].IEEE Journal of Biomedical Health Informatics,2021,25(3):827-837.
[22]CHANG Y C,CHANG K H,WU G J.Application of extreme gradient boosting trees in the construction of credit risk assessment models for financial institutions[J].Applied Soft Computing,2018,73:914-920.
[23]CHEN W B,FU K,ZUO J W,et al.Radar emitter classification for large data set based on weighted-xgboost[J].IET Radar Sonar and Navigation,2017,11(8):1203-1207.
[24]ZOU S H,SUN H Z,XU G S,et al.Ensemble Strategy for Insider Threat Detection from User Activity Logs[J].CMC-Computers Materials & Continua,2020,65(2):1321-1334.
[25]SANER C B,KESICI M,YASLAN Y,et al.Improving the Performance of Transient Stability Prediction using Resampling Methods[C]//Proceedings of the 2019 11th International Conference on Electrical and Electronics Engineering (ICEEE).Bursa:IEEE,2019:146-150.
[26]CHEN T,GUESTRIN C.XGBoost:A Scalable Tree Boosting System[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.San Francisco:Association for Computing Machinery,2016:785-794.
[27]ZHOU Z H,LIU X Y.On multi-class cost-sensitive learning[J].Computational Intelligence,2010,26(3):232-257.
[28]WAN J W,YANG M.Survey on Cost-sensitive Learning Method[J].Journal of Software,2020,31(1):113-136.
[29]NASARIAN E,ABDAR M,FAHAMI M A,et al.Association between work-related features and coronary artery disease:A heterogeneous hybrid feature selection integrated with balancing approach[J].Pattern Recognition Letters,2020,133:33-40.
[30]WANG C,DENG C Y,WANG S Z.Imbalance-XGBoost:leveraging weighted and focal losses for binary label-imbalanced classification with XGBoost[J].Pattern Recognition Letters,2020,136:190-197.
[31]LIN T Y,GOYAL P,GIRSHICK R,et al.Focal Loss for Dense Object Detection[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2020,42(2):318-327.
[32]TRAN G S,NGHIEM T P,NGUYEN V T,et al.Improving Accuracy of Lung Nodule Classification Using Deep Learning with Focal Loss[J/OL].Journal of Healthcare Engineering,2019.https://www.hindawi.com/journals/jhe/2019/5156416/.
[33]BERGSTRA J,BENGIO Y.Random Search for Hyper-Parameter Optimization[J].Journal of Machine Learning Research,2012,13:281-305.