Computer Science ›› 2020, Vol. 47 ›› Issue (6): 98-103. doi: 10.11896/jsjkx.191200138

• Database & Big Data & Data Science •

Application Research of Improved XGBoost in Imbalanced Data Processing

SONG Ling-ling1, WANG Shi-hui1,2, YANG Chao1,2,3, SHENG Xiao1   

  1 School of Computer and Information Engineering, Hubei University, Wuhan 430062, China
    2 Hubei Provincial Education Information Engineering Technology Research Center, Wuhan 430062, China
    3 Hubei Key Laboratory of Applied Mathematics, School of Mathematics and Statistics, Hubei University, Wuhan 430062, China
  • Received:2019-12-23 Online:2020-06-15 Published:2020-06-10
  • About author:SONG Ling-ling, born in 1994, postgraduate. Her main research interests include machine learning.
    YANG Chao, born in 1982, Ph.D, associate professor, postgraduate supervisor, is a member of China Computer Federation. His main research interests include information security and computer immunology.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61977021) and the Open Fund of Hubei Key Laboratory of Applied Mathematics (HBAM201902).

Abstract: When dealing with imbalanced data, traditional classifiers tend to guarantee the accuracy of the majority class at the expense of the minority class, resulting in a high error rate on the minority class. To address this problem, an improved XGBoost method for binary imbalanced data is proposed. The main idea is to handle the imbalance at three levels: data, features, and algorithm. Firstly, at the data level, a Conditional Generative Adversarial Net (CGAN) learns the distribution of the minority class samples, and its trained generator synthesizes supplementary minority samples to rebalance the data. Secondly, at the feature level, XGBoost is used for feature combination to generate new features, and the minimal Redundancy-Maximal Relevance (mRMR) algorithm then screens out a feature subset better suited to imbalanced classification. Finally, at the algorithm level, the Focal Loss function for imbalanced classification is introduced to improve XGBoost. The improved XGBoost is trained on the new dataset to obtain the final model. In the experiments, G-mean and AUC are selected as the evaluation metrics. Results on six KEEL datasets verify the feasibility of the proposed improvements. The method is also compared with four existing imbalanced classification models, and the experimental results show that the proposed method achieves better classification performance.
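To make the algorithm-level modification concrete, the sketch below plugs a binary Focal Loss into XGBoost as a custom training objective. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the finite-difference gradient and Hessian, and the hyperparameter values alpha=0.25 and gamma=2.0 are illustrative choices only.

import numpy as np
import xgboost as xgb

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focal_loss_objective(alpha=0.25, gamma=2.0):
    # Returns an XGBoost-compatible objective obj(preds, dtrain) -> (grad, hess)
    # for the binary focal loss:
    #   FL = -alpha*y*(1-p)^gamma*log(p) - (1-alpha)*(1-y)*p^gamma*log(1-p)
    def per_sample_loss(margin, y):
        p = sigmoid(margin)
        eps = 1e-12  # numerical guard for log(0)
        return -(alpha * y * (1.0 - p) ** gamma * np.log(p + eps)
                 + (1.0 - alpha) * (1.0 - y) * p ** gamma * np.log(1.0 - p + eps))

    def objective(preds, dtrain):
        y = dtrain.get_label()
        h = 1e-4
        # Central finite differences on the raw margin keep the sketch short;
        # an analytic gradient/Hessian would be preferable in practice.
        grad = (per_sample_loss(preds + h, y) - per_sample_loss(preds - h, y)) / (2.0 * h)
        hess = (per_sample_loss(preds + h, y) - 2.0 * per_sample_loss(preds, y)
                + per_sample_loss(preds - h, y)) / (h ** 2)
        hess = np.maximum(hess, 1e-6)  # keep the Hessian positive so splits stay stable
        return grad, hess

    return objective

# Hypothetical usage on an imbalanced binary dataset (X, y are placeholders):
# dtrain = xgb.DMatrix(X, label=y)
# booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain, num_boost_round=200,
#                     obj=focal_loss_objective(alpha=0.25, gamma=2.0))

After training, the booster's leaf indices (booster.predict(dtrain, pred_leaf=True)) could serve as the combined features that the mRMR step then filters, although that stage is omitted from this sketch.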

Key words: CGAN, Feature combination, Focal Loss, Imbalanced data, mRMR, XGBoost

CLC Number: TP181
[1]LIN W,TSAI C,HU Y,et al.Clustering-based undersampling in class-imbalanced data[J].Information Sciences,2017,409:17-26.
[2]BHATTACHARYA S,RAJAN V,SHRIVASTAVA H.ICU mortality prediction:A classification algorithm for imbalanced datasets[C]//Proc of the 31st AAAI Conf on Artificial Intelligence.San Francisco:AAAI,2017:1288-1294.
[3]CHEN X,LIU P H,SUN Y Z,et al.Research on Disease Prediction Models Based on Imbalanced Medical Data Sets[J].Chinese Journal of Computers,2019,42(3):596-609.
[4]HU M M,CHEN X,SUN Y Z,et al.A Disease Prediction Model Based on Dynamic Sampling and Transfer Learning[J].Chinese Journal of Computers,2019,42(10):2339-2354.
[5]DUAN L,XIE M,BAI T,et al.A new support vector data description method for machinery fault diagnosis with unbalanced datasets[J].Expert Systems with Applications,2016,64:239-246.
[6]WANG F,XU T,TANG T,et al.Bilevel feature extraction-based text mining for fault diagnosis of railway systems[J].IEEE Trans on Intelligent Transportation Systems,2016,18(1):49-58.
[7]WANG S,YAO X.Using class imbalance learning for software defect prediction[J].IEEE Trans on Reliability,2013,62(2):434-443.
[8]XIONG W,LI B,HE L,et al.Collaborative web service QoS prediction on unbalanced data distribution[C]//IEEE Int Conf on Web Services.Anchorage:IEEE,2014:377-384.
[9]SHEN W,WANG X,WANG Y,et al.Deepcontour:A deep convolutional feature learned by positive-sharing loss for contour detection[C]//Proc of the IEEE Conf on Computer Vision and Pattern Recognition.Boston:IEEE,2015:3982-3991.
[10]POUYANFAR S,CHEN S C.Automatic video event detection for imbalance data using enhanced ensemble deep learning[J].Int J of Semantic Computing,2017,11(1):85-109.
[11]RAO R B.Data mining for improved cardiac care[J].ACM SIGKDD Explorations Newsletter,2006,8(1):3-10.
[12]LI Y X,CHAI Y,HU Y Q,et al.Review of imbalanced data classification methods[J].Control and Decision,2019,34(4):673-688.
[13]GARCIA V,SANCHEZ J S,MOLLINEDA R A.On the effectiveness of preprocessing methods when dealing with different levels of class imbalance[J].Knowledge-Based Systems,2011,25(1):13-21.
[14]CHAWLA N V,BOWYER K W,HALL L O,et al.SMOTE:Synthetic Minority Over-sampling Technique[J].Journal of Artificial Intelligence Research,2002,16(1):321-357.
[15]CIESLAK D A,CHAWLA N V,STRIEGEL A.Combating imbalance in network intrusion datasets[C]//Proceedings of IEEE International Conference on Granular Computing.IEEE,2006:732-737.
[16]GUO H P,ZHOU J,WU C A,et al.K-nearest neighbor classification method for class-imbalanced problem[J].Journal of Computer Applications,2018,38(4):955-959,977.
[17]WASIKOWSKI M,CHEN X W.Combating the Small Sample Class Imbalance Problem Using Feature Selection[J].IEEE Transactions on Knowledge and Data Engineering,2010,22(10):1388-1400.
[18]WANG J,LI D Y,WANG S G.Feature Selection Method for Imbalanced Text Sentiment Classification[J].Computer Science,2016,43(10):206-210,224.
[19]ZHAO N,ZHANG X F,ZHANG L J.Overview of imbalanced data classification[J].Computer Science,2018,45(S1):22-27.
[20]WU Y X,WANG J L,YANG L,et al.Survey on Cost-sensitive Deep Learning Methods[J].Computer Science,2019,46(5):1-12.
[21]CAO Y X,HUANG H Y.Imbalanced Data Classification Algorithm Based on Probability Sampling and Ensemble Learning[J].Computer Science,2019,46(5):203-208.
[22]YUAN X M,YANG M,YANG Y.An Ensemble Classifier Based on Structural Support Vector Machine for Imbalanced Data[J].Pattern Recognition and Artificial Intelligence,2013,26(3):315-320.
[23]CHAWLA N V,LAZAREVIC A,HALL L O,et al.SMOTEBoost:improving prediction of the minority class in boosting[C]//Proceedings of the 2003 European Conference on Knowledge Discovery in Databases,LNCS 2838.Berlin:Springer,2003:107-119.
[24]RAYHAN F,AHMED S,MAHBUB A,et al.CUSBoost:Cluster-based Under-sampling with Boosting for Imbalanced Classification[C]//11th International Conference on Software Knowledge Information Management and Applications (SKIMA).2017:1-6.
[25]SEIFFERT C,KHOSHGOFTAAR T M,VAN HULSE J,et al.RUSBoost:a hybrid approach to alleviating class imbalance[J].IEEE Transactions on Systems,Man and Cybernetics,Part A:Systems and Humans,2010,40(1):185-197.
[26]WANG Z Z,HUANG B,FAN Z J,et al.Improved SMOTE unbalanced data integration classification algorithm[J].Journal of Computer Applications,2019,39(9):2591-2596.
[27]MIRZA M,OSINDERO S.Conditional Generative Adversarial Nets[J].arXiv:1411.1784,2014.
[28]PENG H,LONG F,DING C.Feature selection based on mutual information:Criteria of Max-Dependency,Max-Relevance,and Min-Redundancy[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2005,27(8):1226-1238.
[29]LIN T Y,GOYAL P,GIRSHICK R,et al.Focal Loss for Dense Object Detection[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2017,PP(99):2999-3007.
[30]CHEN T,GUESTRIN C.XGBoost:A scalable tree boosting system[C]//Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2016:785-794.
[31]CAI Z X,WANG X Y,X J,et al.Sample Adaptive Classifier for Imbalanced Data[J].Computer Science,2019,46(1):94-99.
[32]XIONG B Y,WANG G Y,DENG W B.Under-Sampling Method Based on Sample Weight for Imbalanced Data[J].Journal of Computer Research and Development,2016,53(11):2613-2622.
[33]MATHEW J,PANG C K,LUO M,et al.Classification of Imbalanced Data by Oversampling in Kernel Space of Support Vector Machines[J].IEEE Transactions on Neural Networks and Learning Systems,2018,29(9):4065-4076.
[34]LI X F,L J,D Y F,et al.A New Learning Algorithm for Imbalanced Data-PCBoost[J].Chinese Journal of Computers,2012,35(2):202-209.
[35]ZHANG N,CHEN Q.Ensemble learning training method based on AUC and Q statistics[J].Journal of Computer Applications,2019,39(4):935-939.