Computer Science ›› 2021, Vol. 48 ›› Issue (7): 145-154.doi: 10.11896/jsjkx.200800120

• Database & Big Data & Data Science •

Improved Random Forest Imbalance Data Classification Algorithm Combining Cascaded Up-sampling and Down-sampling

ZHENG Jian-hua1,2, LI Xiao-min3, LIU Shuang-yin1,2, LI Di4   

  1. 1 College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
    2 Guangdong Engineering &Technology Research Center for Smart Agriculture,Guangzhou 510225,China
    3 College of Mechanical and Electrical Engineering,Zhongkai University of Agriculture and Engineering,Guangzhou 510225,China
    4 School of Mechanical and Automotive Engineering,South China University of Technology,Guangzhou 510640,China
  • Received:2020-08-19 Revised:2020-09-21 Online:2021-07-15 Published:2021-07-02
  • About author: ZHENG Jian-hua, born in 1977, Ph.D, associate professor, master supervisor. His main research interests include big data mining, machine learning and AI in smart agriculture. (zhengjianhua@mail.zhku.edu.cn)
    LI Xiao-min, born in 1981, Ph.D, associate professor. His main research interests include cyber-physical systems, smart manufacturing, big data and wireless networks.
  • Supported by:
    National Key R&D Program of China (2018YFB1700500), National Natural Science Foundation of China (61471133, 61871475), Science and Technology Planning Project of Guangdong Province of China (2017A070712019, 2017B010126001, 2020A1414050062), Project of Educational Commission of Guangdong Province of China (2016KZDXM001, 2017GCZX001, 2020KZDZX1121) and Science and Technology Planning Project of Guangzhou (201704030098).

Abstract: Data imbalance seriously degrades the performance of traditional classification algorithms, and imbalanced data classification has become a hot and difficult problem in the field of machine learning. To improve the detection rate of minority-class samples in imbalanced data sets, this paper proposes an improved random forest algorithm. Its core idea is to apply hybrid sampling to the data set of each random forest subtree drawn by Bootstrap sampling: first, inverse-weight up-sampling based on a Gaussian mixture model is performed; then cascaded up-sampling based on the Borderline-SMOTE1 algorithm is carried out; finally, random down-sampling is applied, yielding a balanced training subset for each subtree. A decision-tree-based improved random forest learner then implements the imbalanced data classification algorithm. Using G-means and AUC as evaluation indexes, the proposed algorithm is compared with 10 different algorithms on 15 public data sets, where it ranks first in both the average ranking and the average value of the two indexes. It is further compared with 6 state-of-the-art algorithms on 9 data sets, achieving better results than the other algorithms in 28 of the 32 comparisons. The experimental results show that the proposed algorithm helps improve the detection rate of the minority class and has better classification performance.

Key words: Cascaded up-sampling, Classification algorithm, Imbalance data, Random forest

CLC Number: TP181