Computer Science ›› 2021, Vol. 48 ›› Issue (7): 145-154. doi: 10.11896/jsjkx.200800120

• Database & Big Data & Data Science •

Improved Random Forest Imbalance Data Classification Algorithm Combining Cascaded Up-sampling and Down-sampling

ZHENG Jian-hua1,2, LI Xiao-min3, LIU Shuang-yin1,2, LI Di4

  1 College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
  2 Guangdong Engineering & Technology Research Center for Smart Agriculture, Guangzhou 510225, China
  3 College of Mechanical and Electrical Engineering, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
  4 School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510640, China
  • Received: 2020-08-19 Revised: 2020-09-21 Online: 2021-07-15 Published: 2021-07-02
  • Corresponding author: LI Xiao-min (lixiaomin@zhku.edu.cn)
  • About author: ZHENG Jian-hua, born in 1977, Ph.D, associate professor, master supervisor. His main research interests include big data mining, machine learning and AI in smart agriculture. (zhengjianhua@mail.zhku.edu.cn)
    LI Xiao-min, born in 1981, Ph.D, associate professor. His main research interests include cyber-physical systems, smart manufacturing, big data and wireless networks.
  • Supported by:
    National Key R&D Program of China (2018YFB1700500), National Natural Science Foundation of China (61471133, 61871475), Science and Technology Planning Project of Guangdong Province of China (2017A070712019, 2017B010126001, 2020A1414050062), Project of Educational Commission of Guangdong Province of China (2016KZDXM001, 2017GCZX001, 2020KZDZX1121) and Science and Technology Planning Project of Guangzhou (201704030098).

Abstract: Data imbalance seriously degrades the performance of traditional classification algorithms, and imbalanced data classification is a hot and difficult problem in machine learning. To improve the detection rate of minority-class samples in imbalanced data sets, an improved random forest algorithm is proposed. The core of the algorithm is to apply hybrid sampling to the data set of each random forest subtree after Bootstrap sampling. First, inverse-weight up-sampling based on a Gaussian mixture model is applied; then cascaded up-sampling based on the SMOTE-borderline1 algorithm is carried out; next, random down-sampling is applied, yielding a balanced training subset for each subtree. Finally, decision trees are used as base learners to implement the improved random forest classification algorithm for imbalanced data. In addition, with G-mean and AUC as evaluation metrics, the proposed algorithm is compared with 10 different algorithms on 15 public data sets; it ranks first in both average rank and average value on the two metrics. Furthermore, it is compared with 6 state-of-the-art algorithms on 9 of these data sets: in 32 comparisons, the proposed algorithm outperforms the other algorithms 28 times. The experimental results show that the proposed algorithm helps improve the detection rate of the minority class and achieves better classification performance.

Key words: Cascaded up-sampling, Classification algorithm, Imbalance data, Random forest
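
The sampling steps named in the abstract can be sketched roughly as follows. This is a minimal illustration using NumPy and scikit-learn, not the authors' implementation. Several details are assumptions, since the abstract does not specify them: "inverse weight" is read as choosing a mixture component with probability proportional to 1 / (its mixing weight); both classes are brought to the midpoint of their bootstrap counts; the two up-sampling stages each fill about half of the minority gap; and the Borderline-SMOTE1 step is a simplified version with a fallback when no border points exist. All function names and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def gmm_upsample(X_min, n_new, rng, n_components=2):
    """Draw n_new synthetic minority points from a GMM fitted to X_min.
    Assumption: a component is chosen with probability proportional to
    1 / (its mixing weight), so sparse sub-clusters get more new points."""
    k = min(n_components, len(X_min))
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          reg_covar=1e-3, random_state=0).fit(X_min)
    p = 1.0 / gmm.weights_
    comp = rng.choice(k, size=n_new, p=p / p.sum())
    noise = rng.standard_normal((n_new, X_min.shape[1]))
    return gmm.means_[comp] + noise * np.sqrt(gmm.covariances_[comp])


def borderline_smote1(X, y, n_new, rng, minority=1, m=5, k=5):
    """Simplified Borderline-SMOTE1: interpolate from 'danger' minority points
    (at least half, but not all, of their m neighbours are majority)."""
    X_min = X[y == minority]
    if n_new <= 0 or len(X_min) < 2:
        return np.empty((0, X.shape[1]))
    m = min(m, len(X) - 1)
    idx = NearestNeighbors(n_neighbors=m + 1).fit(X).kneighbors(
        X_min, return_distance=False)[:, 1:]           # drop the point itself
    n_maj = (y[idx] != minority).sum(axis=1)
    danger = X_min[(n_maj >= m / 2) & (n_maj < m)]
    if len(danger) == 0:                                # assumed fallback
        danger = X_min
    k = min(k, len(X_min) - 1)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(
        danger, return_distance=False)[:, 1:]
    base = rng.integers(len(danger), size=n_new)
    mate = X_min[nbrs[base, rng.integers(k, size=n_new)]]
    return danger[base] + rng.random((n_new, 1)) * (mate - danger[base])


def balanced_bootstrap(X, y, rng, minority=1):
    """Bootstrap, GMM up-sampling, Borderline-SMOTE1, then random down-sampling.
    Assumption: both classes end at the midpoint of their bootstrap counts."""
    boot = rng.integers(len(X), size=len(X))
    Xb, yb = X[boot], y[boot]
    n_min = int((yb == minority).sum())
    n_maj = len(yb) - n_min
    if n_min < 2 or n_min >= n_maj:                     # nothing sensible to balance
        return Xb, yb
    target = (n_min + n_maj) // 2
    X_g = gmm_upsample(Xb[yb == minority], (target - n_min) // 2, rng)
    Xb = np.vstack([Xb, X_g])
    yb = np.concatenate([yb, np.full(len(X_g), minority)])
    X_s = borderline_smote1(Xb, yb, target - n_min - len(X_g), rng, minority)
    Xb = np.vstack([Xb, X_s])
    yb = np.concatenate([yb, np.full(len(X_s), minority)])
    keep_maj = rng.choice(np.flatnonzero(yb != minority), size=target, replace=False)
    keep = np.concatenate([np.flatnonzero(yb == minority), keep_maj])
    return Xb[keep], yb[keep]


def fit_forest(X, y, n_trees=15, seed=0):
    """Train one decision tree per balanced bootstrap subset."""
    rng = np.random.default_rng(seed)
    trees = []
    for i in range(n_trees):
        Xs, ys = balanced_bootstrap(X, y, rng)
        trees.append(DecisionTreeClassifier(max_features="sqrt",
                                            random_state=i).fit(Xs, ys))
    return trees


def forest_score(trees, X, minority=1):
    """Fraction of trees voting for the minority class."""
    return np.mean([t.predict(X) == minority for t in trees], axis=0)


def g_mean(y_true, y_pred, minority=1):
    """Geometric mean of minority-class recall and majority-class recall."""
    tpr = np.mean(y_pred[y_true == minority] == minority)
    tnr = np.mean(y_pred[y_true != minority] != minority)
    return float(np.sqrt(tpr * tnr))


# Toy data (hypothetical): two Gaussian blobs in 4 dimensions, 9:1 imbalance.
rng = np.random.default_rng(0)
def make_data(n_maj, n_min):
    X = np.vstack([rng.normal(0.0, 1.0, (n_maj, 4)), rng.normal(1.5, 1.0, (n_min, 4))])
    return X, np.array([0] * n_maj + [1] * n_min)

X_tr, y_tr = make_data(450, 50)
X_te, y_te = make_data(180, 20)
Xs, ys = balanced_bootstrap(X_tr, y_tr, rng)
score = forest_score(fit_forest(X_tr, y_tr), X_te)
gm = g_mean(y_te, (score >= 0.5).astype(int))
auc = roc_auc_score(y_te, score)
print(f"G-mean={gm:.3f}  AUC={auc:.3f}")
```

G-mean is computed here as the square root of the product of the two per-class recalls, so a classifier that ignores the minority class scores 0 regardless of its overall accuracy; that property is why it is commonly paired with AUC for imbalanced data.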

CLC Number: 

  • TP181