Computer Science (计算机科学), 2021, Vol. 48, Issue 7: 145-154. doi: 10.11896/jsjkx.200800120
郑建华1,2, 李小敏3, 刘双印1,2, 李迪4
ZHENG Jian-hua1,2, LI Xiao-min3, LIU Shuang-yin1,2, LI Di4
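Abstract: Data imbalance severely degrades the performance of traditional classification algorithms, and imbalanced data classification is an active and difficult problem in machine learning. To improve the detection rate of minority-class samples in imbalanced datasets, an improved random forest algorithm is proposed. The core of the algorithm is to apply hybrid sampling to the dataset of each random forest subtree obtained by Bootstrap sampling. First, inverse-weight oversampling based on a Gaussian mixture model is applied; then cascaded oversampling is performed with the SMOTE-borderline1 algorithm; finally, random undersampling is applied, yielding a balanced training subset for each subtree. Decision trees are then used as base learners to build the improved random forest classification algorithm for imbalanced data. With G-mean and AUC as evaluation metrics, the proposed algorithm is compared with 10 different algorithms on 15 public datasets; it ranks first in both average rank and average value on both metrics. It is further compared with 6 state-of-the-art algorithms on 9 of these datasets: in 32 comparisons, the proposed algorithm outperforms the other algorithms 28 times. The experimental results show that the proposed algorithm helps improve the detection rate of the minority class and achieves better classification performance.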
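The abstract names G-mean as an evaluation metric without defining it. In its common binary form, G-mean is the geometric mean of sensitivity (minority-class recall) and specificity (majority-class recall), so a classifier that ignores the minority class scores 0 no matter how high its accuracy is. The following is a minimal sketch under that assumption; the function name `g_mean`, the label convention (1 = minority class), and the toy label vectors are illustrative and do not come from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR), binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    sensitivity = tp / (tp + fn)  # recall on the minority (positive) class
    specificity = tn / (tn + fp)  # recall on the majority (negative) class
    return math.sqrt(sensitivity * specificity)


# Toy data (hypothetical): 95 majority samples followed by 5 minority samples.
y_true = [0] * 95 + [1] * 5

# Predicts the majority class for everything: 95% accuracy, no minority detected.
always_majority = [0] * 100

# Raises 10 false alarms on the majority class but detects 4 of 5 minority samples.
minority_aware = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

print(g_mean(y_true, always_majority))  # 0.0
print(g_mean(y_true, minority_aware))
```

The second classifier has lower accuracy (89% versus 95%) yet a much higher G-mean, which is why this metric suits a study whose goal is the minority-class detection rate.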