Computer Science ›› 2019, Vol. 46 ›› Issue (2): 196-201. doi: 10.11896/j.issn.1002-137X.2019.02.030

• Artificial Intelligence •

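Randomization of Classes Based Random Forest Algorithm

GUAN Xiao-qiang, PANG Ji-fang, LIANG Ji-ye

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
    Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 030006, China
  • Received: 2018-09-07 Online: 2019-02-25 Published: 2019-02-25
  • Corresponding author: LIANG Ji-ye (born 1962), male, Ph.D., professor, CCF member. His main research interests are granular computing, data mining, and machine learning. E-mail: ljy@sxu.edu.cn
  • About the authors: GUAN Xiao-qiang (born 1979), female, Ph.D. candidate, lecturer, CCF member. Her main research interests are data mining and machine learning. E-mail: gxq0079@sxu.edu.cn; PANG Ji-fang (born 1980), female, Ph.D., lecturer, CCF member. Her main research interests are intelligent decision-making and data mining.
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61876103), the Youth Science and Technology Foundation of Shanxi Province (201701D221098), the Key Research and Development Project of Shanxi Province (201603D111014), and the Overseas Study Fund Project of Shanxi Province (2016-003).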

Abstract: Random forest is a widely used classification method in data mining and machine learning. It has attracted broad research attention in China and abroad and has been applied to many practical problems. Traditional random forest methods do not consider how the number of classes affects classification performance, and they ignore the relationship between base classifiers and classes, which limits their performance on multi-class classification problems. To address this, and drawing on the characteristics of multi-class problems, this paper proposes a random forest algorithm based on randomization of classes (RCRF). Working from the perspective of classes, RCRF adds class randomization to the two traditional randomizations of random forest and designs base classifiers with different emphases for different classes. Because different base classifiers focus on distinguishing different classes, the decision trees they generate differ in structure. This preserves the performance of each individual base classifier while further increasing the diversity among base classifiers. To verify the effectiveness of the proposed algorithm, RCRF is compared with other algorithms on 21 data sets from the UCI database. The experiments have two parts. First, accuracy, F1-measure, and the Kappa coefficient are used to evaluate the performance of RCRF. Second, the κ-error diagram is used to compare the algorithms in terms of diversity. Experimental results show that the proposed algorithm effectively improves the overall performance of the ensemble model and has clear advantages on multi-class classification problems.

Key words: Diversity, Multi-class classification problems, Random forest, Randomization of classes
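
The abstract names the κ-error diagram as the tool for comparing ensembles by diversity. As a rough illustration only, the sketch below computes the points such a diagram plots: one point per pair of base classifiers, with the pairwise Cohen's kappa (agreement between the two classifiers) on one axis and the pair's mean error rate on the other. Lower kappa indicates a more diverse pair, and lower mean error a more accurate one. The function names and the toy predictions are hypothetical and are not taken from the paper; the paper's own experimental setup is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    # Observed agreement: fraction of samples where both sequences match.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_chance == 1.0:  # both sequences use a single identical label
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


def error_rate(pred, truth):
    """Fraction of samples where the prediction differs from the truth."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)


def kappa_error_points(predictions, truth):
    """Return one (kappa, mean error) point per pair of base classifiers."""
    errors = [error_rate(p, truth) for p in predictions]
    return [
        (cohen_kappa(predictions[i], predictions[j]),
         (errors[i] + errors[j]) / 2)
        for i, j in combinations(range(len(predictions)), 2)
    ]


# Toy, made-up predictions from three hypothetical base trees on six samples.
truth = [0, 1, 2, 0, 1, 2]
tree_predictions = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 2, 1],
    [0, 1, 1, 0, 1, 2],
]
points = kappa_error_points(tree_predictions, truth)
for kappa, mean_error in points:
    print(f"kappa={kappa:.3f}  mean_error={mean_error:.3f}")
```

Plotting these points for each ensemble gives one point cloud per algorithm. In practice the predictions would come from the fitted trees of each forest on a held-out test set.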

CLC Number:

  • TP181