Computer Science, 2022, Vol. 49, Issue (1): 108-114. doi: 10.11896/jsjkx.201200189

• Database & Big Data & Data Science •

Multivariate Regression Forest for Categorical Attribute Data

LIU Zhen-yu1, SONG Xiao-ying2

  1. 1 School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
    2 School of Computer, Dalian Neusoft University of Information, Dalian, Liaoning 116023, China
  • Received: 2020-12-22 Revised: 2021-03-14 Online: 2022-01-15 Published: 2022-01-18
  • Corresponding author: LIU Zhen-yu (liuzhenyu@neusoft.edu.cn)
  • About author: LIU Zhen-yu, born in 1978, postgraduate, professor. His main research interests include machine learning and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China (61772101).

Abstract: Regression models such as linear regression, SVR, and most multivariate regression trees cannot use categorical attributes directly. To address this, this paper proposes a multivariate node-splitting method for decision trees that handles multiple attribute types jointly. The method defines the center of a sample set on each categorical attribute and the distance from a sample to that center, so that categorical attributes can take part in the clustering process in the same way as numerical attributes, and the resulting clusters form the partition of the sample set. A suitable ensemble scheme is then selected for the decision trees generated by this method, and the resulting ensemble is called the cluster regression forest (CRF). Finally, CRF is compared with 9 other regression models on 12 public UCI data sets in terms of mean absolute error (MAE) and root mean square error (RMSE). The experimental results show that CRF performs best among the 10 regression models.
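The abstract does not give the exact definitions of the categorical center or the sample-to-center distance, so the following is only a minimal sketch of the general idea, under assumptions that are ours rather than the paper's: the center of a sample set on a categorical attribute is taken to be the relative frequency of each category, the distance on that attribute is one minus the frequency of the sample's category, numerical attributes use the mean and a squared difference, and the node is split by a simple 2-means-style loop. All function names (`categorical_center`, `compute_center`, `distance`, `cluster_split`) are hypothetical.

```python
from collections import Counter


def categorical_center(values):
    """Relative frequency of each category in a sample set (assumed definition)."""
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items()}


def compute_center(samples, cat_idx):
    """Per-attribute center: frequency table if categorical, mean if numerical."""
    center = []
    for j in range(len(samples[0])):
        column = [s[j] for s in samples]
        if j in cat_idx:
            center.append(categorical_center(column))
        else:
            center.append(sum(column) / len(column))
    return center


def distance(sample, center, cat_idx):
    """Distance from one sample to a center over mixed attributes (assumed form)."""
    d = 0.0
    for j, value in enumerate(sample):
        if j in cat_idx:
            # A category that is frequent in the cluster counts as close to it.
            d += 1.0 - center[j].get(value, 0.0)
        else:
            d += (value - center[j]) ** 2
    return d


def cluster_split(samples, cat_idx, seeds=(0, 1), iterations=10):
    """Partition samples into two child nodes with a 2-means-style loop."""
    centers = [compute_center([samples[i]], cat_idx) for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min((0, 1), key=lambda k: distance(s, centers[k], cat_idx))
            for s in samples
        ]
        for k in (0, 1):
            members = [s for s, a in zip(samples, assignment) if a == k]
            if members:  # keep the previous center if a cluster becomes empty
                centers[k] = compute_center(members, cat_idx)
    return assignment


# Toy data: one numerical attribute (scaled to [0, 1]) and one categorical attribute.
data = [
    (0.10, "red"),
    (0.20, "red"),
    (0.15, "red"),
    (0.90, "blue"),
    (0.80, "blue"),
    (0.85, "blue"),
]
print(cluster_split(data, cat_idx={1}, seeds=(0, 3)))  # [0, 0, 0, 1, 1, 1]
```

In this sketch the categorical attribute contributes to the distance alongside the numerical one, so no one-hot encoding is needed before the split. The paper's own definitions and its ensemble scheme may differ from these assumptions.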

Key words: Decision trees, Ensemble learning, Gradient boosting, Multi-variable regression trees, Random forest
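The two error measures named in the abstract have standard definitions: MAE = (1/n) Σ|y_i − ŷ_i| and RMSE = sqrt((1/n) Σ(y_i − ŷ_i)²). A small illustration follows. The numbers are made up for the example and are not results from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Illustrative values only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(mae(y_true, y_pred))   # 0.75
print(rmse(y_true, y_pred))  # sqrt(1.25), about 1.118
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two measures are usually reported together.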

CLC Number:

  • TP393