Computer Science ›› 2016, Vol. 43 ›› Issue (6): 55-58. doi: 10.11896/j.issn.1002-137X.2016.06.011

Parallelization of Random Forest Algorithm Based on Discretization and Selection of Weak-correlation Feature Subspaces

CHEN Min-cheng, YUAN Jing-ling, WANG Xiao-yan and ZHU Sai   

Online: 2018-12-01 Published: 2018-12-01

Abstract: In the era of big data, the volume of data is growing exponentially, which poses great challenges for traditional classification algorithms. To improve the efficiency of classification, this paper proposes a parallel random forest algorithm based on discretization and the selection of weak-correlation feature subspaces. The algorithm discretizes continuous attributes in the data preprocessing phase. When selecting feature subspaces for growing decision trees, it uses a vector space model of the attributes to calculate the correlation between attributes and then constructs weak-correlation feature subspaces. This not only reduces the correlation among decision trees but also improves the classification performance of the random forest. A double parallel method for building the random forest model is also designed and implemented on the MapReduce framework.

Key words: Random forest, Discretization, Weak-correlation feature subspaces, Parallel classification
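
The abstract names three steps (discretizing continuous attributes, measuring correlation between attributes via a vector space model, and building weak-correlation feature subspaces) without giving their details. The following is only a minimal sketch of how those steps might fit together, not the authors' implementation. Equal-width binning, absolute cosine similarity between mean-centered attribute columns, the greedy selection rule, the threshold value, and all function names are assumptions introduced for illustration. The MapReduce parallelization is not covered.

```python
import math
import random


def discretize_equal_width(column, bins=4):
    """Map continuous values to bin indices 0..bins-1 (assumed equal-width binning)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0] * len(column)
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in column]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute_correlation(columns):
    """Pairwise |cosine similarity| between mean-centered attribute vectors.

    Treating each attribute column as a vector is an assumed reading of the
    abstract's "vector space model of attributes".
    """
    centered = []
    for col in columns:
        mean = sum(col) / len(col)
        centered.append([x - mean for x in col])
    n = len(centered)
    return [[abs(cosine_similarity(centered[i], centered[j])) for j in range(n)]
            for i in range(n)]


def weak_correlation_subspace(corr, size, threshold, rng):
    """Greedily pick up to `size` attributes whose pairwise correlation is below `threshold`."""
    order = list(range(len(corr)))
    rng.shuffle(order)  # random visiting order, so different trees get different subspaces
    chosen = []
    for idx in order:
        if all(corr[idx][c] < threshold for c in chosen):
            chosen.append(idx)
        if len(chosen) == size:
            break
    return sorted(chosen)


# Toy data: attribute 1 is a scaled copy of attribute 0; attribute 2 is unrelated.
raw_columns = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 4, 6, 8, 10, 12, 14, 16],
    [3, 6, 1, 8, 7, 2, 5, 4],
]
discrete_columns = [discretize_equal_width(col, bins=4) for col in raw_columns]
corr = attribute_correlation(discrete_columns)
subspace = weak_correlation_subspace(corr, size=2, threshold=0.5, rng=random.Random(0))
print("selected attribute indices:", subspace)
```

In this toy data the two strongly correlated attributes can never end up in the same subspace, so each tree grown on such a subspace would see less redundant information. How the paper actually scores and selects attributes may differ.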
