Computer Science ›› 2013, Vol. 40 ›› Issue (11): 215-221.

• Software and Database Technology •

Research on Parallel Shared Decision Tree Mining Algorithm Based on Hadoop

CHEN Xiang-tao, ZHANG Chao, HAN Qian

  1. College of Information Science and Engineering, Hunan University, Changsha 410082, China; College of Information Science and Engineering, Hunan University, Changsha 410082, China; Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA
  • Online: 2018-11-16 Published: 2018-11-16
  • Fund:
    This work was supported by the National Natural Science Foundation of China (61240046).

Research on Parallel Shared Decision Tree Algorithm Based on Hadoop

CHEN Xiang-tao, ZHANG Chao and HAN Qian

  • Online: 2018-11-16 Published: 2018-11-16

Abstract: Shared knowledge mining learns the knowledge shared between different objects and applies that knowledge to unknown objects in order to speed up their recognition. To address the low efficiency of serial shared knowledge mining algorithms on large datasets, a Hadoop-based Parallel Shared Decision Tree mining algorithm (PSDT) is proposed by combining cloud computing technology. PSDT uses the traditional attribute-list structure for parallel mining, but its excessive I/O operations hurt performance, so a Hybrid Parallel Shared Decision Tree mining algorithm (HPSDT) is further proposed. HPSDT adopts a hybrid data structure: attribute lists are used when computing the split indicators, and a data-record structure is used in the splitting phase. Data analysis shows that HPSDT simplifies the splitting process and that its I/O operations are about 0.34 times those of PSDT. Experimental results show that both PSDT and HPSDT have good parallelism and scalability, that HPSDT outperforms PSDT, and that the advantage of HPSDT becomes more pronounced as the dataset size grows.
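
To make the attribute-list idea concrete, the following is a minimal sketch, in Java (the usual language for Hadoop jobs), of one entry of a traditional SLIQ/SPRINT-style attribute list. The class and field names are illustrative assumptions and are not taken from the paper.

    // One entry of an attribute list: one list is kept per attribute, and each
    // entry stores the attribute value, the class label and the id of the
    // originating data record. Names and types are illustrative only.
    public class AttributeListEntry {
        public final String attributeValue; // value of this attribute in one record (kept as text for simplicity)
        public final String classLabel;     // class label of that record
        public final long recordId;         // back-pointer to the full data record

        public AttributeListEntry(String attributeValue, String classLabel, long recordId) {
            this.attributeValue = attributeValue;
            this.classLabel = classLabel;
            this.recordId = recordId;
        }
    }

Keeping one such list per attribute is what makes the split-indicator computation easy to parallelize, but it also means every list has to be redistributed whenever a node is split, which is one source of the extra I/O that HPSDT avoids by switching to whole data records in the splitting phase.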

Keywords: Shared decision tree, Parallel shared decision tree, Hybrid data structure, Cloud computing, Hadoop

Abstract: Shared knowledge mining addresses knowledge discovery shared by two (or more) applications/datasets; it can help users investigate a poorly understood dataset by analogy with, and by transferring knowledge from, a well understood one. With the advent of the big-data era, however, existing serial algorithms cannot keep pace with rapid data growth and are inefficient on large datasets. A Parallel Shared Decision Tree algorithm (PSDT) based on Hadoop was therefore introduced using cloud computing. PSDT uses the traditional attribute-list structure, but the excessive I/O operations in its parallel processing degrade performance. A Hybrid Parallel Shared Decision Tree algorithm (HPSDT), which uses a hybrid data structure, was then proposed: it applies attribute lists to compute the split indicators in parallel and a data-record structure to split the data in the splitting procedure. Data analysis indicates that HPSDT simplifies the splitting process and that its I/O operations are about 0.34 times those of PSDT. The experimental results show that PSDT and HPSDT both have good parallelism and scalability, and that HPSDT performs better than PSDT; the superiority becomes even more obvious as the dataset size increases.
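
As a hedged sketch of how the split-indicator phase could be expressed as a Hadoop MapReduce job (an illustration consistent with the abstract, not the paper's actual implementation), the mapper below reads attribute-list entries and keys them by (attribute, value), and the reducer computes the Gini impurity of the records sharing that value. The input format, class names and the choice of Gini as the split indicator are all assumptions.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: each input line is assumed to be one attribute-list entry in the
    // CSV form "attributeName,attributeValue,classLabel,recordId". Keying by
    // (attribute, value) lets one reduce call see all class labels of the
    // records that take this value and score the candidate split.
    public class SplitIndicatorMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");
            if (f.length < 4) {
                return; // skip malformed lines
            }
            context.write(new Text(f[0] + ":" + f[1]), new Text(f[2]));
        }
    }

    // Reducer: counts class labels for one (attribute, value) key and emits the
    // Gini impurity 1 - sum_c (n_c / n)^2 as a simple split indicator.
    class SplitIndicatorReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> classLabels, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> counts = new HashMap<>();
            long total = 0;
            for (Text label : classLabels) {
                counts.merge(label.toString(), 1, Integer::sum);
                total++;
            }
            double gini = 1.0;
            for (int c : counts.values()) {
                double p = (double) c / total;
                gini -= p * p;
            }
            context.write(key, new DoubleWritable(gini));
        }
    }

The driver that wires these classes into a Job, and the subsequent splitting pass over whole data records, are omitted; the point is only that attribute-list entries keyed by (attribute, value) let all candidate splits be scored in one parallel pass, which matches the division of labor between the two phases of HPSDT described in the abstract.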

Key words: Shared decision tree, Parallel shared decision tree, Hybrid data structure, Cloud computing, Hadoop

