基于Hadoop的并行共享决策树挖掘算法研究

Abstract

Abstract: Shared knowledge is focused on the problems of knowledge discovery shared by two(or more) applications/datasets,and it can help users investigate a poorly understood dataset from a well understood dataset by analogy and transferring knowledge．However,with the advent of the era of big data,the existing algorithms based on serialized can not able to meet the need of the rapid data growth and the serial algorithm has low efficiency in big datasets．The Parallel Shared Decision Tree algorithm(PSDT) based on Hadoop was introduced by using cloud computing．The traditional attributes list structure is uses in the PSDT．But the overmuch operations of I/O occur in the parallel processes of this algorithm,which will influence the algorithm’s performance．A new Hybrid Parallel Shared Decision Tree algorithm(HPSDT),which uses hybrid data structures,was proposed．The algorithm applies attributes-list to compute the split indicators parallelly,and data records-structure to split in the splitting procedure．Data analysis indicates that the algorithm of HPSDT simplifies the splitting process,and its operations of I/O are 0.34times of the PSDT．The experimental results show that PSDT and HPSDT have good parallelism and scalability,furthermore,the performance of HPSDT is better than that of PSDT．Especially,the superiority is even more obvious with the increase of the dataset size.

Key words: Shared decision tree,Parallel shared decision tree,Hybrid data structure,Cloud computing,Hadoop

CHEN Xiang-tao,ZHANG Chao and HAN Qian. Research on Parallel Shared Decision Tree Algorithm Based on Hadoop[J].Computer Science, 2013, 40(11): 215-221.

References

[1] Dong Guo-zhu,Han Qian．Mining Shared Decision Trees be-tween Datasets[R]．OH.USA:Wright State University/OhioLINK,2010
[2] Quinlan J R.Induction of decision trees [J]．Machine Learning,1986(1):81-106
[3] Quinlan J R．C4.5:Programs for machine learning [M]．SanFrancisco:Morgan Kaufmann Publishers Inc.,1992
[4] Manish M,Rakesh A,Jorma R.SLIQ:A fast scalable classier for data mining[C]∥EDBT’ 96.London:Springer-Verlag,1996:18-32
[5] John C S,Rakesh A,Manish M．SPRINT:A scalable parallel classier for data mining[C]∥22th VLDB Conference．San Francisco:Morgan Kaufmann Publishers Inc.,1996:544-555
[6] Hayes B．Cloud computing [J]．Communications of the ACM,2008,51(7):9-11
[7] Armbrust M,Stoica I,Zaharia M,et al．A view of cloud computing [J]．Communications of the ACM,2010,53(4):50-58
[8] Mohammed J Z,Ho Ching-tien,Rakesh A．Parallel classification for data mining on shared-memory multiprocessors [C]∥ICDE:IEEE International Conference on Data Engineering．Sydney:IEEE Computer Society,1999:198-205
[9] Mohammed J Z．Parallel and distributed association mining:asurvey [J]．IEEE Concurrency,1999,7(4):14-25
[10] Ruo-ming J,Ge Yang,Gagan A．Shared memory parallelizetion of data mining algorithms:techniques,programming interface,and performance[J].Transactions on Knowledge and Data Engineering,2005,17(1):71-89
[11] Peter S P．Parallel Programming with MPI [M]．San Francisco:Morgan Kaufmann Publishers Inc.,1997
[12] Gropp W,Lusk E,Skjellum A．Using MPI portable parallel programming with the message-passing interface [M]．Cambridge:MIT Press,1999
[13] Nuno A,Joo G,Fernando S．Exploiting Parallelism in Decision Tree Induction[C]∥ECML/PKDD Workshop on Parallel and Distributed computing for Machine Learning．2003:13-22
[14] Wu Rong,Liang Shuai,Liao Hua-ming．Cyclic workflow execution mechanism on top of MapReduce framework[C]∥Seventh International Conference on Semantics,Knowledge and Grids．Washington,DC:IEEE Computer Society,2011:28-35
[15] Jeffrey D,Sanjay G．MapReduce:Simplified Data Processing on Large Clusters [J]．Communications of the ACM,2008,51(1):107-113
[16] Sanjay G,Howard G,Leung S T．The Google file system [C]∥ the Nineteenth ACM Symposium on Operating Systems Principles(SOSP’ 03).Bolton Landing,NY,USA,2003,37(5):29-43
[17] Kang U,Tsourakakis C,Appel A P,et al．HADI:fast diameter estimation and mining in massive graphs with Hadoop[R]．Pittsburgh:Carnegie Mellon University,2008
[18] Lin J,Dyer C．Data-intensive text processing with MapReduce[M]．Morgan and Claypool Publishers,2010
[19] Ene A,Im S,Moseley B．Fast clustering using MapReduce[C]∥Proceeding of the 17th ACM SIGKDD Inernational Conference on Knowledge Discovery and Data Ming．New York:ACM,2011:681-689
[20] 向小军,高阳,商琳,等.基于Hadoop平台的海量文本分类的并行化[J].计算机科学,2011,38(10):184-188
[21] 赵卫中,马慧芳,傅燕翔,等．基于云计算平台Hadoop的并行k-means聚类算法设计研究[J].计算机科学,2011,38(10):166-168
[22] Biswanath P,Joshua S H,Sugato B,et al．PLANET:Massi-vely parallel learning of tree ensembles with MapReduce[C]∥35th International Conference on Very Large Data Bases(VLDB-2009)．2009:1426-1437
[23] Lu Qiu,Cheng Xiao-hui．The Research of Decision Tree Mining Based on Hadoop[C]∥9th International Conference on Fuzzy Systems and Knowledge Discovery(FSKD 2012)．2012:798-801
[24] Apache.Hadoop[EB/OL].http://hadoop．apache.org/,2012-10-20
[25] White T．Hadoop:权威指南 [M].曾大聃,周傲英,译．北京:清华大学出版社,2010
[26] Mohammed J Z．Parallel and Distributed Data Mining:An Introduction[C]∥Revised Papers from Large-Scale Parallel Data Mining,Workshop on Large-Scale Parallel KDD Systems．London,UK:SIGKDD,2000:1-23
[27] Ruoming J,Gagan A．Communication and memory efficient paral-lel decision tree construction[C]∥the third SIAM International Conference on Data Mining．San Fra ncisco:Society for Industrial & Applied,2003:119-129

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

Research on Parallel Shared Decision Tree Algorithm Based on Hadoop

PDF (PC)

Abstract

Cite this article

share this article

References

Related Articles 0

Metrics

Comments

Recommended 0