计算机科学 ›› 2013, Vol. 40 ›› Issue (11): 215-221.
陈湘涛,张超,韩茜
CHEN Xiang-tao,ZHANG Chao and HAN Qian
摘要: 共享知识挖掘是指通过学习不同事物之间的共享知识,将学习到的知识应用到未知事物来加快认知未知事物。针对大数据集中串行共享知识挖掘算法效率低下的问题,结合云计算技术,提出了一种基于Hadoop的并行共享决策树挖掘算法(PSDT)。该算法采用传统的属性表结构实现并行挖掘,但 其I/O操作过多,影响算法性能,为此,进一步提出了一种混合并行共享决策树挖掘算法(HPSDT)。该算法采用混合数据结构,在计算分裂指标阶段使用属性表结构,在分裂阶段采用数据记录结构。数据分析表明,HPSDT算法简化了分裂过程,其I/O操作是PSDT的0.34左右。实验结果表明,PSDT和HPSDT都具有良好的并行性和扩展性;HPSDT比PSDT性能更好,并且随着数据集的增大,HPSDT的优越性更加明显。
[1] Dong Guo-zhu,Han Qian.Mining Shared Decision Trees be-tween Datasets[R].OH.USA:Wright State University/OhioLINK,2010 [2] Quinlan J R.Induction of decision trees [J].Machine Learning,1986(1):81-106 [3] Quinlan J R.C4.5:Programs for machine learning [M].SanFrancisco:Morgan Kaufmann Publishers Inc.,1992 [4] Manish M,Rakesh A,Jorma R.SLIQ:A fast scalable classier for data mining[C]∥EDBT’ 96.London:Springer-Verlag,1996:18-32 [5] John C S,Rakesh A,Manish M.SPRINT:A scalable parallel classier for data mining[C]∥22th VLDB Conference.San Francisco:Morgan Kaufmann Publishers Inc.,1996:544-555 [6] Hayes B.Cloud computing [J].Communications of the ACM,2008,51(7):9-11 [7] Armbrust M,Stoica I,Zaharia M,et al.A view of cloud computing [J].Communications of the ACM,2010,53(4):50-58 [8] Mohammed J Z,Ho Ching-tien,Rakesh A.Parallel classification for data mining on shared-memory multiprocessors [C]∥ICDE:IEEE International Conference on Data Engineering.Sydney:IEEE Computer Society,1999:198-205 [9] Mohammed J Z.Parallel and distributed association mining:asurvey [J].IEEE Concurrency,1999,7(4):14-25 [10] Ruo-ming J,Ge Yang,Gagan A.Shared memory parallelizetion of data mining algorithms:techniques,programming interface,and performance[J].Transactions on Knowledge and Data Engineering,2005,17(1):71-89 [11] Peter S P.Parallel Programming with MPI [M].San Francisco:Morgan Kaufmann Publishers Inc.,1997 [12] Gropp W,Lusk E,Skjellum A.Using MPI portable parallel programming with the message-passing interface [M].Cambridge:MIT Press,1999 [13] Nuno A,Joo G,Fernando S.Exploiting Parallelism in Decision Tree Induction[C]∥ECML/PKDD Workshop on Parallel and Distributed computing for Machine Learning.2003:13-22 [14] Wu Rong,Liang Shuai,Liao Hua-ming.Cyclic workflow execution mechanism on top of MapReduce framework[C]∥Seventh International Conference on Semantics,Knowledge and Grids.Washington,DC:IEEE Computer Society,2011:28-35 [15] Jeffrey D,Sanjay G.MapReduce:Simplified Data Processing on Large Clusters [J].Communications of the ACM,2008,51(1):107-113 [16] Sanjay G,Howard G,Leung S T.The Google file system [C]∥ the Nineteenth ACM Symposium on Operating Systems Principles(SOSP’ 03).Bolton Landing,NY,USA,2003,37(5):29-43 [17] Kang U,Tsourakakis C,Appel A P,et al.HADI:fast diameter estimation and mining in massive graphs with Hadoop[R].Pittsburgh:Carnegie Mellon University,2008 [18] Lin J,Dyer C.Data-intensive text processing with MapReduce[M].Morgan and Claypool Publishers,2010 [19] Ene A,Im S,Moseley B.Fast clustering using MapReduce[C]∥Proceeding of the 17th ACM SIGKDD Inernational Conference on Knowledge Discovery and Data Ming.New York:ACM,2011:681-689 [20] 向小军,高阳,商琳,等.基于Hadoop平台的海量文本分类的并行化[J].计算机科学,2011,38(10):184-188 [21] 赵卫中,马慧芳,傅燕翔,等.基于云计算平台Hadoop的并行k-means聚类算法设计研究[J].计算机科学,2011,38(10):166-168 [22] Biswanath P,Joshua S H,Sugato B,et al.PLANET:Massi-vely parallel learning of tree ensembles with MapReduce[C]∥35th International Conference on Very Large Data Bases(VLDB-2009).2009:1426-1437 [23] Lu Qiu,Cheng Xiao-hui.The Research of Decision Tree Mining Based on Hadoop[C]∥9th International Conference on Fuzzy Systems and Knowledge Discovery(FSKD 2012).2012:798-801 [24] Apache.Hadoop[EB/OL].http://hadoop.apache.org/,2012-10-20 [25] White T.Hadoop:权威指南 [M].曾大聃,周傲英,译.北京:清华大学出版社,2010 [26] Mohammed J Z.Parallel and Distributed Data Mining:An Introduction[C]∥Revised Papers from Large-Scale Parallel Data Mining,Workshop on Large-Scale Parallel KDD Systems.London,UK:SIGKDD,2000:1-23 [27] Ruoming J,Gagan A.Communication and memory efficient paral-lel decision tree construction[C]∥the third SIAM International Conference on Data Mining.San Fra ncisco:Society for Industrial & Applied,2003:119-129 |
No related articles found! |
|