计算机科学 ›› 2017, Vol. 44 ›› Issue (7): 31-37.doi: 10.11896/j.issn.1002-137X.2017.07.006

• 2016 年全国理论计算机科学学术年会 • 上一篇    下一篇

一种基于MapReduce模型的高效频繁项集挖掘算法

朱坤,黄瑞章,张娜娜   

  1. 贵州大学计算机科学与技术学院 贵阳550025 贵州省公共大数据重点实验室 贵阳550025,贵州大学计算机科学与技术学院 贵阳550025 贵州省公共大数据重点实验室 贵阳550025,贵州大学计算机科学与技术学院 贵阳550025 贵州省公共大数据重点实验室 贵阳550025
  • 出版日期:2018-11-13 发布日期:2018-11-13
  • 基金资助:
    本文受国家自然科学基金(61462011,61202089),高等学校博士学科专项科研基金(20125201120006),贵州大学引进人才科研项目(2011015)资助

Efficient Frequent Patterns Mining Algorithm Based on MapReduce Model

ZHU Kun, HUANG Rui-zhang and ZHANG Na-na   

  • Online:2018-11-13 Published:2018-11-13

摘要: 由于互联网技术急速发展及其用户迅速地增加,很多网络服务公司每天不得不处理TB级甚至更大规模的数据量。在如今的大数据时代,如何挖掘有用的信息正变成一个重要的问题。关于数据挖掘(Data Mining)的算法在很多领域中已经被广泛运用,挖掘频繁项集是数据挖掘中最常见且最主要的应用之一,Apriori则是从一个大的数据集中挖掘出频繁项集的最为典型的算法。然而,当数据集比较大或使用单一主机时,内存将会被快速消耗,计算时间也将急剧增加,使得算法性能较低,基于MapReduce的分布式和并行计算则被提出。文中提出了一种改进的MMRA (Matrix MapReduce Algorithm)算法,它通过将分块数据转换成矩阵来挖掘所有的频繁k项集;然后将提出的算法和目前已经存在的两种算法(one-phase算法、k-phase算法)进行比较。采用Hadoop-MapReduce作为实验平台,并行和分布式计算为处理大数据集提供了一个潜在的解决方案。实验结果表明,改进算法的性能优于其他两种算法。

关键词: Hadoop,MapReduce,分布式计算,数据挖掘,频繁项集挖掘,Apriori算法

Abstract: Along with the rapid development of Internet and the rapid growing group of users,many Internet services companies have to manage TB size or higher amount of data every day.How to find useful information in this big data era is becoming an important problem.The data mining algorithm has been widely used in many fields,and finding frequent itemsets is one of the most common and primary applications of data mining,and Apriori algorithm is the most typical algorithm for finding frequent itemsets from a big transaction database.However,when the dataset size is rather huge or a single host is used,the memory would be quickly exhausted and the computation time would also increase dramatically,which make the algorithm performance inefficient.Parallel and distributed computing based on the MapReduce framework has been proposed.An improved reformative MMRA (Matrix MapReduce Algorithm) algorithm which should convert the blocked data into matrixs to find all frequent k-itemsets was proposed in this paper,and the proposed algorithm was compared with current two existed algorithms(one-phase algorithm and k-phase algorithm).Adapting Hadoop-MapReduce as the experiment platform,parallel and distributed computing offer a potential solution for processing vast amount of data.Experimental results show that the proposed algorithm outperforms the other two algorithms.

Key words: Hadoop,MapReduce,Distributed computing,Data mining,Frequent itemset mining,Apriori algorithm

[1] HE Y,WANG W Q,XUE F.Study of Massive Data MiningBased on Cloud Computing[J].Computer Technology and Development,2013,23(2):69-72.(in Chinese) 贺瑶,王文庆,薛飞.基于云计算的海量数据挖掘研究[J].计算机技术与发展,2013,23(2):69-72.
[2] TIAN S,WANG S,LIU Y,et al.An algorithm for mining frequent itemsets on uncertain dataset[J].Computer Modelling and New Technologies,2014:264-272.
[3] CHANG R,LIU Z.An improved apriori algorithm[C]∥2011 International Conference on Electronics and Optoelectronics.2011:1476-1478.
[4] AGRAWAL R,IMIELINSKI T,SWAMI A.Mining Association Rule between Sets of Items in Large Databases[C]∥ACM SIGMOD Conf.Management of Data,1993:207-216.
[5] LI W W,ZHAO H,ZHANG Y,et al.Research on massive data mining based on MapReduce[J].Computer Engineering and Applications,2013,49(20):112-117.(in Chinese) 李伟卫,赵航,张阳,等.基于MapReduce的海量数据挖掘技术研究[J].计算机工程与应用,2013,49(20):112-117.
[6] LIU S H,LIU S J,CHEN X S,et al.IOMRA-- A High Efficiency Frequent Itemset Mining Algorithm Based on the MapReduce[C]∥IEEE 17th International Conference on Computation Model.2014:1290-1295.
[7] YANG X Y,LIU Z,FU Z.MapReduce as a Programming Model for Association Rules Algorithm on Hadoop[C]∥The 3rd International Conference on Information Sciences and Interaction Sciences.Chengdu,China,IEEE,2010:99-102.
[8] YE Y B,CHANG C C.A Parallel Apriori Algorithm for Frequent Itemsets Mining[C]∥Fourth International Conference Software Engineering Research,Managementand Applications.2006:87-94.
[9] KOVACS F,ILLES J.Frequent itemset mining on hadoop.Com-putational Cybernetics (ICCC)[C]∥The IEEE 9th Internatio-nal Conference.2013:241-245.
[10] YU K M,LEE M G,HUANG Y S,et al.An Efficient Frequent Patterns Mining Algorithm Based on MapReduce Framework[C]∥ International Conference on Software Intelligence Techno-logies and Applications & International Conference on Frontiers of Internet of Things.2014:1-5.
[11] LI L,ZHANG M.The Strategy of Mining Association RuleBased on Cloud Computing[C]∥The International Conference on Business Computing and Global Informatization.Washington,DC,USA,2011:475-478.
[12] LI N,ZENG L,HE Q,et al.Parallel Implementation of Apriori Algorithm Based on MapReduce[C]∥The 13th ACIS International Conference on Software Engineering,Networking and Parallel& Distributed Computing.Kyoto,2012:236-241.
[13] NING Y H.Study of Some Techniques in Data Mining Based on Spark[D].Hangzhou:China Jiliang University,2015.(in Chinese) 宁永恒.基于Spark的若干数据挖掘技术研究[D].杭州:中国计量学院,2015.
[14] LIU D D,CHEN J,LIANG F,et al.A Performance Analysis for Hadoop under Heterogeneous Cloud Computing Environments[J].Journal of Interation Technology,2012,1(4):46-51.(in Chinese) 刘丹丹,陈俊,梁锋,等.云计算异构环境下Hadoop性能分析[J].集成技术,2012,1(4):46-51.
[15] DONG X H,LI R X,ZHOU W W,et al.Performance Optimization and Feature Enhancements of Hadoop System[J].Journal of Computer Research and Development,2013,50(S2):1-15.(in Chinese) 董新华,李瑞轩,周湾湾,等.Hadoop系统性能优化与功能增强综述[J].计算机研究与发展,2013,50(S2):1-15.
[16] YING Y,LIU Y J.Survey of Developments of MapReduce Parallel Computing Technology[J].Computer System Application,2014,23(4):1-6.(in Chinese) 应毅,刘亚军.MapReduce并行计算技术发展综述[J].计算机系统应用,2014,23(4):1-6.
[17] KHARE N,ADLAKHA N,PARDASANI K R.An Algorithm Ior Mining Multidimensional Association Rules using Boolean Matrix[C]∥ 2010 International Conference in Recent Trends in Information,Telecommun ication and Computing.Kochi,Kerala,2010:95-99.
[18] LIU H Z,DAI S P,JIANG H.Quantitative association rulesmining algorithm based on matrix[C]∥2009 International Conference on Computational Intelligence and Software Enginee-ring.Wuhan,2009:1-4.
[19] TIAN S,WANG S,LIU Y,et al.An algorithm for mining frequent itemsets on uncertain dataset[J].Computer Modelling and New Technologies,2014,18(11):264-272.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!