计算机科学 ›› 2017, Vol. 44 ›› Issue (7): 262-266.doi: 10.11896/j.issn.1002-137X.2017.07.046

• 人工智能 • 上一篇    下一篇

基于Hadoop的Apriori改进算法研究

黄剑,李明奇,郭文强   

  1. 电子科技大学数学科学学院 成都611731,电子科技大学数学科学学院 成都611731,新疆财经大学计算机科学与工程学院 乌鲁木齐830012
  • 出版日期:2018-11-13 发布日期:2018-11-13
  • 基金资助:
    本文受国家自然科学基金(61163066)资助

Reseach on Improved Apriori Algorithm Based on Hadoop

HUANG Jian, LI Ming-qi and GUO Wen-qiang   

  • Online:2018-11-13 Published:2018-11-13

摘要: 对于规模庞大的事务数据库,传统的并行Apriori算法在挖掘中会在数据IO上有较大的时间开销。 从压缩事务、减少扫描次数、简化候选集生成3个方面对Apriori算法进行改进。提出了以元素“0”和“1”表示事务的布尔矩阵模型,并引入权值维度,压缩了相同事务的矩阵规模。同时,动态地进行剪枝,矩阵的“与”运算用于候选集合的生成。将改进后的算法在Hadoop框架上进行并行化实现,实验表明该算法适合大规模数据挖掘且具有良好的伸缩性与有效性。

关键词: Apriori算法,事务数据库,布尔矩阵,Hadoop

Abstract: Traditional mining based on parallel Apriori algorithms needs much more time in data IO with the increasing size of large transaction database.In this paper,we improved Apriori algorithm in three aspects:compression in the transaction,reducing the number of scanning areas,and simplifying the candidate set generation .We proposed “0” and “1” as the entries to describe the transaction Boolean matrix model,and introduced the weight dimensions to compress the matrix size of the transaction.Meanwhile,dynamic pruning matrix is adopted,and “and” operation of matrix is applied to generate a candidate set.The experiments of the improved algorithm running parallel in Hadoop framework show that the algorithm is suitable for large-scale data mining,and the algorithm has good scalability and effectiveness.

Key words: Apriori algorithm,Transaction database,Boolean matrix,Hadoop

[1] 董志安,纪希禹,韩秋明,等.数据挖掘技术与应用实例[M].北京:机械工业出版社,2009:12-48.
[2] XIE Y G,LIU R,QIAO R,et al.Resarch Literature on Big Data and Public Opinion[J].New media and society,2014(4):133-157.(in Chinese) 谢耘耕,刘锐,乔睿,等.大数据与社会舆情研究综述[J].新媒体与社会,2014(4):133-154.
[3] AGRAWAL R,SRIKANT R.Fast algorithm for mining association rules[C]∥Proceedings of 20th Int.Conf.Very Large Data Bases (VLDB).Morgan Kaufman Press,1994:487-499.
[4] REN W J,YU B W.Improved Apriori Algorithm Based on Ma-trix Reducation[J].Computer and Modern,2015,0:2-3.(in Chinese) 任伟建,于博文.基于矩阵简约的Apriori算法改进[J].计算机与现代化,2015,0:2-3.
[5] GUNARATHNE T,WU T L,QIU J,et al.MapReduce in the Clouds for Science[C]∥2010 IEEE Second International Con-ference on Cloud Computing Technology and Science(CloudCom).IEEE,2010:565-572.
[6] 陆嘉恒.Hadoop实战[M].北京:机械工业出版社,2011:17-128.
[7] DEAN J,GHEMAWAT S.MapReduce:simplified data proces-sing on large clusters [J].Communications of the ACM,2008,1(1):107-113.
[8] LI W J,ZHAO H,ZHANG Y,et al.Research on massive data mining technology based on MapReduce[J].Computer Engineering and Applications,2013(20):113-114.(in Chinese) 李伟卫,赵航,张阳,等.基于MapReduce的海量数据挖掘技术研究[J].计算机工程与应用,2013(20):113-114.
[9] CHANG F,DEAN J,GHEMAWAT S,et al.Bigtable:A distri-buted storage system for structured data [J].ACM Transactions on Computer Systems (TOCS),2008,6(2):4-10.
[10] LIU S J.Research of Frequent Itemsets Mining AlogorithmBased on MapReduce Calculation Model[D].Harbin:Harbin University of Science and Technology,2015:39-50.(in Chinese) 刘士佳.基于MapReduce框架的频繁项集挖掘算法研究[D].哈尔滨:哈尔滨理工大学,2015:39-50.
[11] KUANG S H,LI B.Analysis Cloud Computing Architecture and its Application[J].Computer&Digital Engineering,2010,3(6):61-63.(in Chinese) 框胜徽,李勃.云计算体系结构及应用实例分析[J].计算机与数字工程,2010,3(6):61-63.
[12] BONCHI F,LUCCHESE C.Pushing tougher constraints in fre-quent patternvmining[M]∥Advances in Knowledge Discovery and Data Mining.Springer Berlin Heidelberg,2005:114-124.
[13] BCHEUNG D W,HAN J,NG V T,et al.Maintenance of discovered association rules in largedatabases:An incremental updating technique[C]∥Proceedings of the Twelfth International Conference on Data Engineering,1996.IEEE,1996,6-114.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!