计算机科学 ›› 2018, Vol. 45 ›› Issue (11A): 409-411.

• 大数据与数据挖掘 • 上一篇    下一篇

一种基于Hadoop的关联规则挖掘算法

丁勇, 朱长水, 武玉艳   

  1. 南京理工大学泰州科技学院 江苏 泰州225300
  • 出版日期:2019-02-26 发布日期:2019-02-26
  • 作者简介:丁 勇(1980-),男,硕士,副教授,主要研究方向为数据库理论与数据挖掘,E-mail:4383526@qq.com;朱长水(1981-),男,硕士,副教授,主要研究方向为虚拟现实、图像处理,E-mail:shui_zc@163.com;武玉艳(1981-),女,硕士,助理研究员,主要研究方向为信息技术基础、教学管理,E-mail:1718427921@qq.com。
  • 基金资助:
    本文受2015江苏省高校自然科学研究面上项目(15KJB520016),2017年度江苏省高校“青蓝工程”资助。

Association Rule Mining Algorithm Based on Hadoop

DING Yong, ZHU Chang-shui, WU Yu-yan   

  1. Taizhou College of Science and Technology,Nanjing University of Science and Technology,Taizhou,Jiangsu 225300,China
  • Online:2019-02-26 Published:2019-02-26

摘要: 传统的并行关联规则算法对每一次迭代都定义一个MapReduce任务,以实现候选项集的生成和计数功能,但多次启动MapReduce任务会带来极大的性能开销。文中定义了一种并行关联规则挖掘算法PST-Apriori,该算法采取分治策略,在每个分布式计算节点定义一个前缀共享树,通过递归调用的方式将事务T生成的候选项集逐层压缩到前缀共享树(PST)中。然后广度遍历PST,逐层将每个节点对应的〈key,value〉作为map函数的输入,并由Map-Reduce框架自动按照key值进行聚集。最后调用reduce函数对多个任务的处理结果进行汇总,得到满足最小支持度阈值的频繁项集。算法只使用两个MapReduce任务,且PST按照key值排序便于Mapper端的shuffle操作,提高了运行效率。

关键词: Hadoop, MapReduce, 关联规则, 前缀共享树

Abstract: The traditional parallel association rule algorithm defines a MapReduce task for each iteration to implement the generation and counting function of the candidate set,but multiple startup of the MapReduce task brings great performance overhead.This paper defined a parallel association rule mining algorithm (PST-Apriori).This algorithm adopts a partition strategy,defines a prefix shared tree in each distributed computing node,and compresses the candidate items generated by each transaction T to the prefix shared tree (PST).Then the breadth traversal algorithm is used,and the 〈key,value〉 corresponding to each node are used as input of the map function,and the MapReduce frame is automatically gathered according to the key value.Finally,the reduce function is called to aggregate the processing results of multiple tasks,and the frequent itemsets satisfying the minimum support threshold are obtained.The algorithm only usestwo MapReduce tasks,and PST is sorted according to key value to facilitate shuffle operation at Mapper,which improves the efficiency of operation.

Key words: Association rule, Hadoop, MapReduce, Prefix shared tree

中图分类号: 

  • TP311
[1]AGRAWAL R,SRIKANT R.Fast algorithms for mining associa-tion rules(3rd ed)[M]∥Readings in Database Systems.Morgan Kaufmann Publishers Inc.,1998:2299-2308.
[2]HAN J,PEI J,YIN Y.Mining frequent patterns without candidate generation[C]∥ACM SIGMOD International Conference on Management of Data.ACM,2000:1-12.
[3]LI L,ZHANG M.The Strategy of Mining Association Rule Based on Cloud Computing[C]∥International Conference on Business Computing and Global Informatization.IEEE,2011:475-478.
[4]LI N,ZENG L,HE Q,et al.Parallel Implementation of Apriori Algorithm Based on MapReduce[C]∥Acis International Conference on Software Engineering,Artificial Intelligence,Networking and Parallel & Distributed Computing.IEEE,2012:236-241.
[5]ZHOU X,HUANG Y.An Improved Parallel Association Rules Algorithm Based on MapReduce Framework for Big Data[C]∥International Conference on Fuzzy Systems and Knowledge Discovery.IEEE,2014:284-288.
[6]郝天曙.基于Hadoop的并行数据挖掘的研究[D].南京:南京邮电大学,2017.
[7]张玲.基于Hadoop平台并行关联规则挖掘算法研究[D].西安:西安科技大学,2017.
[8]荀亚玲.集群环境下的关联规则挖掘及应用[D].太原:太原科技大学,2017.
[9]于跃.基于Hadoop平台的并行化分布式关联规则挖掘算法研究[D].吉林:吉林大学,2017.
[10]李若晨.基于并行的Apriori数据挖掘算法的研究[D].吉林:吉林大学,2017.
[11]叶璐.基于Spark的改进关联规则算法研究[D].太原:太原科技大学,2017.
[12]马连灯.基于HADOOP平台的并行关联规则算法研究[D].天津:天津工业大学,2017.
[1] 刘卫明, 安冉, 毛伊敏.
基于聚类和WOA的并行支持向量机算法
Parallel Support Vector Machine Algorithm Based on Clustering and WOA
计算机科学, 2022, 49(7): 64-72. https://doi.org/10.11896/jsjkx.210500040
[2] 曹扬晨, 朱国胜, 孙文和, 吴善超.
未知网络攻击识别关键技术研究
Study on Key Technologies of Unknown Network Attack Identification
计算机科学, 2022, 49(6A): 581-587. https://doi.org/10.11896/jsjkx.210400044
[3] 田冰川, 田臣, 周宇航, 陈贵海, 窦万春.
减少Hadoop集群中网络队头阻塞的调度算法
Reducing Head-of-Line Blocking on Network in Hadoop Clusters
计算机科学, 2022, 49(3): 11-22. https://doi.org/10.11896/jsjkx.210900117
[4] 徐慧慧, 晏华.
基于相对危险度的儿童先心病风险因素分析算法
Relative Risk Degree Based Risk Factor Analysis Algorithm for Congenital Heart Disease in Children
计算机科学, 2021, 48(6): 210-214. https://doi.org/10.11896/jsjkx.200500082
[5] 沈夏炯, 杨继勇, 张磊.
基于不相关属性集合的属性探索算法
Attribute Exploration Algorithm Based on Unrelated Attribute Set
计算机科学, 2021, 48(4): 54-62. https://doi.org/10.11896/jsjkx.200800082
[6] 张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚.
面向MapReduce的中间数据传输流水线优化机制
Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework
计算机科学, 2021, 48(2): 41-46. https://doi.org/10.11896/jsjkx.191000103
[7] 张素梅, 张波涛.
一种基于量子耗散粒子群的评估模型构建方法
Evaluation Model Construction Method Based on Quantum Dissipative Particle Swarm Optimization
计算机科学, 2020, 47(6A): 84-88. https://doi.org/10.11896/JsJkx.190900148
[8] 陈孟辉, 曹黔峰, 兰彦琦.
基于区块挖掘与重组的启发式算法求解置换流水车间调度问题
Heuristic Algorithm Based on Block Mining and Recombination for Permutation Flow-shop Scheduling Problem
计算机科学, 2020, 47(6A): 108-113. https://doi.org/10.11896/JsJkx.190300151
[9] 崔巍, 贾晓琳, 樊帅帅, 朱晓燕.
一种新的不均衡关联分类算法
New Associative Classification Algorithm for Imbalanced Data
计算机科学, 2020, 47(6A): 488-493. https://doi.org/10.11896/JsJkx.190600132
[10] 王青松, 姜富山, 李菲.
大数据环境下基于关联规则的多标签学习算法
Multi-label Learning Algorithm Based on Association Rules in Big Data Environment
计算机科学, 2020, 47(5): 90-95. https://doi.org/10.11896/jsjkx.190300150
[11] 朱岸青, 李帅, 唐晓东.
Spark平台中的并行化FP_growth关联规则挖掘方法
Parallel FP_growth Association Rules Mining Method on Spark Platform
计算机科学, 2020, 47(12): 139-143. https://doi.org/10.11896/jsjkx.191000110
[12] 王童, 马文平, 罗维.
基于区块链的信息共享及安全多方计算模型
Information Sharing and Secure Multi-party Computing Model Based on Blockchain
计算机科学, 2019, 46(9): 162-168. https://doi.org/10.11896/j.issn.1002-137X.2019.09.023
[13] 张蕾,蔡明.
基于主题融合和关联规则挖掘的图像标注
Image Annotation Based on Topic Fusion and Frequent Patterns Mining
计算机科学, 2019, 46(7): 246-251. https://doi.org/10.11896/j.issn.1002-137X.2019.07.037
[14] 张维国.
面向知识推荐服务的选课决策
Decision Making of Course Selection Oriented by Knowledge Recommendation Service
计算机科学, 2019, 46(6A): 507-510.
[15] 贾宁, 李瑛达.
基于智能可穿戴设备的个性化健康监管平台的构建
Construction of Personalized Health Monitoring Platform Based on Intelligent Wearable Device
计算机科学, 2019, 46(6A): 566-570.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!