计算机科学 ›› 2020, Vol. 47 ›› Issue (12): 139-143.doi: 10.11896/jsjkx.191000110

• 数据库&大数据&数据科学 • 上一篇    下一篇

Spark平台中的并行化FP_growth关联规则挖掘方法

朱岸青1, 李帅2, 唐晓东3   

  1. 1 暨南大学管理学院 广州 510000
    2 北京航空航天大学计算机学院 北京 100191
    3 华南师范大学经济与管理学院 广州 510006
  • 收稿日期:2019-10-16 修回日期:2020-03-10 出版日期:2020-12-15 发布日期:2020-12-17
  • 通讯作者: 朱岸青(87651566@qq.com)
  • 基金资助:
    广州市专利技术产业化项目(201601010207);国家自然科学基金面上项目(61672077);国家重点研发计划(2017YFF0106407);2017国家自然科学基金青年基金项目(61702026)

Parallel FP_growth Association Rules Mining Method on Spark Platform

ZHU An-qing1, LI Shuai2, TANG Xiao-dong3   

  1. 1 School of Management Jinan University Guangzhou 510000,China
    2 School of Computer Science and Engineering Beihang University Beijing 100191,China
    3 School of Economics and Management South,China Normal University Guangzhou 510006,China
  • Received:2019-10-16 Revised:2020-03-10 Online:2020-12-15 Published:2020-12-17
  • About author:ZHU An-qing,born in 1976Ph.Dassociate professor.Her main research interests include internet of things and enterprise management.
  • Supported by:
    Guangzhou Patent Technology Industrialization Project(201601010207),General Project of National Natural Science Foundation of China(61672077),National Key R&D Program(2017YFF0106407) and 2017 National Natural Science Foundation Youth Fund Project(61702026).

摘要: 为了提高关联规则挖掘效率文中提出了一种适用于Spark平台的并行化FP_growth关联规则挖掘方法.首先利用Spark平台在分布式系统中的所有节点的内存RDD中完成遍历扫描运算得到频繁集以便生成FP_Table并更新FP_Tree.然后引入时间序列来预测待挖掘的项目集以便实现分布式系统中的所有节点能够均衡分担挖掘任务从而充分利用各节点的FP_Tree遍历功能获取FP_growth关联规则挖掘结果.实验结果显示相比单机情况并行化FP_growth关联规则挖掘在效率方面提高了约60%.经过负载均衡处理后的FP_growth关联规则挖掘的效率更高提高了约14%这说明各节点遍历任务的分配更均衡并行化程度更高.

关键词: FP_growth算法, Spark平台, 负载均衡, 关联规则挖掘, 频繁集

Abstract: In order to improve the efficiency of association rule mininga parallel FP_growth association rule mining method suitable for spark platform is proposed.Firstthe Spark platform is used to complete the traversal scan operation in the memory RDD of all nodes of the distributed system to obtain frequent sets in order to generate FP_Table and update FP_Tree.Thenthe time series is introduced to predict the itemsets to be minedso that all nodes in the distributed system can share the mining tasks in a balanced mannerso as to make full use of the traversal FP_Tree calculation function of each node to obtain the FP_growth association rule mining results.The experimental results show that compared to the single machine casethe parallelized FP_growth association rule mining improves the efficiency by about 60%.After the load balancing processthe mining efficiency of the FP_growth association rule is higherincreasing by about 14%which indicates that the traversal task allocation of each node is more balanced and the degree of parallelism is higher.

Key words: Association rules mining, FP_growth algorithm, Frequent sets, Load balancing, Spark platform

中图分类号: 

  • TP311.13
[1] BELALEM G,ABBACHE A,BELKREDIM F Z,et al.Arabic Query Expansion Using WordNet and Association Rules[J].International Journal of Intelligent Information Technologies,2016,12(3):51-64.
[2] MAI T,VO B,NGUYEN L T T.A lattice-based approach for mining high utility association rules[J].Information Sciences,2017,399:81-97.
[3] YU B,LIU S Q.An improved association rule mining algorithm based on FP-growth algorithm[J].Computer and Network,2017,43(14):68-71.
[4] LIN W T,CHU C P.Determining the appropriate number of nodes for fast mining of frequent patterns in distributed computing environments[J].Parallel Algorithms &Applications,2014,30(5):1-13.
[5] SHAO X Y,ZHANG L.An improved parallel algorithm for FP-Growth association rules based on Hadoop[J].Computer Applied Research,2018,35 (1).
[6] DIVYAVARMA K,REMYA M,DEEPA G.An Enhanced Bug Mining for Identifying Frequent Bug Pattern Using Word Tokenizer and FP-Growth[M]//Proceedings of the 5th International Conference on Frontiers in Intelligent Computing:Theory and Applications.2017:525-532.
[7] MOHAMED Y S,NAJIB M,ABDELAZIZ E,et al.APRICOIN:An adaptive approach for prioritizing high-risk containers inspections[J].IEEE Access,2017,5(99):18238-18249.
[8] XU F,LU H.The Application of FP-Growth Algorithm Based on Distributed Intelligence in Wisdom Medical Treatment[J].International Journal of Pattern Recognition &Artificial Intelligence,2017,31(4):232-237.
[9] CHEN J G,LI K L,MEMBER S,et al.A Parallel Random Fo-rest Algorithm for Big Data in a Spark Cloud Computing Environment[J].IEEE Transactions on Parallel &Distributed Systems,2017,28(4):919-933.
[10] SHI W,ZHU Y,YU P S,et al.Effective Prediction of Missing Data on Apache Spark over Multivariable Time Series[J].IEEE Transactions on Big Data,2018,4(4):473-486.
[11] ZHANG L Z,CUI Y,LUO G C,et al.Dynamic Load Balancing Algorithms for Large Data Distributed Storage[J].Computer Science,2017,44(5):178-183.
[12] LI Z Y,YU J,BIAN C,et al.Data flow dynamic load balancing strategy based on load perception[J].Computer Applications,2017,37(10):2760-2766.
[13] LUO J,LEI R,XUE L.Spatio-Temporal Load Balancing for Ener-gy Cost Optimization in Distributed Internet Data Centers[J].IEEE Transactions on Cloud Computing,2017,3(3):387-397.
[14] LI X,LI T.Design and implementation of recommendation system based on Spark[J].Computer Technology and development,2018,28(10):201-205.
[15] MESTRE D G,PIRES C E S,NASCIMENTOD C,et al.An efficient spark-based adaptive windowing for entity matching[J].Journal of Systems &Software,2017,128:1-10.
[16] CHEN J G,LI K L,MEMBER S,et al.A Parallel Random Fo-rest Algorithm for Big Data in a Spark Cloud Computing Environment[J].IEEE Transactions on Parallel &Distributed Systems,2017,28(4):919-933.
[17] YAN Y L,CHEN M,SADIQ S,et al.Efficient Imbalanced Multimedia Concept Retrieval by Deep Learning on Spark Clusters[J].International Journal of Multimedia Data Engineering &Management,2017,8(1):1-20.
[18] CAO N,WANG C,LI M,et al.Privacy-Preserving Multi-Keyword Ranked Search over Encrypted Cloud Data[J].IEEE Transactions on Parallel &Distributed Systems,2014,25(1):222-233.
[19] HU J,PAN H A.Improved incremental updating algorithm of association rules[J].Journal of Chongqing University of Posts and Telecommunications(Natural Science Edition),2018,30(4):558-563.
[1] 田真真, 蒋维, 郑炳旭, 孟利民.
基于服务器集群的负载均衡优化调度算法
Load Balancing Optimization Scheduling Algorithm Based on Server Cluster
计算机科学, 2022, 49(6A): 639-644. https://doi.org/10.11896/jsjkx.210800071
[2] 高捷, 刘沙, 黄则强, 郑天宇, 刘鑫, 漆锋滨.
基于国产众核处理器的深度神经网络算子加速库优化
Deep Neural Network Operator Acceleration Library Optimization Based on Domestic Many-core Processor
计算机科学, 2022, 49(5): 355-362. https://doi.org/10.11896/jsjkx.210500226
[3] 谭双杰, 林宝军, 刘迎春, 赵帅.
基于机器学习的分布式星载RTs系统负载调度算法
Load Scheduling Algorithm for Distributed On-board RTs System Based on Machine Learning
计算机科学, 2022, 49(2): 336-341. https://doi.org/10.11896/jsjkx.201200126
[4] 夏中, 向敏, 黄春梅.
基于CHBL的P2P视频监控网络分层管理机制
Hierarchical Management Mechanism of P2P Video Surveillance Network Based on CHBL
计算机科学, 2021, 48(9): 278-285. https://doi.org/10.11896/jsjkx.201200056
[5] 白勇, 张占龙, 熊隽迪.
基于FP-Growth算法和GRNN的电力知识文本挖掘
Power Knowledge Text Mining Based on FP-Growth Algorithm and GRNN
计算机科学, 2021, 48(8): 86-90. https://doi.org/10.11896/jsjkx.210600031
[6] 宋海宁, 焦健, 刘永.
高速公路中的移动边缘计算研究
Research on Mobile Edge Computing in Expressway
计算机科学, 2021, 48(6A): 383-386. https://doi.org/10.11896/jsjkx.200900212
[7] 王政, 姜春茂.
一种基于三支决策的云任务调度优化算法
Cloud Task Scheduling Algorithm Based on Three-way Decisions
计算机科学, 2021, 48(6A): 420-426. https://doi.org/10.11896/jsjkx.201000023
[8] 郑增乾, 王锟, 赵涛, 蒋维, 孟利民.
带宽和时延受限的流媒体服务器集群负载均衡机制
Load Balancing Mechanism for Bandwidth and Time-delay Constrained Streaming Media Server Cluster
计算机科学, 2021, 48(6): 261-267. https://doi.org/10.11896/jsjkx.200400131
[9] 姚泽玮, 林嘉雯, 胡俊钦, 陈星.
基于PSO-GA的多边缘负载均衡方法
PSO-GA Based Approach to Multi-edge Load Balancing
计算机科学, 2021, 48(11A): 456-463. https://doi.org/10.11896/jsjkx.210100191
[10] 杨紫淇, 蔡英, 张皓晨, 范艳芳.
基于负载均衡的VEC服务器联合计算任务卸载方案
Computational Task Offloading Scheme Based on Load Balance for Cooperative VEC Servers
计算机科学, 2021, 48(1): 81-88. https://doi.org/10.11896/jsjkx.200800220
[11] 郭飞雁, 唐兵.
基于用户延迟感知的移动边缘服务器放置方法
Mobile Edge Server Placement Method Based on User Latency-aware
计算机科学, 2021, 48(1): 103-110. https://doi.org/10.11896/jsjkx.200900146
[12] 王国澎, 杨剑新, 尹飞, 蒋生健.
负载均衡的处理器运算资源分配方法
Computing Resources Allocation with Load Balance in Modern Processor
计算机科学, 2020, 47(8): 41-48. https://doi.org/10.11896/jsjkx.191000148
[13] 金琪, 王俊昌, 付雄.
基于智能放置策略的Cuckoo哈希表
Cuckoo Hash Table Based on Smart Placement Strategy
计算机科学, 2020, 47(8): 80-86. https://doi.org/10.11896/jsjkx.191200109
[14] 高子妍, 王勇.
面向云服务的分布式消息系统负载均衡策略
Load Balancing Strategy of Distributed Messaging System for Cloud Services
计算机科学, 2020, 47(6A): 318-324. https://doi.org/10.11896/JsJkx.191100012
[15] 黄梅根, 汪涛, 刘亮, 庞瑞琴, 杜欢.
基于软件定义网络资源优化的虚拟网络功能部署策略
Virtual Network Function Deployment Strategy Based on Software Defined Network Resource Optimization
计算机科学, 2020, 47(6A): 404-408. https://doi.org/10.11896/JsJkx.191000116
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!