计算机科学 ›› 2023, Vol. 50 ›› Issue (11): 55-61.doi: 10.11896/jsjkx.221000011
张阳1,2, 王瑞3, 吴贯锋1,2, 刘弘毅1,2
ZHANG Yang1,2, WANG Rui3, WU Guanfeng1,2, LIU Hongyi1,2
摘要: 频繁项集挖掘是数据挖掘中的一个基本问题,在许多数据挖掘应用中发挥着重要作用。针对并行频繁项集挖掘算法MrPrePost在大数据环境存在密集数据集下算法效率下降、计算节点负载量不均衡和冗余搜索等问题,提出了基于N-lists和DiffNodeset两种结构的并行频繁项集挖掘算法(Parallel Mining algorithm of Frequent Itemset based on N-list and DiffNodeset structure,PFIMND)。首先,根据N-list和DiffNodeset在存储不同数据集上的优势,设计了稀疏度估计函数(Sparsity Estimation,SE),根据数据集稀疏程度灵活选取其中之一压缩数据集,相比采用单一存储结构消耗的内存更少;其次,提出了计算量估计函数(Computation Estimation,CE)来估计频繁1项集F-list中每一项的负载量,并根据计算量进行均匀分组;最后采用集合枚举树作为搜索空间,为避免组合爆炸和冗余搜索问题,设计了超集剪枝策略和基于宽度优先搜索的剪枝策略,生成最终的挖掘结果。实验结果表明,相比同类算法HP-FIMBN,PFIMND算法在Susy数据集上挖掘频繁项集的效果提升了12.3%。
中图分类号:
[1]AGRAWAL R,MANNILA H,SRIKANT R,et al.Fast disco-very of association rules[J].Advances in Knowledge Discovery and Data Mining,1996,12(1):307-328. [2]LUNA J M,FOURNIER-VIGER P,VENTURA S.Frequentitemset mining:A 25 years review[J].Wiley Interdisciplinary Reviews:Data Mining and Knowledge Discovery,2019,9(6):e1329. [3]HAN J,PEI J,YIN Y.Mining frequent patterns without candidate generation[J].ACM Sigmod Record,2000,29(2):1-12. [4]ZAKI M J.Scalable algorithms for association mining[J].IEEE Transactions on Knowledge and Data Engineering,2000,12(3):372-390. [5]DEAN J,GHEMAWAT S.MapReduce:simplified data proces-sing on large clusters[J].Communications of the ACM,2008,51(1):107-113. [6]XIAO W,HU J.Paradigm and performance analysis of distributed frequent itemset mining algorithms based on MapReduce[J].Microprocessors and Microsystems,2021,82:103817. [7]QIN J,HAO T S,DONG Q Q.Parallel improvement of Apriori algorithm based on MapReduce[J].Computer Technology and Development,2017,27(4):64-68. [8]GUO F F,LIANG X,WANG H Q,et al.A parallel Apriori algorithm based on multi tree[J].Small Microcomputer System,2015,36(6):1176-1180. [9]NADIMI-SHAHRAKI M H,MANSOURI M.Hp-Apriori:Ho-rizontal parallel-apriori algorithm for frequent itemset mining from big data[C]//2017 IEEE 2nd International Conference on Big Data Analysis(ICBDA).IEEE,2017:286-290. [10]LI H,WANG Y,ZHANG D,et al.PFP:parallel FP-growth for query recommendation[C]//Proceedings of the 2008 ACM Conference on Recommender Systems.2008:107-114. [11]ZHOU L,ZHONG Z,CHANG J,et al.Balanced parallel FP-growth with MapReduce[C]//2010 IEEE youth Conference on Information,Computing and Telecommunications.IEEE,2010:243-246. [12]GAO Q,WAN X D.Parallel FP-growth algorithm based on load balance[J].Computer Engineering,2019,45(3):32-35,40. [13]MOENS S,AKSEHIRLI E,GOETHALS B.Frequent itemsetmining for big data[C]//2013 IEEE International Conference on Big Data.IEEE,2013:111-118. [14]ZHANG Z G,JI G L,TANG M M.A new algorithm for parallel mining frequent item sets mreclat[J].Computer Applications,2014,34(8):2175-2178. [15]FENG X J,PAN X.Parallel Eclat algorithm based on spark[J].Computer Application Research,2019,36(1):18-21. [16]LIAO J,ZHAO Y,LONG S.MRPrePost—A parallel algorithm adapted for mining big data[C]//2014 IEEE Workshop on Electronics,Computer and Applications.IEEE,2014:564-568. [17]DENG Z H,WANG Z H,JIANG J J.A new algorithm for fast mining frequent itemsets using N-lists[J].Science China Information Sciences,2012,55(9):2008-2030. [18]LIU W M,ZHANG C,MAO Y M.Hybrid parallel frequent itemset mining algorithm using n-list structure[J].Computer Science and Exploration,2022,16(1):120-136. [19]MAO Y M,GENG J H,MWAKAPESA D S,et al.PFIMD:a parallel MapReduce-based algorithm for frequent itemset mining[J].Multimedia Systems,2021,27(4):709-722. [20]DENG Z H.DiffNodesets:An efficient structure for fast mining frequent itemsets[J].Applied Soft Computing,2016,41:214-223. [21]FOURNIER-VIGER P,GOMARIZ A,SOLTANI A,et al.SPMF:OpenSource data mining platform[EB/OL].http://www.philippe-fournier-viger.com/spmf. |
|