Computer Science ›› 2023, Vol. 50 ›› Issue (11): 55-61.doi: 10.11896/jsjkx.221000011

• Database & Big Data & Data Science • Previous Articles     Next Articles

Parallel Mining Algorithm of Frequent Itemset Based on N-list and DiffNodeset Structure

ZHANG Yang1,2, WANG Rui3, WU Guanfeng1,2, LIU Hongyi1,2   

  1. 1 School of Mathematics,Southwest Jiaotong University,Chengdu 611756,China
    2 National-Local Joint Engineering Laboratory of System Credibility Automatic Verification,Southwest Jiaotong University,Chengdu 611756,China
    3 Aerospace Internet of Things Technology Co.,Ltd,Beijing 100094,China
  • Received:2022-10-07 Revised:2023-02-22 Online:2023-11-15 Published:2023-11-06
  • About author:ZHANG Yang,born in 1998,postgra-duate,is a member of China Computer Federation.His main research interests include data mining and pattern disco-very.WU Guanfeng,born in 1986,Ph.D,is a member of China Computer Federation.His main research interests include intelligent information processing and parallel computing.
  • Supported by:
    National Natural Science Foundation of China(62106206).

Abstract: Frequent itemset mining is a basic problem of data mining and plays an important role in many data mining applications.In order to solve the problems of the parallel frequent itemset mining algorithm(MrPrePost) in big data environment,such as algorithm efficiency degradation,unbalanced load of computing nodes and redundant search,this paper proposes a parallel frequent itemset mining algorithm(PFIMND),which is based on N-lists and DiffNodeset.Firstly,according to the advantages of N-list and DiffNodeset data structures,the data set sparsity estimation function(SE) is designed,and one of them is selected to store data according to the data set sparsity.Secondly,the computational estimation function(CE) is proposed to estimate the load of each item in the frequent 1-item set F-list,and the load is evenly grouped according to the computational cost.Finally,the set enumeration tree is used as the search space.In order to avoid combination explosion and redundant search problems,the superset pruning strategy and the pruning strategy based on width first searches are designed to generate the final mining results.Experimental results show that compared with the similar algorithm(HP-FIMND),the effect of PFIMND algorithm in mining frequent itemsets on Susy dataset is improved by 12.3%.

Key words: Frequent itemset, Load estimation, MapReduce, Sparse estimation, Set-enumeration tree

CLC Number: 

  • TP311
[1]AGRAWAL R,MANNILA H,SRIKANT R,et al.Fast disco-very of association rules[J].Advances in Knowledge Discovery and Data Mining,1996,12(1):307-328.
[2]LUNA J M,FOURNIER-VIGER P,VENTURA S.Frequentitemset mining:A 25 years review[J].Wiley Interdisciplinary Reviews:Data Mining and Knowledge Discovery,2019,9(6):e1329.
[3]HAN J,PEI J,YIN Y.Mining frequent patterns without candidate generation[J].ACM Sigmod Record,2000,29(2):1-12.
[4]ZAKI M J.Scalable algorithms for association mining[J].IEEE Transactions on Knowledge and Data Engineering,2000,12(3):372-390.
[5]DEAN J,GHEMAWAT S.MapReduce:simplified data proces-sing on large clusters[J].Communications of the ACM,2008,51(1):107-113.
[6]XIAO W,HU J.Paradigm and performance analysis of distributed frequent itemset mining algorithms based on MapReduce[J].Microprocessors and Microsystems,2021,82:103817.
[7]QIN J,HAO T S,DONG Q Q.Parallel improvement of Apriori algorithm based on MapReduce[J].Computer Technology and Development,2017,27(4):64-68.
[8]GUO F F,LIANG X,WANG H Q,et al.A parallel Apriori algorithm based on multi tree[J].Small Microcomputer System,2015,36(6):1176-1180.
[9]NADIMI-SHAHRAKI M H,MANSOURI M.Hp-Apriori:Ho-rizontal parallel-apriori algorithm for frequent itemset mining from big data[C]//2017 IEEE 2nd International Conference on Big Data Analysis(ICBDA).IEEE,2017:286-290.
[10]LI H,WANG Y,ZHANG D,et al.PFP:parallel FP-growth for query recommendation[C]//Proceedings of the 2008 ACM Conference on Recommender Systems.2008:107-114.
[11]ZHOU L,ZHONG Z,CHANG J,et al.Balanced parallel FP-growth with MapReduce[C]//2010 IEEE youth Conference on Information,Computing and Telecommunications.IEEE,2010:243-246.
[12]GAO Q,WAN X D.Parallel FP-growth algorithm based on load balance[J].Computer Engineering,2019,45(3):32-35,40.
[13]MOENS S,AKSEHIRLI E,GOETHALS B.Frequent itemsetmining for big data[C]//2013 IEEE International Conference on Big Data.IEEE,2013:111-118.
[14]ZHANG Z G,JI G L,TANG M M.A new algorithm for parallel mining frequent item sets mreclat[J].Computer Applications,2014,34(8):2175-2178.
[15]FENG X J,PAN X.Parallel Eclat algorithm based on spark[J].Computer Application Research,2019,36(1):18-21.
[16]LIAO J,ZHAO Y,LONG S.MRPrePost—A parallel algorithm adapted for mining big data[C]//2014 IEEE Workshop on Electronics,Computer and Applications.IEEE,2014:564-568.
[17]DENG Z H,WANG Z H,JIANG J J.A new algorithm for fast mining frequent itemsets using N-lists[J].Science China Information Sciences,2012,55(9):2008-2030.
[18]LIU W M,ZHANG C,MAO Y M.Hybrid parallel frequent itemset mining algorithm using n-list structure[J].Computer Science and Exploration,2022,16(1):120-136.
[19]MAO Y M,GENG J H,MWAKAPESA D S,et al.PFIMD:a parallel MapReduce-based algorithm for frequent itemset mining[J].Multimedia Systems,2021,27(4):709-722.
[20]DENG Z H.DiffNodesets:An efficient structure for fast mining frequent itemsets[J].Applied Soft Computing,2016,41:214-223.
[21]FOURNIER-VIGER P,GOMARIZ A,SOLTANI A,et al.SPMF:OpenSource data mining platform[EB/OL].http://www.philippe-fournier-viger.com/spmf.
[1] ZHAO Xuejian, ZHAO Ke. Bio-inspired Frequent Itemset Mining Strategy Based on Genetic Algorithm [J]. Computer Science, 2023, 50(11A): 220700200-8.
[2] LIU Wei-ming, AN Ran, MAO Yi-min. Parallel Support Vector Machine Algorithm Based on Clustering and WOA [J]. Computer Science, 2022, 49(7): 64-72.
[3] ZHANG Yuan-ming, YU Jia-rui, JIANG Jian-bo, LU Jia-wei, XIAO Gang. Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework [J]. Computer Science, 2021, 48(2): 41-46.
[4] WANG Tong, MA Wen-ping, LUO Wei. Information Sharing and Secure Multi-party Computing Model Based on Blockchain [J]. Computer Science, 2019, 46(9): 162-168.
[5] WANG Xiao-xia, SUN De-cai. Q-sample-based Local Similarity Join Parallel Algorithm [J]. Computer Science, 2019, 46(12): 38-44.
[6] HU Ying-shuang, LU Yi-hong. Cell Clustering Algorithm Based on MapReduce and Strongly Connected Fusion [J]. Computer Science, 2019, 46(11A): 204-207.
[7] QI Yu-dong,HE Cheng,SI Wei-chao. Cloud Resource Selection Algorithm by Skyline under MapReduce Frame [J]. Computer Science, 2018, 45(6A): 411-414.
[8] ZHANG Bin, LE Jia-jin. Hash Join in MapReduce Distributed Environment Based on Column-store [J]. Computer Science, 2018, 45(6A): 471-475.
[9] ZHOU Hua-ping, LIU Guang-zong and ZHANG Bei-bei. Load Balancing Strategy of MapReduce Clustering Based on Index Shift [J]. Computer Science, 2018, 45(5): 303-309.
[10] WANG Hua-jin, LI Jian-hui, SHEN Zhi-hong and ZHOU Yuan-chun. ORC Metadata Based Reducer Load Balancing Method for Hive Join Queries [J]. Computer Science, 2018, 45(3): 158-164.
[11] MIAO Feng-yu, WANG Hong-zhi, RUAN Qun-sheng. Method of Similarity Join on Uncertain Graphs Using MapReduce [J]. Computer Science, 2018, 45(12): 299-307.
[12] LI Guang-pu, HUANG Miao-hua. Research Progress and Mainstream Methods of Frequent Itemsets Mining [J]. Computer Science, 2018, 45(11A): 1-11.
[13] YING Yi, REN Kai, LIU Ya-jun. Network Log Analysis Technology Based on Big Data [J]. Computer Science, 2018, 45(11A): 353-355.
[14] DING Yong, ZHU Chang-shui, WU Yu-yan. Association Rule Mining Algorithm Based on Hadoop [J]. Computer Science, 2018, 45(11A): 409-411.
[15] SONG Zhe-li, WANG Chao, WANG Zhen-fei. Multi-level Feature Selection Mechanism Based on MapReduce [J]. Computer Science, 2018, 45(11A): 468-473.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!