Computer Science ›› 2019, Vol. 46 ›› Issue (12): 38-44.doi: 10.11896/jsjkx.190100240

• Big Data & Data Science • Previous Articles     Next Articles

Q-sample-based Local Similarity Join Parallel Algorithm

WANG Xiao-xia, SUN De-cai   

  1. (College of Information Science and Technology,Bohai University,Jinzhou,Liaoning 121013,China)
  • Received:2019-01-29 Online:2019-12-15 Published:2019-12-17

Abstract: Local similarity join can finds all local similar pairs from sets quickly,which is a basic operation in many areas,such as gene sequence alignment,near duplicate detection,data cleaning and so on.This paper focused on designing similarity join parallel algorithm with MapReduce,and proposed a Q-sample-based algorithm to solve the locating problem of local similarity join.The proposed algorithm employs a filter-verify based framework.In filter stage,a Q-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs,and then more dissimilar string pairs are discarded.In verify stage,the LS-Join’s backward-forward verification method is improved with the technique of removing redundant match,combining consecutive match and combining non-consecutive match.In the experiments,the performances of the proposed algorithm with different size of datasets or different value of edit distances are scaled.Experimental results show that the proposed algorithm outperforms the current excellent algorithm LS-Join on big dataset.Theoretical analysis and experimental result demonstrate that the performance of local similarity join is improved by using the techniques of the proposed algorithm.

Key words: Big data, Data cleaning, MapReduce, Q-sample, Similarity join

CLC Number: 

  • TP391
[1]SILVA Y N,PEARSON S S,CHON J,et al.Similarity Joins:Their implementation and interactions with other databaseopera-tors[J].Information Systems,2015,52(8/9):149-162.
[2]YU M,LI G,DENG D,et al.String similarity search and join:a survey[J].Frontiers of Computer Science,2016,10(3):399-417.
[3]PAGH R.Large-scale similarity joins with guarantees[C]//Proceedings of 18th International Conference on Database Theory.Brussels,Belgium:Schloss Dagstuhl,2015:15-24.
[4]PANG J,YU G,XU J,et al.Similarity Joins on Massive Data Based on MapReduce Framework[J].Computer Science,2015,42(1):1-5,27.
[5]LEVENSHTEIN V.Binary codes capable of correcting dele- tions,insertions,and reversals[J].Soviet Physics Doklady,1966,10(8):707-710.
[6]LIN C,YU H,WENG W,et al.Large-Scale Similarity Join with Edit-Distance Constraints[C]//Proceedings of 19th InternationalConference on Database Systems for Advanced Applications.Bali Indonesia:Springer,2014:328-342.
[7]DENG D,LI G L,HAO S,et al.MassJoin:A MapReduce-based Method for Scalable String Similarity Joins[C]//Proceedings of IEEE 30th International Conference on Data Engineering.New York,USA:IEEE,2014:340-351.
[8]CHEN G,YANG K,CHEN L,et al.Metric similarity joins using MapReduce[J].IEEE Transactions on Knowledge and Data Engineering,2017,29(3):656-669.
[9]WANG J Y,YANG X C,WANG B,et al.LS-Join:Local Similarity Join on String Collections[J].IEEE Transactions Know-ledge and Data Engineering,2017,29(9):1928-1942.
[10]BAYARDO R J,MA Y,SRIKANT R.Scaling up all pairs similarity search[C]//Proceedings of 16th International Conference on World Wide Web.New York,USA:ACM,2007:131-140.
[11]XIAO C,WANG W,LIN X.Ed-join:an efficient algorithm for similarity joins with edit distance constraints[J].Proceedings of the Vldb Endowment,2008,1(1):933-944.
[12]XIAO C,WANG W,LIN X,et al.Efficient similarity joins for near-duplicate detection[J].ACM Transactions on Database Systems,2011,36(3):1-41.
[13]LI G,DONG D,WANG J,et al.PASS-JOIN:A Partition-based Method for Similarity Joins[J].Proceedings of the Vldb Endowment,2011,5(3):253-264.
[14]WANG W,QIN J,XIAO C,et al.Vchunkjoin:An efficient algorithm for edit similarity joins[J].IEEE Transactions on Know-ledge and Data Engineering,2013,25(8):1916-1929.
[15]SHANG Z,LIU Y,LI G,et al.K-Join:Knowledge-Aware Similarity Join[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(12):3293-3308.
[16]VERNICA R,CAREY M J,LI C.Efficient parallel set-similarity joins using MapReduce[C]//Proceedings of 2010 ACM SIGMOD International Conference on Management of Data.Indianapolis,IN,USA:ACM,2010:495-506.
[17]METWALLY A,FALOUTSOS C.V-SMART-Join:A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors[J].Proceedings of the Vldb Endowment,2012,5(8):704-715.
[18]AFRATI F N,SARMA A D,MENESTRINA D,et al.Fuzzy Joins Using MapReduce[C]//Proceedings of IEEE 28th International Conference on Data Engineering.Washington,DC:IEEE,2012:498-509.
[19]MA Y,MENG X,WANG S.Parallel similarity joins on massive high-dimensional data using MapReduce[J].Concurrency and Computation:Practice and Experience,2016,28(1):166-183.
[20]RONG C T,LIN C B,SILVA Y N,et al.Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics[C]//Proceedings of IEEE 33rd International Conference on Data Engineering.New York,USA:IEEE,2017:1059-1070.
[1] CHEN Jing, WU Ling-ling. Mixed Attribute Feature Detection Method of Internet of Vehicles Big Datain Multi-source Heterogeneous Environment [J]. Computer Science, 2022, 49(8): 108-112.
[2] HE Qiang, YIN Zhen-yu, HUANG Min, WANG Xing-wei, WANG Yuan-tian, CUI Shuo, ZHAO Yong. Survey of Influence Analysis of Evolutionary Network Based on Big Data [J]. Computer Science, 2022, 49(8): 1-11.
[3] LIU Wei-ming, AN Ran, MAO Yi-min. Parallel Support Vector Machine Algorithm Based on Clustering and WOA [J]. Computer Science, 2022, 49(7): 64-72.
[4] WANG Mei-shan, YAO Lan, GAO Fu-xiang, XU Jun-can. Study on Differential Privacy Protection for Medical Set-Valued Data [J]. Computer Science, 2022, 49(4): 362-368.
[5] SUN Xuan, WANG Huan-xiao. Capability Building for Government Big Data Safety Protection:Discussions from Technologicaland Management Perspectives [J]. Computer Science, 2022, 49(4): 67-73.
[6] WANG Jun, WANG Xiu-lai, PANG Wei, ZHAO Hong-fei. Research on Big Data Governance for Science and Technology Forecast [J]. Computer Science, 2021, 48(9): 36-42.
[7] YU Yue-zhang, XIA Tian-yu, JING Yi-nan, HE Zhen-ying, WANG Xiao-yang. Smart Interactive Guide System for Big Data Analytics [J]. Computer Science, 2021, 48(9): 110-117.
[8] WANG Li-mei, ZHU Xu-guang, WANG De-jia, ZHANG Yong, XING Chun-xiao. Study on Judicial Data Classification Method Based on Natural Language Processing Technologies [J]. Computer Science, 2021, 48(8): 80-85.
[9] WANG Xue-cen, ZHANG Yu, LIU Ying-jie, YU Ge. Evaluation of Quality of Interaction in Online Learning Based on Representation Learning [J]. Computer Science, 2021, 48(2): 207-211.
[10] ZHANG Yuan-ming, YU Jia-rui, JIANG Jian-bo, LU Jia-wei, XIAO Gang. Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework [J]. Computer Science, 2021, 48(2): 41-46.
[11] TENG Jian, TENG Fei, LI Tian-rui. Travel Demand Forecasting Based on 3D Convolution and LSTM Encoder-Decoder [J]. Computer Science, 2021, 48(12): 195-203.
[12] ZHANG Yu-long, WANG Qiang, CHEN Ming-kang, SUN Jing-tao. Survey of Intelligent Rain Removal Algorithms for Cloud-IoT Systems [J]. Computer Science, 2021, 48(12): 231-242.
[13] LIU Ya-chen, HUANG Xue-ying. Research on Creep Feature Extraction and Early Warning Algorithm Based on Satellite MonitoringSpatial-Temporal Big Data [J]. Computer Science, 2021, 48(11A): 258-264.
[14] ZHANG Guang-jun, ZHANG Xiang. Mechanism and Path of Optimizing Institution of Legislative Evaluation by Applying “Big Data+Blockchain” [J]. Computer Science, 2021, 48(10): 324-333.
[15] YE Ya-zhen, LIU Guo-hua, ZHU Yang-yong. Two-step Authorization Pattern of Data Product Circulation [J]. Computer Science, 2021, 48(1): 119-124.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!