计算机科学 ›› 2019, Vol. 46 ›› Issue (12): 38-44.doi: 10.11896/jsjkx.190100240

• 大数据与数据科学 • 上一篇    下一篇

一种基于Q-sample的局部相似连接并行算法

王晓霞, 孙德才   

  1. (渤海大学信息科学与技术学院 辽宁 锦州121013)
  • 收稿日期:2019-01-29 出版日期:2019-12-15 发布日期:2019-12-17
  • 通讯作者: 王晓霞(1977-),女,硕士,讲师,主要研究方向为大数据、相似连接、入侵检测等,E-mail:wxxsdc@163.com。
  • 作者简介:孙德才(1979-),男,博士,副教授,主要研究方向为大数据、相似连接、近似串匹配等。
  • 基金资助:
    本文受教育部人文社会科学研究青年基金项目(15YJC870021),国家自然科学基金青年基金项目(61602056),国家社会科学基金项目(19BTQ028),辽宁省自然科学基金(20170540015),辽宁省社会科学基金(L18AXW001),辽宁省教育厅科学研究项目(L2015010)资助。

Q-sample-based Local Similarity Join Parallel Algorithm

WANG Xiao-xia, SUN De-cai   

  1. (College of Information Science and Technology,Bohai University,Jinzhou,Liaoning 121013,China)
  • Received:2019-01-29 Online:2019-12-15 Published:2019-12-17

摘要: 局部相似连接能快速找出数据集间的局部相似记录对,是基因序列比对、剽窃检测和数据清洗等研究领域的基本操作。文中主要研究基于MapReduce框架的并行相似连接技术,提出了一种基于Q-sample的局部相似连接算法,解决了局部相似连接的定位问题。该算法采用了过滤验证二阶段模式:在过滤阶段,所提算法使用Q-sample分割方案拆分字符串集,在不丢失任何匹配的基础上生成了高质量的子串,抛弃了大量的无关字符串对;在验证阶段,所提算法优化了LS-Join算法的双向扩展验证方法,通过去除冗余匹配、合并连续匹配和合并非连续匹配等技术提高了算法的验证效率。通过实验对比了不同数据集和编辑距离参数下算法的性能表现,结果显示所提算法在大数据集上的局部相似连接速度快于当前的优秀算法LS-Join。理论分析和实验结果证明,所提算法的相关技术提高了局部相似的连接性能。

关键词: MapReduce, Q-sample, 大数据, 数据清洗, 相似连接

Abstract: Local similarity join can finds all local similar pairs from sets quickly,which is a basic operation in many areas,such as gene sequence alignment,near duplicate detection,data cleaning and so on.This paper focused on designing similarity join parallel algorithm with MapReduce,and proposed a Q-sample-based algorithm to solve the locating problem of local similarity join.The proposed algorithm employs a filter-verify based framework.In filter stage,a Q-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs,and then more dissimilar string pairs are discarded.In verify stage,the LS-Join’s backward-forward verification method is improved with the technique of removing redundant match,combining consecutive match and combining non-consecutive match.In the experiments,the performances of the proposed algorithm with different size of datasets or different value of edit distances are scaled.Experimental results show that the proposed algorithm outperforms the current excellent algorithm LS-Join on big dataset.Theoretical analysis and experimental result demonstrate that the performance of local similarity join is improved by using the techniques of the proposed algorithm.

Key words: Big data, Data cleaning, MapReduce, Q-sample, Similarity join

中图分类号: 

  • TP391
[1]SILVA Y N,PEARSON S S,CHON J,et al.Similarity Joins:Their implementation and interactions with other databaseopera-tors[J].Information Systems,2015,52(8/9):149-162.
[2]YU M,LI G,DENG D,et al.String similarity search and join:a survey[J].Frontiers of Computer Science,2016,10(3):399-417.
[3]PAGH R.Large-scale similarity joins with guarantees[C]//Proceedings of 18th International Conference on Database Theory.Brussels,Belgium:Schloss Dagstuhl,2015:15-24.
[4]PANG J,YU G,XU J,et al.Similarity Joins on Massive Data Based on MapReduce Framework[J].Computer Science,2015,42(1):1-5,27.
[5]LEVENSHTEIN V.Binary codes capable of correcting dele- tions,insertions,and reversals[J].Soviet Physics Doklady,1966,10(8):707-710.
[6]LIN C,YU H,WENG W,et al.Large-Scale Similarity Join with Edit-Distance Constraints[C]//Proceedings of 19th InternationalConference on Database Systems for Advanced Applications.Bali Indonesia:Springer,2014:328-342.
[7]DENG D,LI G L,HAO S,et al.MassJoin:A MapReduce-based Method for Scalable String Similarity Joins[C]//Proceedings of IEEE 30th International Conference on Data Engineering.New York,USA:IEEE,2014:340-351.
[8]CHEN G,YANG K,CHEN L,et al.Metric similarity joins using MapReduce[J].IEEE Transactions on Knowledge and Data Engineering,2017,29(3):656-669.
[9]WANG J Y,YANG X C,WANG B,et al.LS-Join:Local Similarity Join on String Collections[J].IEEE Transactions Know-ledge and Data Engineering,2017,29(9):1928-1942.
[10]BAYARDO R J,MA Y,SRIKANT R.Scaling up all pairs similarity search[C]//Proceedings of 16th International Conference on World Wide Web.New York,USA:ACM,2007:131-140.
[11]XIAO C,WANG W,LIN X.Ed-join:an efficient algorithm for similarity joins with edit distance constraints[J].Proceedings of the Vldb Endowment,2008,1(1):933-944.
[12]XIAO C,WANG W,LIN X,et al.Efficient similarity joins for near-duplicate detection[J].ACM Transactions on Database Systems,2011,36(3):1-41.
[13]LI G,DONG D,WANG J,et al.PASS-JOIN:A Partition-based Method for Similarity Joins[J].Proceedings of the Vldb Endowment,2011,5(3):253-264.
[14]WANG W,QIN J,XIAO C,et al.Vchunkjoin:An efficient algorithm for edit similarity joins[J].IEEE Transactions on Know-ledge and Data Engineering,2013,25(8):1916-1929.
[15]SHANG Z,LIU Y,LI G,et al.K-Join:Knowledge-Aware Similarity Join[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(12):3293-3308.
[16]VERNICA R,CAREY M J,LI C.Efficient parallel set-similarity joins using MapReduce[C]//Proceedings of 2010 ACM SIGMOD International Conference on Management of Data.Indianapolis,IN,USA:ACM,2010:495-506.
[17]METWALLY A,FALOUTSOS C.V-SMART-Join:A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors[J].Proceedings of the Vldb Endowment,2012,5(8):704-715.
[18]AFRATI F N,SARMA A D,MENESTRINA D,et al.Fuzzy Joins Using MapReduce[C]//Proceedings of IEEE 28th International Conference on Data Engineering.Washington,DC:IEEE,2012:498-509.
[19]MA Y,MENG X,WANG S.Parallel similarity joins on massive high-dimensional data using MapReduce[J].Concurrency and Computation:Practice and Experience,2016,28(1):166-183.
[20]RONG C T,LIN C B,SILVA Y N,et al.Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics[C]//Proceedings of IEEE 33rd International Conference on Data Engineering.New York,USA:IEEE,2017:1059-1070.
[1] 陈晶, 吴玲玲.
多源异构环境下的车联网大数据混合属性特征检测方法
Mixed Attribute Feature Detection Method of Internet of Vehicles Big Datain Multi-source Heterogeneous Environment
计算机科学, 2022, 49(8): 108-112. https://doi.org/10.11896/jsjkx.220300273
[2] 何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇.
基于大数据的进化网络影响力分析研究综述
Survey of Influence Analysis of Evolutionary Network Based on Big Data
计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240
[3] 刘卫明, 安冉, 毛伊敏.
基于聚类和WOA的并行支持向量机算法
Parallel Support Vector Machine Algorithm Based on Clustering and WOA
计算机科学, 2022, 49(7): 64-72. https://doi.org/10.11896/jsjkx.210500040
[4] 孙轩, 王焕骁.
政务大数据安全防护能力建设:基于技术和管理视角的探讨
Capability Building for Government Big Data Safety Protection:Discussions from Technologicaland Management Perspectives
计算机科学, 2022, 49(4): 67-73. https://doi.org/10.11896/jsjkx.211000010
[5] 王美珊, 姚兰, 高福祥, 徐军灿.
面向医疗集值数据的差分隐私保护技术研究
Study on Differential Privacy Protection for Medical Set-Valued Data
计算机科学, 2022, 49(4): 362-368. https://doi.org/10.11896/jsjkx.210300032
[6] 王俊, 王修来, 庞威, 赵鸿飞.
面向科技前瞻预测的大数据治理研究
Research on Big Data Governance for Science and Technology Forecast
计算机科学, 2021, 48(9): 36-42. https://doi.org/10.11896/jsjkx.210500207
[7] 余乐章, 夏天宇, 荆一楠, 何震瀛, 王晓阳.
面向大数据分析的智能交互向导系统
Smart Interactive Guide System for Big Data Analytics
计算机科学, 2021, 48(9): 110-117. https://doi.org/10.11896/jsjkx.200900083
[8] 王立梅, 朱旭光, 汪德嘉, 张勇, 邢春晓.
基于深度学习的民事案件判决结果分类方法研究
Study on Judicial Data Classification Method Based on Natural Language Processing Technologies
计算机科学, 2021, 48(8): 80-85. https://doi.org/10.11896/jsjkx.210300130
[9] 王雪岑, 张昱, 刘迎婕, 于戈.
基于表示学习的在线学习交互质量评价方法
Evaluation of Quality of Interaction in Online Learning Based on Representation Learning
计算机科学, 2021, 48(2): 207-211. https://doi.org/10.11896/jsjkx.201000042
[10] 张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚.
面向MapReduce的中间数据传输流水线优化机制
Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework
计算机科学, 2021, 48(2): 41-46. https://doi.org/10.11896/jsjkx.191000103
[11] 滕建, 滕飞, 李天瑞.
基于3D卷积和LSTM编码解码的出行需求预测
Travel Demand Forecasting Based on 3D Convolution and LSTM Encoder-Decoder
计算机科学, 2021, 48(12): 195-203. https://doi.org/10.11896/jsjkx.210400022
[12] 张育龙, 王强, 陈明康, 孙静涛.
图像去雨算法在云物联网应用中的研究综述
Survey of Intelligent Rain Removal Algorithms for Cloud-IoT Systems
计算机科学, 2021, 48(12): 231-242. https://doi.org/10.11896/jsjkx.201000055
[13] 曹萌, 于洋, 梁英, 史红周.
基于区块链的大数据交易关键技术与发展趋势
Key Technologies and Development Trends of Big Data Trade Based on Blockchain
计算机科学, 2021, 48(11A): 184-190. https://doi.org/10.11896/jsjkx.210100163
[14] 刘亚臣, 黄雪莹.
卫星监测时空大数据蠕变特征提取及预警算法
Research on Creep Feature Extraction and Early Warning Algorithm Based on Satellite MonitoringSpatial-Temporal Big Data
计算机科学, 2021, 48(11A): 258-264. https://doi.org/10.11896/jsjkx.201000071
[15] 张光君, 张翔.
应用“大数据+区块链”优化立法评估制度的机理与路径
Mechanism and Path of Optimizing Institution of Legislative Evaluation by Applying “Big Data+Blockchain”
计算机科学, 2021, 48(10): 324-333. https://doi.org/10.11896/jsjkx.201200105
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!