计算机科学 ›› 2019, Vol. 46 ›› Issue (12): 38-44.doi: 10.11896/jsjkx.190100240

• 大数据与数据科学 • 上一篇    下一篇

一种基于Q-sample的局部相似连接并行算法

王晓霞, 孙德才   

  1. (渤海大学信息科学与技术学院 辽宁 锦州121013)
  • 收稿日期:2019-01-29 出版日期:2019-12-15 发布日期:2019-12-17
  • 通讯作者: 王晓霞(1977-),女,硕士,讲师,主要研究方向为大数据、相似连接、入侵检测等,E-mail:wxxsdc@163.com。
  • 作者简介:孙德才(1979-),男,博士,副教授,主要研究方向为大数据、相似连接、近似串匹配等。
  • 基金资助:
    本文受教育部人文社会科学研究青年基金项目(15YJC870021),国家自然科学基金青年基金项目(61602056),国家社会科学基金项目(19BTQ028),辽宁省自然科学基金(20170540015),辽宁省社会科学基金(L18AXW001),辽宁省教育厅科学研究项目(L2015010)资助。

Q-sample-based Local Similarity Join Parallel Algorithm

WANG Xiao-xia, SUN De-cai   

  1. (College of Information Science and Technology,Bohai University,Jinzhou,Liaoning 121013,China)
  • Received:2019-01-29 Online:2019-12-15 Published:2019-12-17

摘要: 局部相似连接能快速找出数据集间的局部相似记录对,是基因序列比对、剽窃检测和数据清洗等研究领域的基本操作。文中主要研究基于MapReduce框架的并行相似连接技术,提出了一种基于Q-sample的局部相似连接算法,解决了局部相似连接的定位问题。该算法采用了过滤验证二阶段模式:在过滤阶段,所提算法使用Q-sample分割方案拆分字符串集,在不丢失任何匹配的基础上生成了高质量的子串,抛弃了大量的无关字符串对;在验证阶段,所提算法优化了LS-Join算法的双向扩展验证方法,通过去除冗余匹配、合并连续匹配和合并非连续匹配等技术提高了算法的验证效率。通过实验对比了不同数据集和编辑距离参数下算法的性能表现,结果显示所提算法在大数据集上的局部相似连接速度快于当前的优秀算法LS-Join。理论分析和实验结果证明,所提算法的相关技术提高了局部相似的连接性能。

关键词: 相似连接, Q-sample, MapReduce, 数据清洗, 大数据

Abstract: Local similarity join can finds all local similar pairs from sets quickly,which is a basic operation in many areas,such as gene sequence alignment,near duplicate detection,data cleaning and so on.This paper focused on designing similarity join parallel algorithm with MapReduce,and proposed a Q-sample-based algorithm to solve the locating problem of local similarity join.The proposed algorithm employs a filter-verify based framework.In filter stage,a Q-sample partition scheme is adopted to generate high-quality signatures without losing any true pairs,and then more dissimilar string pairs are discarded.In verify stage,the LS-Join’s backward-forward verification method is improved with the technique of removing redundant match,combining consecutive match and combining non-consecutive match.In the experiments,the performances of the proposed algorithm with different size of datasets or different value of edit distances are scaled.Experimental results show that the proposed algorithm outperforms the current excellent algorithm LS-Join on big dataset.Theoretical analysis and experimental result demonstrate that the performance of local similarity join is improved by using the techniques of the proposed algorithm.

Key words: Similarity join, Q-sample, MapReduce, Data cleaning, Big data

中图分类号: 

  • TP391
[1] SILVA Y N,PEARSON S S,CHON J,et al.Similarity Joins:Their implementation and interactions with other databaseopera-tors[J].Information Systems,2015,52(8/9):149-162.
[2] YU M,LI G,DENG D,et al.String similarity search and join:a survey[J].Frontiers of Computer Science,2016,10(3):399-417.
[3] PAGH R.Large-scale similarity joins with guarantees[C]//Proceedings of 18th International Conference on Database Theory.Brussels,Belgium:Schloss Dagstuhl,2015:15-24.
[4] PANG J,YU G,XU J,et al.Similarity Joins on Massive Data Based on MapReduce Framework[J].Computer Science,2015,42(1):1-5,27.
[5] LEVENSHTEIN V.Binary codes capable of correcting dele- tions,insertions,and reversals[J].Soviet Physics Doklady,1966,10(8):707-710.
[6] LIN C,YU H,WENG W,et al.Large-Scale Similarity Join with Edit-Distance Constraints[C]//Proceedings of 19th InternationalConference on Database Systems for Advanced Applications.Bali Indonesia:Springer,2014:328-342.
[7] DENG D,LI G L,HAO S,et al.MassJoin:A MapReduce-based Method for Scalable String Similarity Joins[C]//Proceedings of IEEE 30th International Conference on Data Engineering.New York,USA:IEEE,2014:340-351.
[8] CHEN G,YANG K,CHEN L,et al.Metric similarity joins using MapReduce[J].IEEE Transactions on Knowledge and Data Engineering,2017,29(3):656-669.
[9] WANG J Y,YANG X C,WANG B,et al.LS-Join:Local Similarity Join on String Collections[J].IEEE Transactions Know-ledge and Data Engineering,2017,29(9):1928-1942.
[10] BAYARDO R J,MA Y,SRIKANT R.Scaling up all pairs similarity search[C]//Proceedings of 16th International Conference on World Wide Web.New York,USA:ACM,2007:131-140.
[11] XIAO C,WANG W,LIN X.Ed-join:an efficient algorithm for similarity joins with edit distance constraints[J].Proceedings of the Vldb Endowment,2008,1(1):933-944.
[12] XIAO C,WANG W,LIN X,et al.Efficient similarity joins for near-duplicate detection[J].ACM Transactions on Database Systems,2011,36(3):1-41.
[13] LI G,DONG D,WANG J,et al.PASS-JOIN:A Partition-based Method for Similarity Joins[J].Proceedings of the Vldb Endowment,2011,5(3):253-264.
[14] WANG W,QIN J,XIAO C,et al.Vchunkjoin:An efficient algorithm for edit similarity joins[J].IEEE Transactions on Know-ledge and Data Engineering,2013,25(8):1916-1929.
[15] SHANG Z,LIU Y,LI G,et al.K-Join:Knowledge-Aware Similarity Join[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(12):3293-3308.
[16] VERNICA R,CAREY M J,LI C.Efficient parallel set-similarity joins using MapReduce[C]//Proceedings of 2010 ACM SIGMOD International Conference on Management of Data.Indianapolis,IN,USA:ACM,2010:495-506.
[17] METWALLY A,FALOUTSOS C.V-SMART-Join:A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors[J].Proceedings of the Vldb Endowment,2012,5(8):704-715.
[18] AFRATI F N,SARMA A D,MENESTRINA D,et al.Fuzzy Joins Using MapReduce[C]//Proceedings of IEEE 28th International Conference on Data Engineering.Washington,DC:IEEE,2012:498-509.
[19] MA Y,MENG X,WANG S.Parallel similarity joins on massive high-dimensional data using MapReduce[J].Concurrency and Computation:Practice and Experience,2016,28(1):166-183.
[20] RONG C T,LIN C B,SILVA Y N,et al.Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics[C]//Proceedings of IEEE 33rd International Conference on Data Engineering.New York,USA:IEEE,2017:1059-1070.
[1] 叶雅珍, 刘国华, 朱扬勇. 数据产品流通的两阶段授权模式[J]. 计算机科学, 2021, 48(1): 119-124.
[2] 赵会群, 吴凯锋. 一种大数据估价算法[J]. 计算机科学, 2020, 47(9): 110-116.
[3] 马梦宇, 吴烨, 陈荦, 伍江江, 李军, 景宁. 显示导向型的大规模地理矢量实时可视化技术[J]. 计算机科学, 2020, 47(9): 117-122.
[4] 刘振鹏, 苏楠, 秦益文, 卢家欢, 李小菲. FS-CRF:基于特征切分与级联随机森林的异常点检测模型[J]. 计算机科学, 2020, 47(8): 185-188.
[5] 朝乐门. 数据科学导论的课程设计及教学改革[J]. 计算机科学, 2020, 47(7): 1-7.
[6] 顾荣杰, 吴治平, 石焕. 基于TFR 模型的公安云平台数据分级分类安全访问控制模型研究[J]. 计算机科学, 2020, 47(6A): 400-403.
[7] 李泳. 基于BigQuant 大数据平台的股票投资策略开发[J]. 计算机科学, 2020, 47(6A): 612-615.
[8] 葛雨明, 韩庆文, 王妙琼, 曾令秋, 李璐. 汽车大数据应用模式与挑战分析[J]. 计算机科学, 2020, 47(6): 59-65.
[9] 刘纪芹, 史开泉. 大数据分解-融合及其智能获取[J]. 计算机科学, 2020, 47(6): 66-73.
[10] 曾伟良, 吴淼森, 孙为军, 谢胜利. 自动驾驶出租车调度系统研究综述[J]. 计算机科学, 2020, 47(5): 181-189.
[11] 禹鑫燚, 施甜峰, 唐权瑞, 殷慧武, 欧林林. 面向预测性维护的工业设备管理系统[J]. 计算机科学, 2020, 47(11A): 667-672.
[12] 郝秀梅, 史开泉. 大数据智能检索与大数据区块元智能分离[J]. 计算机科学, 2020, 47(11): 113-121.
[13] 徐鹤, 吴昊, 李鹏. 面向物联网的时空数据处理算法设计[J]. 计算机科学, 2020, 47(11): 310-315.
[14] 汪洋, 李鹏, 季一木, 樊卫北, 张玉杰, 王汝传, 陈国良. 高性能计算与天文大数据研究综述[J]. 计算机科学, 2020, 47(1): 1-6.
[15] 王童, 马文平, 罗维. 基于区块链的信息共享及安全多方计算模型[J]. 计算机科学, 2019, 46(9): 162-168.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75 .
[2] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[3] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[4] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[5] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99 .
[6] 周燕萍,业巧林. 基于L1-范数距离的最小二乘对支持向量机[J]. 计算机科学, 2018, 45(4): 100 -105 .
[7] 刘博艺,唐湘滟,程杰仁. 基于多生长时期模板匹配的玉米螟识别方法[J]. 计算机科学, 2018, 45(4): 106 -111 .
[8] 耿海军,施新刚,王之梁,尹霞,尹少平. 基于有向无环图的互联网域内节能路由算法[J]. 计算机科学, 2018, 45(4): 112 -116 .
[9] 崔琼,李建华,王宏,南明莉. 基于节点修复的网络化指挥信息系统弹性分析模型[J]. 计算机科学, 2018, 45(4): 117 -121 .
[10] 王振朝,侯欢欢,连蕊. 抑制CMT中乱序程度的路径优化方案[J]. 计算机科学, 2018, 45(4): 122 -125 .