计算机科学 ›› 2019, Vol. 46 ›› Issue (12): 38-44.doi: 10.11896/jsjkx.190100240
王晓霞, 孙德才
WANG Xiao-xia, SUN De-cai
摘要: 局部相似连接能快速找出数据集间的局部相似记录对,是基因序列比对、剽窃检测和数据清洗等研究领域的基本操作。文中主要研究基于MapReduce框架的并行相似连接技术,提出了一种基于Q-sample的局部相似连接算法,解决了局部相似连接的定位问题。该算法采用了过滤验证二阶段模式:在过滤阶段,所提算法使用Q-sample分割方案拆分字符串集,在不丢失任何匹配的基础上生成了高质量的子串,抛弃了大量的无关字符串对;在验证阶段,所提算法优化了LS-Join算法的双向扩展验证方法,通过去除冗余匹配、合并连续匹配和合并非连续匹配等技术提高了算法的验证效率。通过实验对比了不同数据集和编辑距离参数下算法的性能表现,结果显示所提算法在大数据集上的局部相似连接速度快于当前的优秀算法LS-Join。理论分析和实验结果证明,所提算法的相关技术提高了局部相似的连接性能。
中图分类号:
[1] | SILVA Y N,PEARSON S S,CHON J,et al.Similarity Joins:Their implementation and interactions with other databaseopera-tors[J].Information Systems,2015,52(8/9):149-162. |
[2] | YU M,LI G,DENG D,et al.String similarity search and join:a survey[J].Frontiers of Computer Science,2016,10(3):399-417. |
[3] | PAGH R.Large-scale similarity joins with guarantees[C]//Proceedings of 18th International Conference on Database Theory.Brussels,Belgium:Schloss Dagstuhl,2015:15-24. |
[4] | PANG J,YU G,XU J,et al.Similarity Joins on Massive Data Based on MapReduce Framework[J].Computer Science,2015,42(1):1-5,27. |
[5] | LEVENSHTEIN V.Binary codes capable of correcting dele- tions,insertions,and reversals[J].Soviet Physics Doklady,1966,10(8):707-710. |
[6] | LIN C,YU H,WENG W,et al.Large-Scale Similarity Join with Edit-Distance Constraints[C]//Proceedings of 19th InternationalConference on Database Systems for Advanced Applications.Bali Indonesia:Springer,2014:328-342. |
[7] | DENG D,LI G L,HAO S,et al.MassJoin:A MapReduce-based Method for Scalable String Similarity Joins[C]//Proceedings of IEEE 30th International Conference on Data Engineering.New York,USA:IEEE,2014:340-351. |
[8] | CHEN G,YANG K,CHEN L,et al.Metric similarity joins using MapReduce[J].IEEE Transactions on Knowledge and Data Engineering,2017,29(3):656-669. |
[9] | WANG J Y,YANG X C,WANG B,et al.LS-Join:Local Similarity Join on String Collections[J].IEEE Transactions Know-ledge and Data Engineering,2017,29(9):1928-1942. |
[10] | BAYARDO R J,MA Y,SRIKANT R.Scaling up all pairs similarity search[C]//Proceedings of 16th International Conference on World Wide Web.New York,USA:ACM,2007:131-140. |
[11] | XIAO C,WANG W,LIN X.Ed-join:an efficient algorithm for similarity joins with edit distance constraints[J].Proceedings of the Vldb Endowment,2008,1(1):933-944. |
[12] | XIAO C,WANG W,LIN X,et al.Efficient similarity joins for near-duplicate detection[J].ACM Transactions on Database Systems,2011,36(3):1-41. |
[13] | LI G,DONG D,WANG J,et al.PASS-JOIN:A Partition-based Method for Similarity Joins[J].Proceedings of the Vldb Endowment,2011,5(3):253-264. |
[14] | WANG W,QIN J,XIAO C,et al.Vchunkjoin:An efficient algorithm for edit similarity joins[J].IEEE Transactions on Know-ledge and Data Engineering,2013,25(8):1916-1929. |
[15] | SHANG Z,LIU Y,LI G,et al.K-Join:Knowledge-Aware Similarity Join[J].IEEE Transactions on Knowledge and Data Engineering,2016,28(12):3293-3308. |
[16] | VERNICA R,CAREY M J,LI C.Efficient parallel set-similarity joins using MapReduce[C]//Proceedings of 2010 ACM SIGMOD International Conference on Management of Data.Indianapolis,IN,USA:ACM,2010:495-506. |
[17] | METWALLY A,FALOUTSOS C.V-SMART-Join:A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors[J].Proceedings of the Vldb Endowment,2012,5(8):704-715. |
[18] | AFRATI F N,SARMA A D,MENESTRINA D,et al.Fuzzy Joins Using MapReduce[C]//Proceedings of IEEE 28th International Conference on Data Engineering.Washington,DC:IEEE,2012:498-509. |
[19] | MA Y,MENG X,WANG S.Parallel similarity joins on massive high-dimensional data using MapReduce[J].Concurrency and Computation:Practice and Experience,2016,28(1):166-183. |
[20] | RONG C T,LIN C B,SILVA Y N,et al.Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics[C]//Proceedings of IEEE 33rd International Conference on Data Engineering.New York,USA:IEEE,2017:1059-1070. |
[1] | 叶雅珍, 刘国华, 朱扬勇. 数据产品流通的两阶段授权模式[J]. 计算机科学, 2021, 48(1): 119-124. |
[2] | 赵会群, 吴凯锋. 一种大数据估价算法[J]. 计算机科学, 2020, 47(9): 110-116. |
[3] | 马梦宇, 吴烨, 陈荦, 伍江江, 李军, 景宁. 显示导向型的大规模地理矢量实时可视化技术[J]. 计算机科学, 2020, 47(9): 117-122. |
[4] | 刘振鹏, 苏楠, 秦益文, 卢家欢, 李小菲. FS-CRF:基于特征切分与级联随机森林的异常点检测模型[J]. 计算机科学, 2020, 47(8): 185-188. |
[5] | 朝乐门. 数据科学导论的课程设计及教学改革[J]. 计算机科学, 2020, 47(7): 1-7. |
[6] | 顾荣杰, 吴治平, 石焕. 基于TFR 模型的公安云平台数据分级分类安全访问控制模型研究[J]. 计算机科学, 2020, 47(6A): 400-403. |
[7] | 李泳. 基于BigQuant 大数据平台的股票投资策略开发[J]. 计算机科学, 2020, 47(6A): 612-615. |
[8] | 葛雨明, 韩庆文, 王妙琼, 曾令秋, 李璐. 汽车大数据应用模式与挑战分析[J]. 计算机科学, 2020, 47(6): 59-65. |
[9] | 刘纪芹, 史开泉. 大数据分解-融合及其智能获取[J]. 计算机科学, 2020, 47(6): 66-73. |
[10] | 曾伟良, 吴淼森, 孙为军, 谢胜利. 自动驾驶出租车调度系统研究综述[J]. 计算机科学, 2020, 47(5): 181-189. |
[11] | 禹鑫燚, 施甜峰, 唐权瑞, 殷慧武, 欧林林. 面向预测性维护的工业设备管理系统[J]. 计算机科学, 2020, 47(11A): 667-672. |
[12] | 郝秀梅, 史开泉. 大数据智能检索与大数据区块元智能分离[J]. 计算机科学, 2020, 47(11): 113-121. |
[13] | 徐鹤, 吴昊, 李鹏. 面向物联网的时空数据处理算法设计[J]. 计算机科学, 2020, 47(11): 310-315. |
[14] | 汪洋, 李鹏, 季一木, 樊卫北, 张玉杰, 王汝传, 陈国良. 高性能计算与天文大数据研究综述[J]. 计算机科学, 2020, 47(1): 1-6. |
[15] | 王童, 马文平, 罗维. 基于区块链的信息共享及安全多方计算模型[J]. 计算机科学, 2019, 46(9): 162-168. |
|