基于MapReduce框架的海量数据相似性连接研究进展

doi:10.11896/j.issn.1002-137X.2015.01.001

摘要/Abstract

摘要： 海量数据相似性连接作为海量数据处理的基本操作,在文本聚类、剽窃检测、实体解析等研究领域具有重要作用。另一方面,MapReduce编程模型因为具有良好的可扩放性、容错性和易用性,被广泛地应用于海量数据处理。因此,基于MapReduce框架的海量数据相似性连接查询技术成为海量数据处理领域的热点问题之一。首先,概括了海量数据固有特点和MapReduce编程框架的缺陷给现有相似性连接查询技术带来的巨大挑战；其次,提出了海量数据相似性连接的定义,按3种不同的分类标准对其进行了分类；接着,重点分析了集合、字符串和向量数据类型的海量相似性连接查询最新技术,并从效率和适用范围等方面分别对这些技术进行了比较；最后,讨论了海量数据相似性连接查询技术亟待解决的关键问题,并提出了一些有前景的解决方案。

关键词: 海量数据,相似性连接,MapReduce,Top-k

Abstract: As a basic operation of large-scale data processing,similarity joins on large-scale data play an important role in document clustering,plagiarism detection,entity resolution and many other fields.On the other hand,the MapReduce programming model is widely applied to massive data processing because of its good scalability,fault tolerance and usability.Therefore,similarity joins on massive data based on MapReduce become one of the hot topics in the field of massive data processing.Firstly,big challenges of similarity join query introduced by rapid growth of data volume were generalized.Then,the definition of similarity joins on massive data was presented and similarity joins on massive data were classified according to three different standards.In addition,the current status of set,string and vector similarity joins on massive data were emphatically analyzed and compared from the aspects of efficiency,applicability and so on.Finally,the research focus and trend of this area were investigated and the promising solutions were suggested.

Key words: Massive data,Similarity join,MapReduce,Top-k

庞俊,于戈,许嘉,谷峪. 基于MapReduce框架的海量数据相似性连接研究进展[J]. 计算机科学, 2015, 42(1): 1-5. https://doi.org/10.11896/j.issn.1002-137X.2015.01.001

PANG Jun, YU Ge, XU Jia and GU Yu. Similarity Joins on Massive Data Based on MapReduce Framework[J]. Computer Science, 2015, 42(1): 1-5. https://doi.org/10.11896/j.issn.1002-137X.2015.01.001

参考文献

[1] Viktor M S,Kenneth C.Big data:A Revolution that will transform how we live,work and think[M].盛杨燕,周涛,译.杭州:浙江人民出版社,2013:9-10
[2] 李国杰.大数据研究的科学价值[J].中国计算机学会通讯,2012,9(8):8-15
[3] Miller R.‘Digital universe’ nears a zettabyte[EB/OL].[2012-11-24].http://www.datacenterknowledge.com/archives/2010/05/04/digital-universe-nears-a-zettabyte/
[4] Broder A Z,Glassman S C,Manasse M S,et al.Syntactic cluste-ring of the web[J].Computer Networks and ISCN Systems,1997,9(8-13):1157-1166
[5] Hoad T C,Zobel J.Methods for identifying versioned and plagi-arized documents[J].Journal of the American Society for Information Science and Technology,2003,4(3):203-215
[6] Kolb L,Thor A,Rahm E.Load balancing for MapReduce-based entity resolution [C]∥Proc of ICDE.Washington:IEEE,2012:618-629
[7] Cho J,Shivakumar N,Garcia-Molina H.Finding replicated web collections [C]∥Proc of SIGMOD.New York:ACM,2000:355-366
[8] 相似图片搜索引擎 TinEye [EB/OL].[2012-11-24].http://www.ipc.me/tineye.html
[9] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[C]∥Proc of OSDI.New York:ACM,2004:137-150
[10] Zhang X F,Chen L,Wang M.Efficient multi-way theta-join processing using MapReduce[J].PVLDB,2012,5(11):1184-1195
[11] 庞俊,谷峪,许嘉,等.相似性连接查询技术研究进展[J].计算机科学与探索,2013,7(1):1-13
[12] 智慧城市.“智慧”来自大数据[EB/OL].[2012-11-24].http://www.im2m.com.cn/107/10492963023.shtml
[13] CIO时代网.浅析大数据的特点[EB/OL].[2012-12-13].http://www.ciotimes.com/baike/62989.html
[14] Baraglia R,Morales G D F,Lucchese C.Document similarityself-join with MapReduce[C]∥Proc of ICDM.Washington:IEEE,2010:731-736
[15] Vernica R,Carey M J,Li C.Efficient parallel set-similarity joins using MapReduce[C]∥Proc of SIGMOD.New York:ACM,2010:495-506
[16] Afrati F N,Sarma A D,Menestrina D,et al.Fuzzy joins using MapReduce[C]∥Proc of ICDE.Washington:IEEE,2012:834-845
[17] Silva Y N,Reed J M.Exploiting MapReduce-based similarityjoins [C]∥Proceedings of SIGMOD.New York:ACM,2012:693-696
[18] Metwally A,Faloutsos C.V-SMART-Join:a scalable MapRe-duce framework for all-pair similarity joins of multisets and vectors[C]∥Proc of VLDB.New York:ACM,2012:704-715
[19] Silva Y N,Reed J M,Tsosie L M.MapReduce-based similarity join for metric spaces[C]∥Proc of Cloud-I.New York:ACM,2012:3
[20] Kim Y H,Shim K.Parallel top-k similarity join algorithms using MapReduce[C]∥Proc of ICDE.Washington:IEEE,2012:510-521
[21] Okcan A,Riedewald M.Processing theta joins using MapReduce[C]∥Proc of SIGMOD.New York:ACM,2011:949-960
[22] Zhang C,Li F F,Jestes J.Efficient parallel kNN joins for large data in MapReduce[C]∥Proc of EDBT.New York:ACM,2012:38-49
[23] Lu W,Shen Y Y,Chen S,et al.Efficient Processing of k nearest neighbor joins using MapReduce[J].PVLDB,2012,5(10):1016-1027
[24] Yokoyama T,Ishikawa Y,Suzuk Y.Processing all k-nearestneighbor queries in hadoop[C]∥Proc of WAIM.Berlin:Sprin-ger,2012:346-351
[25] Bentley J L,Shamos M I.Divide-and-conquer in multidimensional space[C]∥Proc of STOC.New York:ACM,1976:220-230
[26] Shim K,Srikant R,Agrawal R.High-dimensional similarity joins[C]∥Proc of ICDE.Washington:IEEE,1997:301-311
[27] Shafer J C,Agrawal R.Parallel algorithms for high-dimensional similarity joins for data mining applications[C]∥Proc of VLDB.New York:ACM,1997:176-185 (下转第27页)(上接第5页)
[28] Fagin R,Lotem A,Naor M.Optimal aggregation algorithms for middleware[J].Computer and System Sciences,2003,6(4):614-656
[29] Sarawagi S,Kirpal A.Efficient set joins on similarity predicates[C]∥Proc of SIGMOD.New York:ACM,2004:743-754
[30] Xiao C,Wang W,Lin X M,et al.Efficient similarity joins fornear duplicate detection [C]∥Proc of WWW.New York:ACM,2008:131-140
[31] Jacox E H,Samet H.Metric space similarity joins [J].ACM Transactions on Database Systems,2008,33(2):1-38
[32] Xiao C,Wang W,Lin X M,et al.Top-k set similarity joins[C]∥Proc of ICDE.Washington:IEEE,2009:916-927
[33] Zhu S W,Wu J J,Xia G P.Top-k cosine similarity interesting pairs search[C]∥Proc of FSKD.Washington:IEEE,2010:1479-1483
[34] Blanas S,Patel J M,Ercegovac V,et al.A comparison of join algorithms for log processing in MaPreduce[C]∥Proc of SIGMOD.New York:ACM,2010:975-986
[35] Li G L,Deng D,Wang J N,et al.Pass-Join:a partition-based method for similarity joins[C]∥Proc of VLDB.New York:ACM,2012:253-264
[36] Lynden S J,Tanimura Y,Kojima I,et al.Dynamic data redistribution for MapReduce joins[C]∥Proc of Cloud Com.Washington:IEEE,2011:717-723
[37] Afrati F N,Ullman J D.Optimizing multiway joins in a Map-Reduce environment [J].TKDE,2011,23(9):1282-1298
[38] Zhang X F,Chen L,Wang M.Towards Efficient join processing over large rdf graph using MapReduce[C]∥Proc of SSDBM.Berlin:Springer,2012:250-259
[39] Lin J.Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce[C]∥Proc of SIGIR.New York:ACM,2009:155-162
[40] Elsayed T,Lin J J,Oard D W.Pairwise document similarity in large collections with MapReduce[C]∥Proc of ACL.Stroudsburg:Association for Computational Linguistics,2008:265-268
[41] Li B,Mazur E,Diao Y L,et al.A platform for scalable one-pass analytics using MapReduce[C]∥Proc of SIGMOD.New York:ACM,2011:985-996
[42] Gufler B,Augsten N,Reiser A,et al.Handling data skew in MapReduce[C]∥Proc of CLOSER.SciTePress,2011:574-583
[43] Gufler B,Augsten N,Reiser A,et al.Load balancing in MapReduce based on scalable cardinality estimates[C]∥Proc of ICDE.Washington:IEEE,2012:522-533

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed