基于MapReduce框架的海量数据相似性连接研究进展

doi:10.11896/j.issn.1002-137X.2015.01.001

Abstract

Abstract: As a basic operation of large-scale data processing,similarity joins on large-scale data play an important role in document clustering,plagiarism detection,entity resolution and many other fields.On the other hand,the MapReduce programming model is widely applied to massive data processing because of its good scalability,fault tolerance and usability.Therefore,similarity joins on massive data based on MapReduce become one of the hot topics in the field of massive data processing.Firstly,big challenges of similarity join query introduced by rapid growth of data volume were generalized.Then,the definition of similarity joins on massive data was presented and similarity joins on massive data were classified according to three different standards.In addition,the current status of set,string and vector similarity joins on massive data were emphatically analyzed and compared from the aspects of efficiency,applicability and so on.Finally,the research focus and trend of this area were investigated and the promising solutions were suggested.

Key words: Massive data,Similarity join,MapReduce,Top-k

PANG Jun, YU Ge, XU Jia and GU Yu. Similarity Joins on Massive Data Based on MapReduce Framework[J].Computer Science, 2015, 42(1): 1-5.

References

[1] Viktor M S,Kenneth C.Big data:A Revolution that will transform how we live,work and think[M].盛杨燕,周涛,译.杭州:浙江人民出版社,2013:9-10
[2] 李国杰.大数据研究的科学价值[J].中国计算机学会通讯,2012,9(8):8-15
[3] Miller R.‘Digital universe’ nears a zettabyte[EB/OL].[2012-11-24].http://www.datacenterknowledge.com/archives/2010/05/04/digital-universe-nears-a-zettabyte/
[4] Broder A Z,Glassman S C,Manasse M S,et al.Syntactic cluste-ring of the web[J].Computer Networks and ISCN Systems,1997,9(8-13):1157-1166
[5] Hoad T C,Zobel J.Methods for identifying versioned and plagi-arized documents[J].Journal of the American Society for Information Science and Technology,2003,4(3):203-215
[6] Kolb L,Thor A,Rahm E.Load balancing for MapReduce-based entity resolution [C]∥Proc of ICDE.Washington:IEEE,2012:618-629
[7] Cho J,Shivakumar N,Garcia-Molina H.Finding replicated web collections [C]∥Proc of SIGMOD.New York:ACM,2000:355-366
[8] 相似图片搜索引擎 TinEye [EB/OL].[2012-11-24].http://www.ipc.me/tineye.html
[9] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[C]∥Proc of OSDI.New York:ACM,2004:137-150
[10] Zhang X F,Chen L,Wang M.Efficient multi-way theta-join processing using MapReduce[J].PVLDB,2012,5(11):1184-1195
[11] 庞俊,谷峪,许嘉,等.相似性连接查询技术研究进展[J].计算机科学与探索,2013,7(1):1-13
[12] 智慧城市.“智慧”来自大数据[EB/OL].[2012-11-24].http://www.im2m.com.cn/107/10492963023.shtml
[13] CIO时代网.浅析大数据的特点[EB/OL].[2012-12-13].http://www.ciotimes.com/baike/62989.html
[14] Baraglia R,Morales G D F,Lucchese C.Document similarityself-join with MapReduce[C]∥Proc of ICDM.Washington:IEEE,2010:731-736
[15] Vernica R,Carey M J,Li C.Efficient parallel set-similarity joins using MapReduce[C]∥Proc of SIGMOD.New York:ACM,2010:495-506
[16] Afrati F N,Sarma A D,Menestrina D,et al.Fuzzy joins using MapReduce[C]∥Proc of ICDE.Washington:IEEE,2012:834-845
[17] Silva Y N,Reed J M.Exploiting MapReduce-based similarityjoins [C]∥Proceedings of SIGMOD.New York:ACM,2012:693-696
[18] Metwally A,Faloutsos C.V-SMART-Join:a scalable MapRe-duce framework for all-pair similarity joins of multisets and vectors[C]∥Proc of VLDB.New York:ACM,2012:704-715
[19] Silva Y N,Reed J M,Tsosie L M.MapReduce-based similarity join for metric spaces[C]∥Proc of Cloud-I.New York:ACM,2012:3
[20] Kim Y H,Shim K.Parallel top-k similarity join algorithms using MapReduce[C]∥Proc of ICDE.Washington:IEEE,2012:510-521
[21] Okcan A,Riedewald M.Processing theta joins using MapReduce[C]∥Proc of SIGMOD.New York:ACM,2011:949-960
[22] Zhang C,Li F F,Jestes J.Efficient parallel kNN joins for large data in MapReduce[C]∥Proc of EDBT.New York:ACM,2012:38-49
[23] Lu W,Shen Y Y,Chen S,et al.Efficient Processing of k nearest neighbor joins using MapReduce[J].PVLDB,2012,5(10):1016-1027
[24] Yokoyama T,Ishikawa Y,Suzuk Y.Processing all k-nearestneighbor queries in hadoop[C]∥Proc of WAIM.Berlin:Sprin-ger,2012:346-351
[25] Bentley J L,Shamos M I.Divide-and-conquer in multidimensional space[C]∥Proc of STOC.New York:ACM,1976:220-230
[26] Shim K,Srikant R,Agrawal R.High-dimensional similarity joins[C]∥Proc of ICDE.Washington:IEEE,1997:301-311
[27] Shafer J C,Agrawal R.Parallel algorithms for high-dimensional similarity joins for data mining applications[C]∥Proc of VLDB.New York:ACM,1997:176-185 (下转第27页)(上接第5页)
[28] Fagin R,Lotem A,Naor M.Optimal aggregation algorithms for middleware[J].Computer and System Sciences,2003,6(4):614-656
[29] Sarawagi S,Kirpal A.Efficient set joins on similarity predicates[C]∥Proc of SIGMOD.New York:ACM,2004:743-754
[30] Xiao C,Wang W,Lin X M,et al.Efficient similarity joins fornear duplicate detection [C]∥Proc of WWW.New York:ACM,2008:131-140
[31] Jacox E H,Samet H.Metric space similarity joins [J].ACM Transactions on Database Systems,2008,33(2):1-38
[32] Xiao C,Wang W,Lin X M,et al.Top-k set similarity joins[C]∥Proc of ICDE.Washington:IEEE,2009:916-927
[33] Zhu S W,Wu J J,Xia G P.Top-k cosine similarity interesting pairs search[C]∥Proc of FSKD.Washington:IEEE,2010:1479-1483
[34] Blanas S,Patel J M,Ercegovac V,et al.A comparison of join algorithms for log processing in MaPreduce[C]∥Proc of SIGMOD.New York:ACM,2010:975-986
[35] Li G L,Deng D,Wang J N,et al.Pass-Join:a partition-based method for similarity joins[C]∥Proc of VLDB.New York:ACM,2012:253-264
[36] Lynden S J,Tanimura Y,Kojima I,et al.Dynamic data redistribution for MapReduce joins[C]∥Proc of Cloud Com.Washington:IEEE,2011:717-723
[37] Afrati F N,Ullman J D.Optimizing multiway joins in a Map-Reduce environment [J].TKDE,2011,23(9):1282-1298
[38] Zhang X F,Chen L,Wang M.Towards Efficient join processing over large rdf graph using MapReduce[C]∥Proc of SSDBM.Berlin:Springer,2012:250-259
[39] Lin J.Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce[C]∥Proc of SIGIR.New York:ACM,2009:155-162
[40] Elsayed T,Lin J J,Oard D W.Pairwise document similarity in large collections with MapReduce[C]∥Proc of ACL.Stroudsburg:Association for Computational Linguistics,2008:265-268
[41] Li B,Mazur E,Diao Y L,et al.A platform for scalable one-pass analytics using MapReduce[C]∥Proc of SIGMOD.New York:ACM,2011:985-996
[42] Gufler B,Augsten N,Reiser A,et al.Handling data skew in MapReduce[C]∥Proc of CLOSER.SciTePress,2011:574-583
[43] Gufler B,Augsten N,Reiser A,et al.Load balancing in MapReduce based on scalable cardinality estimates[C]∥Proc of ICDE.Washington:IEEE,2012:522-533

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

Similarity Joins on Massive Data Based on MapReduce Framework

PDF (PC)

Abstract

Cite this article

share this article

References

Related Articles 0

Metrics

Comments

Recommended 0