Computer Science ›› 2015, Vol. 42 ›› Issue (1): 1-5,27.doi: 10.11896/j.issn.1002-137X.2015.01.001

    Next Articles

Similarity Joins on Massive Data Based on MapReduce Framework

PANG Jun, YU Ge, XU Jia and GU Yu   

  • Online:2018-11-14 Published:2018-11-14

Abstract: As a basic operation of large-scale data processing,similarity joins on large-scale data play an important role in document clustering,plagiarism detection,entity resolution and many other fields.On the other hand,the MapReduce programming model is widely applied to massive data processing because of its good scalability,fault tolerance and usability.Therefore,similarity joins on massive data based on MapReduce become one of the hot topics in the field of massive data processing.Firstly,big challenges of similarity join query introduced by rapid growth of data volume were generalized.Then,the definition of similarity joins on massive data was presented and similarity joins on massive data were classified according to three different standards.In addition,the current status of set,string and vector similarity joins on massive data were emphatically analyzed and compared from the aspects of efficiency,applicability and so on.Finally,the research focus and trend of this area were investigated and the promising solutions were suggested.

Key words: Massive data,Similarity join,MapReduce,Top-k

[1] Viktor M S,Kenneth C.Big data:A Revolution that will transform how we live,work and think[M].盛杨燕,周涛,译.杭州:浙江人民出版社,2013:9-10
[2] 李国杰.大数据研究的科学价值[J].中国计算机学会通讯,2012,9(8):8-15
[3] Miller R.‘Digital universe’ nears a zettabyte[EB/OL].[2012-11-24].
[4] Broder A Z,Glassman S C,Manasse M S,et al.Syntactic cluste-ring of the web[J].Computer Networks and ISCN Systems,1997,9(8-13):1157-1166
[5] Hoad T C,Zobel J.Methods for identifying versioned and plagi-arized documents[J].Journal of the American Society for Information Science and Technology,2003,4(3):203-215
[6] Kolb L,Thor A,Rahm E.Load balancing for MapReduce-based entity resolution [C]∥Proc of ICDE.Washington:IEEE,2012:618-629
[7] Cho J,Shivakumar N,Garcia-Molina H.Finding replicated web collections [C]∥Proc of SIGMOD.New York:ACM,2000:355-366
[8] 相似图片搜索引擎 TinEye [EB/OL].[2012-11-24].
[9] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[C]∥Proc of OSDI.New York:ACM,2004:137-150
[10] Zhang X F,Chen L,Wang M.Efficient multi-way theta-join processing using MapReduce[J].PVLDB,2012,5(11):1184-1195
[11] 庞俊,谷峪,许嘉,等.相似性连接查询技术研究进展[J].计算机科学与探索,2013,7(1):1-13
[12] 智慧城市.“智慧”来自大数据[EB/OL].[2012-11-24].
[13] CIO时代网.浅析大数据的特点[EB/OL].[2012-12-13].
[14] Baraglia R,Morales G D F,Lucchese C.Document similarityself-join with MapReduce[C]∥Proc of ICDM.Washington:IEEE,2010:731-736
[15] Vernica R,Carey M J,Li C.Efficient parallel set-similarity joins using MapReduce[C]∥Proc of SIGMOD.New York:ACM,2010:495-506
[16] Afrati F N,Sarma A D,Menestrina D,et al.Fuzzy joins using MapReduce[C]∥Proc of ICDE.Washington:IEEE,2012:834-845
[17] Silva Y N,Reed J M.Exploiting MapReduce-based similarityjoins [C]∥Proceedings of SIGMOD.New York:ACM,2012:693-696
[18] Metwally A,Faloutsos C.V-SMART-Join:a scalable MapRe-duce framework for all-pair similarity joins of multisets and vectors[C]∥Proc of VLDB.New York:ACM,2012:704-715
[19] Silva Y N,Reed J M,Tsosie L M.MapReduce-based similarity join for metric spaces[C]∥Proc of Cloud-I.New York:ACM,2012:3
[20] Kim Y H,Shim K.Parallel top-k similarity join algorithms using MapReduce[C]∥Proc of ICDE.Washington:IEEE,2012:510-521
[21] Okcan A,Riedewald M.Processing theta joins using MapReduce[C]∥Proc of SIGMOD.New York:ACM,2011:949-960
[22] Zhang C,Li F F,Jestes J.Efficient parallel kNN joins for large data in MapReduce[C]∥Proc of EDBT.New York:ACM,2012:38-49
[23] Lu W,Shen Y Y,Chen S,et al.Efficient Processing of k nearest neighbor joins using MapReduce[J].PVLDB,2012,5(10):1016-1027
[24] Yokoyama T,Ishikawa Y,Suzuk Y.Processing all k-nearestneighbor queries in hadoop[C]∥Proc of WAIM.Berlin:Sprin-ger,2012:346-351
[25] Bentley J L,Shamos M I.Divide-and-conquer in multidimensional space[C]∥Proc of STOC.New York:ACM,1976:220-230
[26] Shim K,Srikant R,Agrawal R.High-dimensional similarity joins[C]∥Proc of ICDE.Washington:IEEE,1997:301-311
[27] Shafer J C,Agrawal R.Parallel algorithms for high-dimensional similarity joins for data mining applications[C]∥Proc of VLDB.New York:ACM,1997:176-185 (下转第27页)(上接第5页)
[28] Fagin R,Lotem A,Naor M.Optimal aggregation algorithms for middleware[J].Computer and System Sciences,2003,6(4):614-656
[29] Sarawagi S,Kirpal A.Efficient set joins on similarity predicates[C]∥Proc of SIGMOD.New York:ACM,2004:743-754
[30] Xiao C,Wang W,Lin X M,et al.Efficient similarity joins fornear duplicate detection [C]∥Proc of WWW.New York:ACM,2008:131-140
[31] Jacox E H,Samet H.Metric space similarity joins [J].ACM Transactions on Database Systems,2008,33(2):1-38
[32] Xiao C,Wang W,Lin X M,et al.Top-k set similarity joins[C]∥Proc of ICDE.Washington:IEEE,2009:916-927
[33] Zhu S W,Wu J J,Xia G P.Top-k cosine similarity interesting pairs search[C]∥Proc of FSKD.Washington:IEEE,2010:1479-1483
[34] Blanas S,Patel J M,Ercegovac V,et al.A comparison of join algorithms for log processing in MaPreduce[C]∥Proc of SIGMOD.New York:ACM,2010:975-986
[35] Li G L,Deng D,Wang J N,et al.Pass-Join:a partition-based method for similarity joins[C]∥Proc of VLDB.New York:ACM,2012:253-264
[36] Lynden S J,Tanimura Y,Kojima I,et al.Dynamic data redistribution for MapReduce joins[C]∥Proc of Cloud Com.Washington:IEEE,2011:717-723
[37] Afrati F N,Ullman J D.Optimizing multiway joins in a Map-Reduce environment [J].TKDE,2011,23(9):1282-1298
[38] Zhang X F,Chen L,Wang M.Towards Efficient join processing over large rdf graph using MapReduce[C]∥Proc of SSDBM.Berlin:Springer,2012:250-259
[39] Lin J.Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce[C]∥Proc of SIGIR.New York:ACM,2009:155-162
[40] Elsayed T,Lin J J,Oard D W.Pairwise document similarity in large collections with MapReduce[C]∥Proc of ACL.Stroudsburg:Association for Computational Linguistics,2008:265-268
[41] Li B,Mazur E,Diao Y L,et al.A platform for scalable one-pass analytics using MapReduce[C]∥Proc of SIGMOD.New York:ACM,2011:985-996
[42] Gufler B,Augsten N,Reiser A,et al.Handling data skew in MapReduce[C]∥Proc of CLOSER.SciTePress,2011:574-583
[43] Gufler B,Augsten N,Reiser A,et al.Load balancing in MapReduce based on scalable cardinality estimates[C]∥Proc of ICDE.Washington:IEEE,2012:522-533

No related articles found!
Full text



[1] LEI Li-hui and WANG Jing. Parallelization of LTL Model Checking Based on Possibility Measure[J]. Computer Science, 2018, 45(4): 71 -75, 88 .
[2] WANG Huan, ZHANG Yun-feng and ZHANG Yan. Rapid Decision Method for Repairing Sequence Based on CFDs[J]. Computer Science, 2018, 45(3): 311 -316 .
[3] SHI Wen-jun, WU Ji-gang and LUO Yu-chun. Fast and Efficient Scheduling Algorithms for Mobile Cloud Offloading[J]. Computer Science, 2018, 45(4): 94 -99, 116 .
[4] ZHOU Yan-ping and YE Qiao-lin. L1-norm Distance Based Least Squares Twin Support Vector Machine[J]. Computer Science, 2018, 45(4): 100 -105, 130 .
[5] WANG Zhen-wu, LV Xiao-hua and HAN Xiao-hui. Survey of Terrain LOD Technology Based on Quadtree Segmentation[J]. Computer Science, 2018, 45(4): 34 -45 .
[6] GENG Hai-jun, SHI Xin-gang, WANG Zhi-liang, YIN Xia and YIN Shao-ping. Energy-efficient Intra-domain Routing Algorithm Based on Directed Acyclic Graph[J]. Computer Science, 2018, 45(4): 112 -116 .
[7] CUI Qiong, LI Jian-hua, WANG Hong and NAN Ming-li. Resilience Analysis Model of Networked Command Information System Based on Node Repairability[J]. Computer Science, 2018, 45(4): 117 -121, 136 .
[8] LI Shan and RAO Wen-bi. Video-based Detection of Human Motion Area in Mine[J]. Computer Science, 2018, 45(4): 291 -295 .
[9] LIAO Xing, YUAN Jing-ling and CHEN Min-cheng. Parallel PSO Container Packing Algorithm with Adaptive Weight[J]. Computer Science, 2018, 45(3): 231 -234, 273 .
[10] SHI Chao, XIE Zai-peng, LIU Han and LV Xin. Optimization of Container Deployment Strategy Based on Stable Matching[J]. Computer Science, 2018, 45(4): 131 -136 .