Computer Science ›› 2017, Vol. 44 ›› Issue (5): 20-25.doi: 10.11896/j.issn.1002-137X.2017.05.004

Previous Articles     Next Articles

MapReduce Based Similarity Self-join Algorithm for Big Dataset

SUN De-cai and WANG Xiao-xia   

  • Online:2018-11-13 Published:2018-11-13

Abstract: How to find out duplicates/similarities in dataset is a key issue in big data processing.Similarity join is a va-lid operation for finding similarities,and similarity join algorithm based on MapReduce has attracted serious concern for the advantage of processing big dataset.In this paper,similarity self-join algorithms were researched and some factors which slow self-join were discovered.To accelerate self-join,an improved similarity self-join algorithm based on Massjoin was proposed.In filtration stage,new filtration criterion is added to eliminating self-join redundant pairs.In verification stage,the techniques of backward-forward pairs and combined id are adopted to eliminate more self-join redundant candidate pairs,and the dataset is scanned only once in reading original strings.The experimental results demonstrate that both filtration CPU time and the verification CPU time of new algorithm decrease.As a result,the efficiency of similarity self-join algorithm is increased by using our revision strategies.

Key words: Similarity join,Big data,MapReduce,Data cleaning

[1] YU W,LI S J,YANG S,et al.Automatically Discovering of Inconsistency Among Cross-Source Data Based on Web Big Data[J].Journal of Computer Research & Development,2015,52(2):295-308.(in Chinese) 余伟,李石君,杨莎,等.Web大数据环境下的不一致跨源数据发现[J].计算机研究与发展,2015,52(2):295-308.
[2] XU J,WANG G Y,YU H.Review of Big Data Processing Based on Granular Computing[J].Chinese Journal of Computers,2015,38(8):1497-1517.(in Chinese) 徐计,王国胤,于洪.基于粒计算的大数据处理[J].计算机学报,2015,38(8):1497-1517.
[3] M Y Z,M X F,W S Y.Parallel similarity joins on massive high-dimensional data using MapReduce[J].Concurrency and Computation:Practice and Experience,2016,28(1):166-83.
[4] SILVA Y N,PEARSON S S,CHON J,et al.Similarity Joins:Their implementation and interactions with other database opera-tors[J].Information Systems,2015,52(8-9):149-162.
[5] SANTOS L F D,OLMES CARVALHO L,OLIVEIRA W D,et al.Diversity in similarity joins[C]∥Proceedings of the 8th International Conference on Similarity Search and Applications(SISAP 2015).Glasgow,UK:Springer,2015:42-53.
[6] PANG J,YU G,XU J,et al.Similarity Joins on Massive Data Based on MapReduce Framework[J].Computer Science,2015,2(1):1-5,27.(in Chinese) 庞俊,于戈,许嘉,等.基于MapReduce框架的海量数据相似性连接研究进展[J].计算机科学,2015,2(1):1-5,27.
[7] WANG H Y,YANG L H,LIU X Q.Optimizing Top-k Similarity Join Algorithm[J].Journal of Software,2016,7(12):3051-3066.(in Chinese) 王洪亚,杨利宏,刘晓强.Top-k相似连接算法性能优化研究[J].软件学报,2016,7(12):3051-3066.
[8] CHEN Z J,ZHANG J N,LIU W Y.Region-based Spatial-textualSimilarity Join in MapReduce [J].Journal of Chinese Computer Systems,2015,36(10):2245-2251.(in Chinese) 陈子军,张娟娜,刘文远.MapReduce框架下基于范围的空间文本相似连接[J].小型微型计算机系统,2015,36(10):2245-2251.
[9] ZHANG M.An Approach of Near-duplicate Detection of Mass Data Based on MapReduce[J].Research and Exploration In Laboratory,2014,33(9):132-136.(in Chinese) 张敏.海量数据的MapReduce相似度检测[J].实验室研究与探索,2014,33(9):132-136.
[10] LEVENSHTEIN V.Binary codes capable of correcting dele-tions,insertions,and reversals[J].Soviet Physics Doklady,1966,10(8):707-710.
[11] DENG D,LI G L,HAO S,et al.MassJoin:A MapReduce-based Method for Scalable String Similarity Joins[C]∥Proceedings of IEEE 30th International Conference on Data Engineering (ICDE).New York:IEEE,2014:340-351.
[12] WANG J,LI G,FENG J.Trie-join:Efficient trie-based string simi-larity joins with edit-distance constraints[J].PVLDB,2010,3(1):1219-1230.
[13] BAYARDO R J,MA Y,SRIKANT R.Scaling up all pairs similarity search[C]∥Proceedings of 16th International World Wide Web Conference(WWW 2007).Banff,AB:ACM,2007:131-140.
[14] XIAO C,WANG W,LIN X,et al.Efficient similarity joins for near-duplicate detection[J].ACM Trans.Database Syst.,2011,36(3):1-41.
[15] XIAO C,WANG W,LIN X.Ed-join:an efficient algorithm for similarity joins with edit distance constraints[J].Proceedings of the Vldb Endowment,2008,1(1):933-944.
[16] WANG W,QIN J,XIAO C,et al.Vchunkjoin:An efficient algorithm for edit similarity joins[J].IEEE Trans.Knowl.Data Eng.,2013,25(8):1916-1929.
[17] ARASU A,GANTI V,KAUSHI KR.Efficient exact set-simi-larity joins[C]∥Proceedings of the 32nd International Confe-rence on Very Large Data Bases.Seoul,Korea:VLDB Endowment,2006:918-929.
[18] LI G,DONG D,WANG J,et al.PASS-JOIN:A Partition-based Method for Similarity Joins[J].Proceedings of the Vldb Endowment,2011,5(3):253-264.
[19] VERNICA R,CAREY M J,LI C.Efficient parallel set-similarity joins using MapReduce[C]∥Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data.Indiana-polis,IN,USA:ACM,2010:495-506.
[20] METWALLY A,FALOUTSOS C.V-SMART-Join:A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors[J].Proceedings of the Vldb Endowment,2012,5(8):704-715.
[21] AFRATI F N,SARMA A D,MENESTRINA D,et al.FuzzyJoins Using MapReduce[C]∥Proceedings of IEEE 28th International Conference on Data Engineering.Washington,DC:IEEE,2012:498-509.
[22] LIN C,YU H,WENG W,et al.Large-Scale Similarity Join with Edit-Distance Constraints[C]∥Proceedings of 19th InternationalConference on Database Systems for Advanced Applications(DASFAA 2014).Bali,Indonesia:Springer International Publishing,2014:328-342.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!