Sun Decai, Wang Xiaoxia. MapReduce Based Similarity Self-join Algorithm for Big Dataset[J]. Computer Science, 2017, 44(5): 20-25, 32
MapReduce Based Similarity Self-join Algorithm for Big Dataset
Submitted: 2016-08-22  Revised: 2016-11-12
DOI:10.11896/j.issn.1002-137X.2017.05.004
Keywords: Similarity join, Big data, MapReduce, Data cleaning
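Funding: This work was supported by the Humanities and Social Sciences Research Youth Fund of the Ministry of Education of China (15YJC870021, 5YJC870028), the Liaoning Province Doctoral Research Start-up Fund (20141138), the Scientific Research Project of the Education Department of Liaoning Province (L2015010, L2014451), the Natural Science Foundation of Liaoning Province (2015020009), and the Youth Program of the National Natural Science Foundation of China (61602056).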
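Author  Affiliation  E-mail
Sun Decai (孙德才)  College of Information Science and Technology, Bohai University, Jinzhou 121013  sdecai@163.com
Wang Xiaoxia (王晓霞)  Teaching and Research Department of Basic Courses, Bohai University, Jinzhou 121013  wxxsdc@163.com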
Abstract:
      Quickly finding duplicate or similar records in a dataset is a basic problem in big data processing. Similarity join is an effective operation for finding similar data, and MapReduce-based similarity join algorithms have attracted wide attention because they can process large datasets. This paper analyzes the problems that current similarity join algorithms exhibit when performing a self-join, such as self-join redundancy and the complexity of reading the original strings, and proposes an improved MapReduce-based self-join algorithm built on Massjoin. In the filtration stage, the improved algorithm adds a filter condition that eliminates self-join redundant pairs. In the verification stage, it applies redundancy-removal techniques such as generating backward-forward candidate pairs and using combined ids, and it scans the dataset only once when reading the original string contents. Experimental data show that the improved algorithm reduces CPU time in both the filtration stage and the verification stage, which indicates that the proposed improvement strategies are effective.
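To make the notion of self-join redundancy concrete, the following is a minimal in-memory sketch in Python. It is not the paper's algorithm and not Massjoin: it runs on a single machine, uses edit distance with a threshold `tau` as an assumed similarity measure, and the names `edit_distance`, `self_join`, `records` and `tau` are hypothetical. It only illustrates that when a dataset is joined with itself, each pair (r, s) also appears as (s, r) and every record trivially matches itself, so emitting only pairs with id r < s removes that redundancy before the costly verification step.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def self_join(records: dict[int, str], tau: int) -> list[tuple[int, int]]:
    """Return id pairs (r, s), r < s, whose strings are within edit distance tau."""
    results = []
    # combinations() over sorted ids yields each unordered pair exactly once,
    # so neither (s, r) nor (r, r) is ever generated as a candidate.
    for r, s in combinations(sorted(records), 2):
        # Cheap length filter: a length gap above tau rules the pair out.
        if abs(len(records[r]) - len(records[s])) > tau:
            continue
        # Verification: compute the actual edit distance.
        if edit_distance(records[r], records[s]) <= tau:
            results.append((r, s))
    return results


if __name__ == "__main__":
    data = {1: "kitten", 2: "sitten", 3: "sitting", 4: "mapreduce"}
    print(self_join(data, tau=2))
```

For a dataset of n records this ordering condition reduces the candidate space from n × n ordered pairs to n(n − 1)/2 unordered pairs. A distributed implementation would have to enforce the same condition across map and reduce tasks, which this sketch does not attempt.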