计算机科学 ›› 2016, Vol. 43 ›› Issue (9): 27-31.doi: 10.11896/j.issn.1002-137X.2016.09.005

• 目次 • 上一篇    下一篇

基于MapReduce的数据倾斜连接算法

梁俊杰,何利民   

  1. 湖北大学计算机与信息工程学院 武汉430062,湖北大学计算机与信息工程学院 武汉430062
  • 出版日期:2018-12-01 发布日期:2018-12-01
  • 基金资助:
    本文受湖北省自然科学基金重点项目(2015CFA067,2013CFA115),湖北省教育厅科研项目计划(D20151001),武汉市科技攻关计划项目(2013012401010851)资助

Join Algorithm in Skewed Datasets Based on MapReduce

LIANG Jun-jie and HE Li-min   

  • Online:2018-12-01 Published:2018-12-01

摘要: 连接操作是大规模数据集在数据分析应用中最常用的操作,针对MapReduce自身不能有效地处理数据倾斜情况下的连接操作,提出了基于MapReduce的频次分类连接算法。根据数据在连接数据集中出现的频率将整个数据集分为3类,对倾斜数据利用分区算法和广播算法实现数据重分布,以消除数据倾斜的影响;对非倾斜数据采用Hash算法实现数据重分布。重分布后的数据在单节点内即可完成数据连接操作,避免了MapReduce框架下连接操作的跨节点传输代价;同时有效地均衡了MapReduce各节点的任务负载,从而提高了数据倾斜状态下连接操作的效率。通过与传统连接算法的对比,证明了所提算法的有效性和实用性。

关键词: 数据倾斜,MapReduce,连接算法,负载均衡

Abstract: Join operation is the most common operation in data analysis applications with large-scale datasets,and Map-Reduce can not support join operation perfectly in handling data skew problem.MapReduce frequecncy classified join algorithm was proposed,and datasets were classified into three categories according to the appeared data frequency.Data redistribution applying partitioning algorithm and broadcast algorithms eliminate the impact of skewed data.And data redistribution is realized by using hash algorithm for the non-skew data.Join operation can be completed in a single node,avoiding the cost of communications across the nodes under the MapReduce for the redistributed data,and balancing the workload of each node effectively,thereby improves the efficiency of join operations in skewed data.The effectiveness and practicality of the algorithms are proved by the comparison with traditional algorithms.

Key words: Data skew,MapReduce,Join algorithm,Load balancing

[1] Dean J,Ghemawat S.MapReduce:Simplified data processing on large clusters[J].Communications of the ACM,2008,1(1):107-113
[2] YongChul K,Magdalena B,Bill H,et al.A Study of Skew inMapReduce Applications[C]∥Open Cirrus Summit.2011
[3] Viswanath P,Yannis E I.Estimation of Query-Eesult Distribution and its Application in Parallel-Join Load Balancing[C]∥Proceedings of the 22nd VLDB Conference (PVLDB).UMumbai(Bombay),India,1996:448-459
[4] Chen Yong-xu,Chen Meng-jie,Liu Xue-bin,et al.MapReduce Based Aggreate-Join Query Algorithms[J].Journal of Computer Research and Development,2013,0(Suppl):306-311(in Chinese) 陈勇旭,陈梦杰,刘雪冰,等.基于MapReduce的连接聚集查询算法研究[J].计算机研究与发展,2013,50(Suppl):306-311
[5] Song jie,Li Tian-tian,Zhu Zhi-liang,et al.Research on I/O Cost of MapReduce Join[J].Journal of Software,2015,26(6):1438-1456(in Chinese) 宋杰,李甜甜,朱志良,等.MapReduce连接查询的I/O代价研究[J].软件学报,2015,6(6):1438-1456
[6] Slagter K,Hsu C H,Chung Y C,et al.Smart Join:a network-aware multiway join for MapReduce[J].Cluster Computing,2014,7(3):629-641
[7] Hassan M A H,Bamha M.Towards Scalability and Data Skew Handling in GroupBy-Joins using MapReduce Model[J].Procedia Computer Science,2015,51(1):70-79
[8] Yu X,Pekka K,et al.Handling Data Skew in Parallel Joins in Shared-Nothing Systems[C]∥SIGMOD 08.Vancouver,BC,Canada,2008:1043-1052
[9] Fariha A,Stratis D V,Salman N.SAND Join —A skew handling join algorithm for Google’s MapReduce framework[C]∥IEEE 14th International Multitopic Conference(INMIC).Karachi,Pakistan,2011:498-509
[10] David J D,Jeffrey F N,Donovan A S,et al.Practical Skew Handling in Parallel Joins[C]∥Proceedings of the 18th VLDB Conference(VLDB).Vancouver,British Columbia,Canada,1992:27-40
[11] Zhou Jia-shuai,Wang Qi,Gao Jun.An Approach for Load Ba-lancing in MapRedcue via Dynamic Partitioning[J].Journal of Computer Research and Development,2013,50(Suppl):369-377(in Chinese) 周家帅,王琦,高军.一种基于动态划分的MapReduce负载均衡方法[J].计算机研究与发展,2013,0(Suppl):369-377
[12] Gufler B,Augslen N,Reiser A,et al.Load balancing in MapReduce based on scalable cardinality estimates[C]∥Proc of ICDE.Piscataway,NJ,IEEE,2012:522-533
[13] Gufler B,Augsten N,Reiser A,et al.Handling data skew in Map-Reduce[C]∥Proc of CLOSER.Portugal:SciTePress,2011:574-583
[14] http://www.enet.com.cn/article/2012/0401/A20120401990472.shtml

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!