Computer Science ›› 2016, Vol. 43 ›› Issue (9): 27-31.doi: 10.11896/j.issn.1002-137X.2016.09.005

Previous Articles     Next Articles

Join Algorithm in Skewed Datasets Based on MapReduce

LIANG Jun-jie and HE Li-min   

  • Online:2018-12-01 Published:2018-12-01

Abstract: Join operation is the most common operation in data analysis applications with large-scale datasets,and Map-Reduce can not support join operation perfectly in handling data skew problem.MapReduce frequecncy classified join algorithm was proposed,and datasets were classified into three categories according to the appeared data frequency.Data redistribution applying partitioning algorithm and broadcast algorithms eliminate the impact of skewed data.And data redistribution is realized by using hash algorithm for the non-skew data.Join operation can be completed in a single node,avoiding the cost of communications across the nodes under the MapReduce for the redistributed data,and balancing the workload of each node effectively,thereby improves the efficiency of join operations in skewed data.The effectiveness and practicality of the algorithms are proved by the comparison with traditional algorithms.

Key words: Data skew,MapReduce,Join algorithm,Load balancing

[1] Dean J,Ghemawat S.MapReduce:Simplified data processing on large clusters[J].Communications of the ACM,2008,1(1):107-113
[2] YongChul K,Magdalena B,Bill H,et al.A Study of Skew inMapReduce Applications[C]∥Open Cirrus Summit.2011
[3] Viswanath P,Yannis E I.Estimation of Query-Eesult Distribution and its Application in Parallel-Join Load Balancing[C]∥Proceedings of the 22nd VLDB Conference (PVLDB).UMumbai(Bombay),India,1996:448-459
[4] Chen Yong-xu,Chen Meng-jie,Liu Xue-bin,et al.MapReduce Based Aggreate-Join Query Algorithms[J].Journal of Computer Research and Development,2013,0(Suppl):306-311(in Chinese) 陈勇旭,陈梦杰,刘雪冰,等.基于MapReduce的连接聚集查询算法研究[J].计算机研究与发展,2013,50(Suppl):306-311
[5] Song jie,Li Tian-tian,Zhu Zhi-liang,et al.Research on I/O Cost of MapReduce Join[J].Journal of Software,2015,26(6):1438-1456(in Chinese) 宋杰,李甜甜,朱志良,等.MapReduce连接查询的I/O代价研究[J].软件学报,2015,6(6):1438-1456
[6] Slagter K,Hsu C H,Chung Y C,et al.Smart Join:a network-aware multiway join for MapReduce[J].Cluster Computing,2014,7(3):629-641
[7] Hassan M A H,Bamha M.Towards Scalability and Data Skew Handling in GroupBy-Joins using MapReduce Model[J].Procedia Computer Science,2015,51(1):70-79
[8] Yu X,Pekka K,et al.Handling Data Skew in Parallel Joins in Shared-Nothing Systems[C]∥SIGMOD 08.Vancouver,BC,Canada,2008:1043-1052
[9] Fariha A,Stratis D V,Salman N.SAND Join —A skew handling join algorithm for Google’s MapReduce framework[C]∥IEEE 14th International Multitopic Conference(INMIC).Karachi,Pakistan,2011:498-509
[10] David J D,Jeffrey F N,Donovan A S,et al.Practical Skew Handling in Parallel Joins[C]∥Proceedings of the 18th VLDB Conference(VLDB).Vancouver,British Columbia,Canada,1992:27-40
[11] Zhou Jia-shuai,Wang Qi,Gao Jun.An Approach for Load Ba-lancing in MapRedcue via Dynamic Partitioning[J].Journal of Computer Research and Development,2013,50(Suppl):369-377(in Chinese) 周家帅,王琦,高军.一种基于动态划分的MapReduce负载均衡方法[J].计算机研究与发展,2013,0(Suppl):369-377
[12] Gufler B,Augslen N,Reiser A,et al.Load balancing in MapReduce based on scalable cardinality estimates[C]∥Proc of ICDE.Piscataway,NJ,IEEE,2012:522-533
[13] Gufler B,Augsten N,Reiser A,et al.Handling data skew in Map-Reduce[C]∥Proc of CLOSER.Portugal:SciTePress,2011:574-583
[14] http://www.enet.com.cn/article/2012/0401/A20120401990472.shtml

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!