Computer Science, 2018, Vol. 45, Issue (3): 158-164. doi: 10.11896/j.issn.1002-137X.2018.03.025

ORC Metadata Based Reducer Load Balancing Method for Hive Join Queries

WANG Hua-jin, LI Jian-hui, SHEN Zhi-hong and ZHOU Yuan-chun   

  • Online: 2018-03-15 Published: 2018-11-13

Abstract: Load imbalance ranks first among the performance issues in large-scale MapReduce clusters, and Hive join queries are especially prone to triggering it. An effective remedy is to design reducer load-balancing partitioning algorithms that consult a histogram of the key frequency distribution, estimated from the intermediate key-value pairs. Existing key histogram estimation methods monitor and sample the map output in a distributed manner, which generates heavy network traffic and notably delays the start of the shuffle. This paper proposes a novel key histogram estimation method based on ORC metadata, together with a corresponding load-balancing partitioning strategy, for Hive join queries. The proposed approach requires only lightweight computation before the job starts, and therefore imposes no extra load on network traffic or the shuffle. Benchmark tests show that the proposal significantly improves both key histogram estimation and reducer load balancing.

Key words: Load balancing, MapReduce, Hive, Join, Reducer, ORC
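
The general idea in the abstract, choosing a key-to-reducer partitioning by consulting a key frequency histogram, can be illustrated with a minimal sketch. It assumes the histogram is already available as a mapping from join key to estimated record count (for example, one derived from ORC metadata before the job starts). The function names `balance_keys` and `reducer_loads` are hypothetical, and the greedy "heaviest key to the least-loaded reducer" heuristic is a common stand-in, not necessarily the partitioning strategy proposed in the paper.

```python
import heapq
from typing import Dict, List


def balance_keys(histogram: Dict[str, int], num_reducers: int) -> Dict[str, int]:
    """Assign each key to a reducer with a greedy heuristic (illustrative only).

    Keys are visited from most to least frequent, and each one goes to the
    reducer that currently has the smallest estimated load.
    """
    # Min-heap of (current load, reducer index); the least-loaded reducer is on top.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    assignment: Dict[str, int] = {}
    # Sort by descending count; ties are broken by key so the result is deterministic.
    for key, count in sorted(histogram.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return assignment


def reducer_loads(histogram: Dict[str, int],
                  assignment: Dict[str, int],
                  num_reducers: int) -> List[int]:
    """Sum the estimated record counts that each reducer would receive."""
    loads = [0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += histogram[key]
    return loads


# Hypothetical skewed histogram: key "a" is far more frequent than the rest.
example_histogram = {"a": 50, "b": 30, "c": 20, "d": 10, "e": 10}
example_assignment = balance_keys(example_histogram, num_reducers=2)
example_loads = reducer_loads(example_histogram, example_assignment, num_reducers=2)
```

With a hash partitioner the skewed key could land on the same reducer as other heavy keys. The sketch instead uses the histogram to spread the estimated load, which is only as good as the histogram estimate itself.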

