Computer Science ›› 2018, Vol. 45 ›› Issue (3): 158-164. doi: 10.11896/j.issn.1002-137X.2018.03.025


ORC Metadata Based Reducer Load Balancing Method for Hive Join Queries

WANG Hua-jin, LI Jian-hui, SHEN Zhi-hong and ZHOU Yuan-chun   

  • Online: 2018-03-15  Published: 2018-11-13

Abstract: Load imbalance ranks first among the performance issues in large-scale MapReduce clusters, and it is easily triggered by Hive join queries. An effective solution is to design reducer load balancing partitioning algorithms that consult a frequency distribution histogram of the keys, estimated from the intermediate key-value pairs. Existing key histogram estimation methods rely on monitoring and sampling the map output in a distributed way, which generates heavy network traffic and notably delays the start of the shuffle. This paper proposes a novel key histogram estimation method based on ORC metadata, together with a corresponding load balancing partitioning strategy, for Hive join queries. The proposed method needs only some lightweight computation before the job starts, so it imposes no extra load on the network or on the shuffle. Benchmark tests show that the proposed method significantly improves both key histogram estimation and reducer load balancing.

Key words: Load balancing, MapReduce, Hive, Join, Reducer, ORC
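
To illustrate why a key histogram helps reducer load balancing, the following minimal Python sketch compares a hash-style partitioning with a histogram-aware greedy assignment (heaviest key first, to the least-loaded reducer). It is not the partitioning strategy proposed in the paper, whose details are not given in the abstract. The function names, the integer join keys, and the histogram values are hypothetical and chosen only for illustration.

```python
import heapq


def hash_partition(histogram, num_reducers):
    """Baseline: route each key by key modulo reducer count
    (a simple stand-in for hash partitioning of integer keys)."""
    loads = [0] * num_reducers
    for key, count in histogram.items():
        loads[key % num_reducers] += count
    return loads


def greedy_partition(histogram, num_reducers):
    """Histogram-aware assignment: take keys from heaviest to lightest
    and give each one to the reducer with the smallest load so far."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(histogram.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, count in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    loads = [0] * num_reducers
    for key, reducer in assignment.items():
        loads[reducer] += histogram[key]
    return assignment, loads


# Hypothetical estimated histogram: join key -> number of records.
histogram = {0: 900, 1: 50, 2: 40, 3: 800, 4: 30, 5: 20, 6: 700, 7: 10}

hash_loads = hash_partition(histogram, 3)
assignment, greedy_loads = greedy_partition(histogram, 3)
print("hash-style loads:     ", hash_loads)
print("histogram-aware loads:", greedy_loads)
```

In this toy input the three heavy keys collide on one reducer under the hash-style scheme, while the histogram-aware assignment spreads them out. The quality of such an assignment depends on how accurate the estimated histogram is, which is the part the abstract says is derived from ORC metadata before the job starts.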

