Computer Science, 2018, Vol. 45, Issue (3): 158-164. doi: 10.11896/j.issn.1002-137X.2018.03.025

ORC Metadata Based Reducer Load Balancing Method for Hive Join Queries

WANG Hua-jin, LI Jian-hui, SHEN Zhi-hong and ZHOU Yuan-chun   

Online: 2018-03-15 Published: 2018-11-13

Abstract: Load imbalance ranks first among the performance issues in large-scale MapReduce clusters, and it is easily triggered by Hive join queries. An effective solution is to design reducer load-balancing partitioning algorithms that consult the key frequency distribution histogram estimated from the intermediate key-value pairs. Existing key histogram estimation methods rely on monitoring and sampling the map output in a distributed way, which generates heavy network traffic and notably delays the start of the shuffle. This paper proposes a novel key histogram estimation method based on ORC metadata, together with a corresponding load-balancing partitioning strategy, for Hive join queries. The proposed approach requires only lightweight computation before the job starts, and thus imposes no extra load on network traffic or the shuffle. Benchmark tests show that the proposal significantly improves both key histogram estimation and reducer load balancing.
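
To make the idea of histogram-driven partitioning concrete, the following is a minimal sketch only. The abstract does not spell out the paper's partitioning strategy, so the sketch swaps in a generic greedy "heaviest key first" heuristic. It assumes a key histogram has already been estimated (for example from ORC column statistics); the function names `balanced_partition` and `reducer_loads` and the sample histogram are hypothetical.

```python
import heapq


def balanced_partition(histogram, num_reducers):
    """Assign each join key to a reducer using a greedy heuristic.

    Assumption: `histogram` maps key -> estimated record count.
    Keys are visited from heaviest to lightest, and each key goes to
    the reducer with the smallest load accumulated so far.
    """
    # Min-heap of (current_load, reducer_id).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending frequency; ties are broken by key for determinism.
    ordered = sorted(histogram.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, freq in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + freq, reducer))
    return assignment


def reducer_loads(histogram, assignment, num_reducers):
    """Sum the estimated record counts routed to each reducer."""
    loads = [0] * num_reducers
    for key, freq in histogram.items():
        loads[assignment[key]] += freq
    return loads


if __name__ == "__main__":
    # Hypothetical skewed histogram: key "a" dominates.
    hist = {"a": 50, "b": 30, "c": 20, "d": 10, "e": 10}
    plan = balanced_partition(hist, num_reducers=2)
    print(plan)
    print(reducer_loads(hist, plan, num_reducers=2))
```

Because the assignment is computed from the histogram alone, such a plan can be built before the job starts, which matches the abstract's point that no map-side monitoring or sampling is needed. A single key heavier than the average reducer load cannot be balanced by key-level assignment of this kind.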

Key words: Load balancing, MapReduce, Hive, Join, Reducer, ORC

