计算机科学 ›› 2018, Vol. 45 ›› Issue (3): 158-164.doi: 10.11896/j.issn.1002-137X.2018.03.025

• 软件与数据库技术 • 上一篇    下一篇

基于ORC元数据的Hive Join查询Reducer负载均衡方法

王华进,黎建辉,沈志宏,周园春   

  1. 中国科学院计算机网络信息中心 北京100190;中国科学院大学 北京100049,中国科学院计算机网络信息中心 北京100190,中国科学院计算机网络信息中心 北京100190,中国科学院计算机网络信息中心 北京100190
  • 出版日期:2018-03-15 发布日期:2018-11-13
  • 基金资助:
    本文受国家重点研发计划项目:科学大数据管理系统(2016YFB1000600),协同精密定位技术(2016YFB0501900)资助

ORC Metadata Based Reducer Load Balancing Method for Hive Join Queries

WANG Hua-jin, LI Jian-hui, SHEN Zhi-hong and ZHOU Yuan-chun   

  • Online:2018-03-15 Published:2018-11-13

摘要: 负载不均衡问题位列影响大规模MapReduce集群性能因素的首位,而Hive join查询非常容易触发该问题。通用解决方案是基于中间键值对的key频率分布设计能够实现负载均衡的key划分算法。现有工作估算key频率分布时依赖于对map的输出进行监控采样,使得通信开销较大并显著延后了shuffle的启动。针对Hive join查询,提出了基于ORC元数据的key频率分布估计方法和相应的负载均衡key划分方法。该方法具有计算量小、通信开销小、不影响现有shuffle机制的优点。通过基准测试证明了该方法在key频率分布估算效率上的巨大提升及相应的key划分方法对Hive join查询性能的提升。

关键词: 负载均衡,MapReduce,Hive,Join,Reducer,ORC

Abstract: The load imbalance problem ranks first among the performance issues in large-scale MapReduce cluster,and it’s very prone to be triggered by Hive join queries.An effective solution is to design reducer load balancing partitioning algorithms by consulting the key’s frequency distribution histogram estimated from intermediate key-value pairs.The existing works of key histogram estimation rely on monitoring and sampling the output of map in a distributed way,which triggers huge network traffic load and notably delays the start of the shuffle.A novel key histogram estimation method based on ORC metadata and the corresponding load balancing partitioning strategy was proposed for Hive join queries.The proposals only need some light-weight computation before the start of the job,thus imposing no extra loads on network traffics and the shuffle.Benchmarking test proves the proposal’s significant improvement on both the key histogram estimation and the reducer load balancing.

Key words: Load balancing,MapReduce,Hive,Join,Reducer,ORC

[1] KWON Y C,BALAZINSKA M,HOWE B,et al.Skew-resistantparallel processing of feature-extracting scientific user-defined functions[C]∥ACM Symposium on Cloud Computing (SoCC).ACM,2010:75-86.
[2] KE Q,PRABHAKARAN V,XIE Y,et al.Optimizing Data Par-titioning for Data-Parallel Computing[C]∥Proceedings of the 13th USENIX conference on Hot topics in operating systems (HotOS).USENIX,2011:13.
[3] YAN W,XUE Y,MALIN B.Scalable and robust key group size estimation for reducer load balancing in MapReduce[C]∥IEEE International Conference on Big Data.IEEE,2013:156-162.
[4] IBRAHIM S,JIN H,LU L,et al.LEEN:Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud[C]∥IEEE Second International Conference on Cloud Computing Technology and Science.IEEE,2010:17-24.
[5] GUFLER B,AUGSTEN N,REISER A,et al.Load Balancing in MapReduce Based on Scalable Cardinality Estimates[C]∥International Conference on Data Engineerin (ICDE).IEEE Compu-ter Society,2012:522-533.
[6] CHEN Q,YAO J,XIAO Z H.LIBRA:Lightweight Data SkewMitigation in MapReduce[J].IEEE Transactions on Parallel & Distributed Systems,2015,26(9):2520-2533.
[7] WANG Z,CHEN Q,LI Z H,et al, An Incremental Partitioning Strategy for Data Balance on MapReduce[J].Chinese Journal of Computers,2016,39(1):19-35.(in Chinese) 王卓,陈群,李战怀,等.基于增量式分区策略的 MapReduce 数据均衡方法[J].计算机学报,2016,39(1):19-35.
[8] KWON Y,BALAZINSKA M,HOWE B,et al.SkewTune:mitigating skew in mapreduce applications[C]∥ACM SIGMOD International Conference on Management of Data.ACM,2012:25-36.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!