Computer Science, 2015, Vol. 42, Issue (10): 50-56.

• Network and Communication •

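Load Balancing Strategy on MapReduce with Locality-aware

LI Hang-chen, QIN Xiao-lin and SHEN Yao

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  • Online: 2018-11-14  Published: 2018-11-14
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61373015, 61300052), the Specialized Research Fund for the Doctoral Program of Higher Education of the Ministry of Education of China (20103218110017), and the Priority Academic Program Development of Jiangsu Higher Education Institutions.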

Abstract: Existing research on load-balanced scheduling for MapReduce considers neither the distribution of intermediate data nor the cost of network transfer, which leads to extra network traffic and lower system efficiency. To address this problem, this paper proposes a data-locality-aware load balancing strategy. Taking advantage of the new resource management features introduced by YARN, the strategy collects statistics while in-memory data are spilled to local disk during the Map phase, and thereby obtains the distribution of the intermediate data. It then schedules the reduce tasks according to this distribution and the computing capacity of each node, reducing network transfer overhead while keeping the load on each node as balanced as possible. In addition, fine-grained partitioning and adaptive partition splitting are introduced to further improve the performance of the scheduling strategy under data skew. Comparative experiments show that the proposed strategy effectively improves performance and reduces the total network overhead.

Key words: MapReduce, Data locality, Data skew, Load balance
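
The scheduling idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not spell out; it is a simple greedy heuristic written under stated assumptions. The function name `assign_partitions`, the cost model (estimated finish time plus remote transfer time), and the `bandwidth` parameter are all hypothetical. Fine-grained partitioning and adaptive partition splitting are not covered here.

```python
from typing import Dict


def assign_partitions(
    dist: Dict[str, Dict[str, int]],  # partition -> {node: bytes of it held on that node}
    speed: Dict[str, float],          # node -> assumed processing speed (bytes per second)
    bandwidth: float,                 # assumed network bandwidth (bytes per second)
) -> Dict[str, str]:
    """Greedily map each reduce partition to a node (illustrative heuristic only)."""
    load = {node: 0 for node in speed}  # bytes already assigned to each node
    assignment: Dict[str, str] = {}

    # Place the largest partitions first so that small ones can fill the gaps.
    ordered = sorted(dist, key=lambda p: sum(dist[p].values()), reverse=True)
    for part in ordered:
        total = sum(dist[part].values())

        def cost(node: str) -> float:
            # Bytes that would have to cross the network if `node` runs this partition.
            remote = total - dist[part].get(node, 0)
            # Hypothetical cost: time to process the node's load plus transfer time.
            return (load[node] + total) / speed[node] + remote / bandwidth

        # Iterate nodes in sorted order so that ties are broken deterministically.
        best = min(sorted(speed), key=cost)
        assignment[part] = best
        load[best] += total
    return assignment
```

With two equally fast nodes, a partition whose data sit mostly on one node is placed there; when one node already carries a large load, the cost model sends the next partition elsewhere even though that requires a network transfer. The relative weight of balance against locality is set entirely by the assumed `speed` and `bandwidth` values.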

