面向MapReduce的中间数据传输流水线优化机制

doi:10.11896/jsjkx.191000103

计算机科学 ›› 2021, Vol. 48 ›› Issue (2): 41-46.doi: 10.11896/jsjkx.191000103

• 新型分布式计算技术与系统^* • 上一篇下一篇

面向MapReduce的中间数据传输流水线优化机制

张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚

浙江工业大学计算机科学与技术学院杭州310023

收稿日期:2019-10-16 修回日期:2019-12-06 出版日期:2021-02-15 发布日期:2021-02-04
通讯作者: 肖刚(xg@zjut.edu.cn)
作者简介:zym@zjut.edu.cn
基金资助:
计算机体系结构国家重点实验室开放课题(CARCH201804)

Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework

ZHANG Yuan-ming, YU Jia-rui, JIANG Jian-bo, LU Jia-wei, XIAO Gang

College of Computer Science and Technology,Zhejiang University of Technology,Hangzhou 310023,China

Received:2019-10-16 Revised:2019-12-06 Online:2021-02-15 Published:2021-02-04
About author:ZHANG Yuan-ming,born in 1977,Ph.D,associate professor.His main research interests include parallel computing and cloud computing.
XIAO Gang,born in 1965,Ph.D,professor.His main research interests include cloud computing and intelligent information system.
Supported by:
The Open Projects Funding of State Key Lab of Computer Architecture(CARCH201804),ICT,CAS.

摘要/Abstract

摘要： MapReduce是一种适用于大数据处理的重要并行计算框架,通过在大量集群节点上并行执行多个任务,极大地提高了数据的处理性能。然而,由于中间数据需要等到Mapper任务完成之后才能被发送给Reducer任务,由此导致的大量传输延迟成为MapReduce框架性能的重要瓶颈。为此,文中提出了一种面向MapReduce的中间数据传输流水线优化机制,将有效计算与中间数据传输解耦,以流水线的方式重叠执行各个阶段,有效隐藏数据传输开销。文中还给出了中间数据传输流水线执行机制和实现策略,包括流水线划分、数据细分、数据归并和数据传输粒度等。在公开数据集上对所提中间数据传输流水线优化机制进行了评价,当Shuffle数据量较大时,该优化机制比默认框架的整体性能提高了60.2%。

关键词: MapReduce框架, 传输延迟, 流水线, 溢写文件归并, 中间数据传输

Abstract: MapReduce is an important parallel computing framework for large data processing,which greatly improves the performance of data processing by performing multiple tasks in parallel on a large number of cluster nodes.However,since the intermediate data needs to wait until the Mapper task is completed,it can be sent to the Reducer task.The massive transmission delay becomes an important bottleneck of the MapReduce framework performance.To this end,an intermediate data transmission pipeline mechanism for MapReduce is proposed.It decouples the effective computation from intermediate data transmission,overlaps each stage in pipeline mode,and effectively hides data transmission delay.The execution mechanism and implementation strategy of the approach are given,including pipeline partition,data subdivision,data merging and data transmission granularity.The proposed mechanism is evaluated on public data sets.When the Shuffle data volume is large,the overall performance improves by 60.2% compared with the default framework.

Key words: Intermediate data, MapReduce framework, Overflow file merging, Pipeline, Transmission delay

中图分类号:

TP391

张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚. 面向MapReduce的中间数据传输流水线优化机制[J]. 计算机科学, 2021, 48(2): 41-46. https://doi.org/10.11896/jsjkx.191000103

ZHANG Yuan-ming, YU Jia-rui, JIANG Jian-bo, LU Jia-wei, XIAO Gang. Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework[J]. Computer Science, 2021, 48(2): 41-46. https://doi.org/10.11896/jsjkx.191000103

参考文献

[1] DEAN J,GHEMAWAT S.MapReduce:simplified data pro-cessing on large clusters[J].Communications of the ACM,2008,51(1):107-113.
[2] WANG S,WANG H J,QIN X P.Architecting Big Data:Challenges,Studies and Forecasts[J].Chinese Journal of Compu-ters,2011,34(10):1741-1752.
[3] ZAHARIA M,CHOWDHURY M,FRANKLIN M,et al.Spark:cluster computing with working sets[C]//Usenix Conference on Hot Topics in Cloud Computing.Berkeley:USENIX Association,2010:1-12.
[4] XUN Y L,ZHANG J F,QIN X.Data Placement Strategy for MapReduce Cluster Environment[J].Jounary of Software,2015(8):2056-2073.
[5] AHMAD F,LEE S,THOTTETHODI M,et al.MapReducewith communication overlap (MaRCO)[J].Journal of Parallel &Distributed Computing,2013,73(5):608-620.
[6] YU W,WANG Y,QUE X.Design and Evaluation of Network-Levitated Merge for Hadoop Acceleration[J].IEEE Transactions on Parallel & Distributed Systems,2014,25(3):602-611.
[7] CHOWDHURY M,ZAHARIA M,MA J,et al.Managing data transfers in computer clusters with orchestra[J].ACM SIGCOMM Computer Communication Review,2011,41(4):98-109.
[8] LI J J,WU J,YANG X L,et al.Optimizing MapReduce Based on Locality of K-V Pairs and Overlap between Shuffle and Local Reduce[C]//IEEE International Conference on Parallel Processing.2015:939-948.
[9] RAHMAN M W,ISLAM N S,LU X,et al.A Comprehensive Study of MapReduce Over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters[J].IEEE Transactions on Parallel & Distributed Systems,2017,PP(99):1-1.
[10] ARSLAN E,SHEKHAR M,KOSAR T.Locality and Network-Aware Reduce Task Scheduling for Data-Intensive Applications[C]//International Workshop on Data-intensive Computing in the Clouds.IEEE,2014.
[11] CAO Y,WANG H.Communication optimization for interme-diate data of MapReduce computing model[J].Jounary of Computer Applications,2018,38(4):1078-1083.
[12] LIU S,WANG H,LI B.Optimizing Shuffle in Wide-Area Data Analytics[C]//2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).IEEE Computer Socie-ty,2017.
[13] SEO S,JANG I,WOO K,et al.HPMR:Prefetching and Pre-Shuffling in Shared MapReduce Computation Environment[C]//Proc of the IEEE International Conference on Cluster Computing and Workshops.2009:1-8.
[14] ZHANG S,HAN J,LIU Z,et al.Accelerating MapReduce with Distributed Memory Cache [C]//International Conference on Parallel & Distributed Systems.2009:472-478.
[15] WANG B,JIANG J,YANG G.mpCache: Accelerating MapReduce with Hybrid Storage System on Many-Core Clusters [C]//NPC 2014.2014:220-233.
[16] LI J,LIN X,CUI X,et al.Improving the Shuffle of Hadoop Map-Reduce[C]//IEEE International Conference on Cloud Computing Technology & Science.IEEE,2014.
[17] QI K Y,HAN Y B,ZHAO Z F.MapReduce Intermediate Result Cache for Concurrent Data Stream Processing[J].Journal of Computer Research and Development,2013,50(1):111-121.
[18] WANG J,QIU M,GUO B,et al.Phase-Reconfigurable Shuffle Optimization for Hadoop MapReduce[J].IEEE Transactions on Cloud Computing,2015,8(2):418-431.
[19] TAN J,CHIN A,HU Z Z,et al.DynMR: dynamic MapReduce with ReduceTask interleaving and MapTask backfilling[C]//Proc of the Ninth European Conference on Computer Systems.ACM,2014:1-14.
[20] QIN X P,WANG H J,DU X Y.Big Data Analysis-Computition and Symbiosis of RDBMS and MapReduce[J].Journal of Software,2012,23(1):32-45.
[21] KAMBATLA K,KOLLIAS G,KUMAR V,et al.Trends in Big Data Analytics [J].Journal of Parallel & Distributed Computing,2014,74(7):2561-2573.
[22] HO L Y,WU J J,LIU P.Optimal Algorithms for Cross-Rack Communication Optimization in MapReduce Framework [C]//IEEE International Conference on Cloud Computing.2011.
[23] YAN W M,WANG W M.Data Structure[M].Beijing:Tsinghua University Press,2013:297-304.

相关文章 15

[1]	刘卫明, 安冉, 毛伊敏. 基于聚类和WOA的并行支持向量机算法 Parallel Support Vector Machine Algorithm Based on Clustering and WOA 计算机科学, 2022, 49(7): 64-72. https://doi.org/10.11896/jsjkx.210500040
[2]	傅思清, 黎铁军, 张建民. 面向粒子输运程序加速的体系结构设计 Architecture Design for Particle Transport Code Acceleration 计算机科学, 2022, 49(6): 81-88. https://doi.org/10.11896/jsjkx.210600179
[3]	王国澎, 杨剑新, 尹飞, 蒋生健. 负载均衡的处理器运算资源分配方法 Computing Resources Allocation with Load Balance in Modern Processor 计算机科学, 2020, 47(8): 41-48. https://doi.org/10.11896/jsjkx.191000148
[4]	陈晓杰,周清雷,李斌. 基于FPGA的7-Zip加密文档高能效口令恢复方法 Energy-efficient Password Recovery Method for 7-Zip Document Based on FPGA 计算机科学, 2020, 47(1): 321-328. https://doi.org/10.11896/jsjkx.190100027
[5]	吴琪, 王兴伟, 黄敏. 基于SDN的OpenFlow交换机数据包流水线处理机制 OpenFlow Switch Packets Pipeline Processing Mechanism Based on SDN 计算机科学, 2018, 45(10): 295-299. https://doi.org/10.11896／j.issn.1002-137X.2018.10.055
[6]	张少楠,邱柯妮,张伟功,王晶,郑佳欣,白瑞英,朱晓燕. 基于排队论的UM-BUS总线性能建模与评估 Queuing Theory-guided Performance Evaluation on Reconfigurable High-speed Device Connected Bus 计算机科学, 2017, 44(Z6): 504-509. https://doi.org/10.11896/j.issn.1002-137X.2017.6A.112
[7]	刘翱,邓旭东,李维刚. 基于自适应控制参数的改进水波优化算法 Improved Water Wave Optimization Algorithm with Adaptive Control Parameters 计算机科学, 2017, 44(7): 203-209. https://doi.org/10.11896/j.issn.1002-137X.2017.07.036
[8]	赵月,任永功,刘洋. 基于MapReduce的改进的Apriori算法及其应用研究 Improved Apriori Algorithm and Its Application Based on MapReduce 计算机科学, 2017, 44(6): 250-254. https://doi.org/10.11896/j.issn.1002-137X.2017.06.043
[9]	都志辉,林璋熙,顾彦祺,Eric O.LEBIGOT,郭翔宇. 引力波cWB处理流水线的GPU加速 GPU Accelerated cWB Pipeline for Gravitational Waves Discovery 计算机科学, 2017, 44(10): 26-32. https://doi.org/10.11896/j.issn.1002-137X.2017.10.005
[10]	张丽红,余世明. 求解置换流水线调度问题的改进萤火虫优化算法 Improved GSO Algorithm for Solving PFSP 计算机科学, 2016, 43(8): 240-243. https://doi.org/10.11896/j.issn.1002-137X.2016.08.048
[11]	尹孟嘉,许先斌,熊曾刚,张涛. GPU矩阵乘法的性能定量分析模型 Quantitative Performance Analysis Model of Matrix Multiplication Based on GPU 计算机科学, 2015, 42(12): 13-17.
[12]	王卓薇,程良伦,赵武清. 基于GPU的并行计算性能分析模型 Parallel Computation Performance Analysis Model Based on GPU 计算机科学, 2014, 41(1): 31-38.
[13]	周季华,叶春明,盛晓华. 基于智能水滴算法置换流水线调度问题的研究 Research on Permutation Flow-shop Scheduling Problem by Intelligent Water Drop Algorithm 计算机科学, 2013, 40(9): 250-253.
[14]	刘莹,谷文祥,李向涛. 置换流水线车间调度问题的研究 Research on Permutation Flow-shop Scheduling Problem 计算机科学, 2013, 40(11): 1-7.
[15]	陈德华,周蒙,孙延青,郑亮亮. MR-GSpar:一种基于MapReduce的大图稀疏化算法 MR-GSpar:A Distributed Large Graph Sparsification Algorithm Based on MapReduce 计算机科学, 2013, 40(10): 190-193.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

面向MapReduce的中间数据传输流水线优化机制

Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 15

Metrics

本文评价

推荐阅读 0