Computer Science ›› 2021, Vol. 48 ›› Issue (2): 41-46. doi: 10.11896/jsjkx.191000103

• New Distributed Computing Technologies and Systems •

Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework

ZHANG Yuan-ming, YU Jia-rui, JIANG Jian-bo, LU Jia-wei, XIAO Gang   

  1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
  • Received: 2019-10-16  Revised: 2019-12-06  Online: 2021-02-15  Published: 2021-02-04
  • About author: ZHANG Yuan-ming, born in 1977, Ph.D., associate professor. His main research interests include parallel computing and cloud computing.
    XIAO Gang, born in 1965, Ph.D., professor. His main research interests include cloud computing and intelligent information systems.
  • Supported by:
    The Open Projects Funding of State Key Lab of Computer Architecture (CARCH201804), ICT, CAS.

Abstract: MapReduce is an important parallel computing framework for big data processing; it greatly improves data-processing performance by executing many tasks in parallel on large clusters of nodes. However, because intermediate data cannot be sent to the Reducer tasks until the corresponding Mapper tasks have completed, the resulting massive transmission delay becomes an important performance bottleneck of the MapReduce framework. To address this, an intermediate data transmission pipeline mechanism for MapReduce is proposed. It decouples effective computation from intermediate data transmission, overlaps the stages in pipeline fashion, and effectively hides the data transmission delay. The execution mechanism and implementation strategy of the approach are presented, covering pipeline partitioning, data subdivision, data merging, and data transmission granularity. The proposed mechanism is evaluated on public data sets; when the Shuffle data volume is large, overall performance improves by 60.2% compared with the default framework.
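To make the overlap described in the abstract concrete, the following is a minimal, self-contained Java sketch (not the paper's implementation): a map-side worker pushes chunks of intermediate output into a bounded queue while a separate sender thread transmits them, so map computation and Shuffle transfer proceed in parallel instead of the transfer waiting for the whole map task to finish. The class name PipelinedShuffleSketch, the 64 KB chunk size, and the partition count are illustrative assumptions only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Illustrative sketch: overlapping map-side computation with
 * intermediate-data transmission via a bounded producer-consumer queue.
 * Names, sizes, and the transfer stub are hypothetical.
 */
public class PipelinedShuffleSketch {

    /** A chunk of serialized intermediate key-value pairs for one reduce partition. */
    record Chunk(int partition, byte[] payload) {}

    /** Sentinel marking the end of the map output stream. */
    private static final Chunk END = new Chunk(-1, new byte[0]);

    /** Hypothetical transmission granularity: 64 KB per chunk. */
    private static final int CHUNK_BYTES = 64 * 1024;

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Chunk> queue = new ArrayBlockingQueue<>(16);

        // Sender thread: ships each chunk as soon as it becomes available.
        Thread sender = new Thread(() -> {
            try {
                while (true) {
                    Chunk c = queue.take();
                    if (c == END) break;
                    // Placeholder for a real network transfer to the reducer node.
                    System.out.printf("sent %d bytes for partition %d%n",
                            c.payload().length, c.partition());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();

        // "Mapper" loop: produce intermediate output and hand it off in chunks,
        // so transmission overlaps with the remaining map computation.
        int numPartitions = 4;
        for (int i = 0; i < 10; i++) {
            byte[] intermediate = computeIntermediate(i);   // map-side work
            int partition = i % numPartitions;              // pipeline partition
            queue.put(new Chunk(partition, intermediate));  // blocks if sender lags
        }
        queue.put(END);
        sender.join();
    }

    /** Stand-in for real map computation producing up to CHUNK_BYTES of output. */
    private static byte[] computeIntermediate(int i) {
        return new byte[Math.min(CHUNK_BYTES, (i + 1) * 1024)];
    }
}
```

The bounded queue also hints at the transmission-granularity trade-off mentioned in the abstract: smaller chunks let transfers start earlier, while larger chunks reduce per-chunk overhead.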

Key words: Intermediate data, MapReduce framework, Overflow file merging, Pipeline, Transmission delay

CLC Number: TP391