Implementation and Optimization of Apache Spark Cache System Based on Mixed Memory

Computer Science, 2023, Vol. 50, Issue (6): 10-21. doi: 10.11896/jsjkx.220900261

WEI Sen, ZHOU Haoran, HU Chuang, CHENG Dazhao

School of Computer Science, Wuhan University, Wuhan 430072, China

Received: 2022-09-28 Revised: 2022-11-03 Online: 2023-06-15 Published: 2023-06-06
Corresponding author: CHENG Dazhao (dcheng@whu.edu.cn)
About author: WEI Sen (weisen@whu.edu.cn), born in 1999, postgraduate. His main research interests include big data systems and distributed systems. CHENG Dazhao, born in 1984, Ph.D, professor, is a member of China Computer Federation. His main research interests include big data and cloud computing.
Supported by:
Zhejiang Lab Open Research Project (K2022PI0AB01) and Special Fund of Hubei Luojia Laboratory (220100016).

Abstract

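Abstract (from the Chinese edition): As data volumes surge in the big data era, in-memory computing frameworks have advanced considerably. Apache Spark, the mainstream in-memory computing framework, caches intermediate results in memory and thereby greatly speeds up data processing. Meanwhile, non-volatile memory (NVM), which combines fast reads and writes with large capacity, shows great promise for in-memory computing, so building a hybrid Spark cache system from DRAM and NVM has become a feasible option. This paper proposes a Spark cache system based on DRAM-NVM hybrid memory. The system adopts the flat hybrid cache model as its design scheme, designs a dedicated data structure for the cache block management system, and presents an overall architecture for a hybrid cache system suited to Spark. In addition, to keep frequently accessed cache blocks in the DRAM cache, a hybrid cache management strategy based on the minimum reuse cost of cache blocks is proposed. The strategy first obtains the future reuse count of each RDD from the DAG information; cache blocks with higher future reuse counts are kept in the DRAM cache preferentially, and the migration cost is taken into account when cache blocks are migrated. Experiments show that the DRAM-NVM hybrid cache improves performance by 53.06% on average over the original cache system, and that, for the same hybrid memory, the proposed strategy improves on the default cache strategy by 35.09% on average. Moreover, using only 1/4 DRAM and 3/4 NVM as the cache, the hybrid system designed in this paper reaches about 79% of the performance of an all-DRAM cache.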

Keywords: Spark, cache management strategy, NVM, hybrid memory
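
The abstract states that the future reuse count of each RDD is obtained from the DAG information. The following is a minimal sketch of that idea, not the paper's implementation: the `jobs` structure, the RDD names, and the function `future_reuse` are hypothetical, and the sketch assumes the DAG has already been flattened into an ordered list of stages, each listing the cached RDDs it reads.

```python
from collections import Counter

# Hypothetical flattened DAG: each job is a list of stages in execution order,
# and each stage is the list of cached RDD ids that stage reads.
jobs = [
    [["rdd_a"], ["rdd_a", "rdd_b"]],   # job 0: stage 0, stage 1
    [["rdd_b"], ["rdd_a", "rdd_c"]],   # job 1: stage 0, stage 1
]


def future_reuse(jobs, current_job=0, current_stage=0):
    """Count, per RDD, how many stages after the current one still read it."""
    counts = Counter()
    for j, stages in enumerate(jobs):
        for s, reads in enumerate(stages):
            # Tuple comparison orders stages by (job index, stage index).
            if (j, s) > (current_job, current_stage):
                counts.update(set(reads))  # one reuse per stage per RDD
    return counts


at_start = future_reuse(jobs)            # position: job 0, stage 0
near_end = future_reuse(jobs, 1, 0)      # position: job 1, stage 0
```

Under these assumptions, an RDD whose count drops to zero is no longer needed by any remaining stage, while RDDs with higher counts are the natural candidates for the DRAM tier.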

Abstract: With the growth of data scale in the big data era, in-memory computing frameworks have developed significantly. Apache Spark, a mainstream in-memory computing framework, uses memory to cache intermediate results, which greatly improves data processing performance. At the same time, non-volatile memory (NVM), with its fast read and write performance, shows great promise for in-memory computing, so building Spark's cache from a mix of DRAM and NVM is a promising option. This paper proposes a Spark cache system based on DRAM-NVM hybrid memory. The system adopts the flat hybrid cache model as its design scheme, designs a dedicated data structure for the cache block management system, and proposes an overall architecture for the hybrid cache system for Spark. In addition, to keep frequently accessed cache blocks in the DRAM cache, a hybrid cache management strategy based on the minimum reuse cost of cache blocks is proposed. The strategy first obtains the future reuse count of each RDD from the DAG information, stores cache blocks with high future reuse counts in the DRAM cache first, and takes the migration cost into account when a cache block is migrated. Experiments show that the DRAM-NVM hybrid cache improves performance by 53.06% on average compared with the original cache system, and that, for the same hybrid memory, the proposed strategy improves on the default cache strategy by 35.09% on average. Moreover, the hybrid system designed in this paper needs only 1/4 DRAM and 3/4 NVM as the cache to reach 85.49% of the all-DRAM cache in terms of running time.

Key words: Spark, Cache management strategy, NVM, Hybrid memory
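
The abstract says that blocks with high future reuse are kept in DRAM first and that migration cost is considered when a block is moved. One way to read this is sketched below under stated assumptions: the per-MB cost constants, the `Block` fields, and the greedy rule in `plan_promotions` are all hypothetical and are not taken from the paper. A block is promoted from NVM to DRAM only if the read time saved over its remaining reuses exceeds the one-time cost of copying it.

```python
from dataclasses import dataclass

# Hypothetical per-MB costs in arbitrary time units (not measured values).
DRAM_READ, NVM_READ, MIGRATE = 1.0, 3.0, 4.0


@dataclass
class Block:
    name: str
    size_mb: int
    reuse: int   # future reuse count, e.g. derived from the DAG
    tier: str    # "DRAM" or "NVM"


def promotion_gain(b):
    """Read time saved by serving future reuses from DRAM, minus the copy cost."""
    saved = b.reuse * b.size_mb * (NVM_READ - DRAM_READ)
    return saved - b.size_mb * MIGRATE


def plan_promotions(blocks, dram_free_mb):
    """Greedily pick NVM blocks to promote, most-reused first."""
    promoted = []
    candidates = sorted(
        (b for b in blocks if b.tier == "NVM"),
        key=lambda b: b.reuse,
        reverse=True,
    )
    for b in candidates:
        # Skip blocks whose migration would not pay off or that do not fit.
        if promotion_gain(b) > 0 and b.size_mb <= dram_free_mb:
            dram_free_mb -= b.size_mb
            promoted.append(b.name)
    return promoted


blocks = [
    Block("hot", 64, 5, "NVM"),
    Block("big", 128, 3, "NVM"),
    Block("warm", 64, 2, "NVM"),
    Block("cold", 32, 1, "NVM"),
]
small_plan = plan_promotions(blocks, dram_free_mb=128)
large_plan = plan_promotions(blocks, dram_free_mb=192)
```

With these made-up costs, a block needs more than two future reuses before promotion pays for itself, so `warm` and `cold` stay in NVM regardless of free DRAM. A fuller policy would also decide which DRAM blocks to demote, which this sketch leaves out.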

CLC Number:

TP311.13

WEI Sen, ZHOU Haoran, HU Chuang, CHENG Dazhao. Implementation and Optimization of Apache Spark Cache System Based on Mixed Memory[J]. Computer Science, 2023, 50(6): 10-21. https://doi.org/10.11896/jsjkx.220900261
