Computer Science, 2023, Vol. 50, Issue (6): 10-21. doi: 10.11896/jsjkx.220900261
魏森, 周浩然, 胡创, 程大钊
WEI Sen, ZHOU Haoran, HU Chuang, CHENG Dazhao
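Abstract: With the surge in data volume in the big data era, in-memory computing frameworks have advanced considerably. Apache Spark, the mainstream in-memory computing framework, caches intermediate results in memory, which greatly speeds up data processing. Meanwhile, non-volatile memory (NVM), which offers fast read/write speeds and large capacity, shows great promise for in-memory computing, making a hybrid Spark cache built from DRAM and NVM a feasible option. This paper proposes a Spark cache system based on DRAM-NVM hybrid memory. The system adopts a flat hybrid cache model, designs dedicated data structures for the cache block management system, and presents an overall architecture for a hybrid cache system suited to Spark. In addition, to keep frequently accessed cache blocks in the DRAM cache, the paper proposes a hybrid cache management policy based on the minimum reuse cost of cache blocks. The policy first obtains the future reuse count of each RDD from DAG information; cache blocks with higher future reuse counts are preferentially kept in the DRAM cache, and migration cost is taken into account when blocks are migrated. Experiments show that the DRAM-NVM hybrid cache improves performance by 53.06% on average over the original cache system, and that, for the same hybrid memory, the proposed policy improves on the default cache policy by 35.09% on average. Moreover, using only 1/4 DRAM and 3/4 NVM as cache, the proposed hybrid system reaches about 79% of the performance of an all-DRAM cache.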
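The placement idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual cost formula or data structures, so everything here is an assumption made for illustration: the `Block` fields, the per-MB cost constants, the `promotion_gain` formula, and the greedy selection in `plan_promotions` are all hypothetical. The sketch only shows the general shape of the policy: blocks with more future reuses (as derived from the DAG) are preferred for DRAM, and a block is moved only when the expected saving outweighs the migration cost.

```python
from dataclasses import dataclass

# Hypothetical per-MB costs in arbitrary units; illustrative, not measured values.
DRAM_READ_COST = 1.0
NVM_READ_COST = 3.0
MIGRATION_COST = 4.0  # assumed cost of moving 1 MB from NVM to DRAM


@dataclass
class Block:
    block_id: str
    size_mb: int
    future_reuses: int  # assumed to come from DAG analysis of the owning RDD
    tier: str           # "DRAM" or "NVM"


def promotion_gain(block: Block) -> float:
    """Expected read-cost saving from serving future reuses out of DRAM,
    minus the one-off migration cost (hypothetical formula)."""
    saving = block.future_reuses * block.size_mb * (NVM_READ_COST - DRAM_READ_COST)
    return saving - block.size_mb * MIGRATION_COST


def plan_promotions(blocks: list[Block], dram_free_mb: int) -> list[str]:
    """Greedily choose NVM blocks to move into DRAM, most future reuses first.
    A block is chosen only if it fits and its promotion gain is positive."""
    candidates = sorted(
        (b for b in blocks if b.tier == "NVM"),
        key=lambda b: b.future_reuses,
        reverse=True,
    )
    promoted = []
    for b in candidates:
        if b.size_mb <= dram_free_mb and promotion_gain(b) > 0:
            promoted.append(b.block_id)
            dram_free_mb -= b.size_mb
    return promoted
```

With these assumed constants, a block that will be reused only once is never promoted, because its saving of 2 units per MB is smaller than the 4 units per MB it costs to migrate. This is the kind of trade-off that accounting for migration cost is meant to capture.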