Computer Science ›› 2023, Vol. 50 ›› Issue (6): 10-21. doi: 10.11896/jsjkx.220900261

• High Performance Computing •

Implementation and Optimization of Apache Spark Cache System Based on Mixed Memory

WEI Sen, ZHOU Haoran, HU Chuang, CHENG Dazhao   

  1. School of Computer Science, Wuhan University, Wuhan 430072, China
  • Received: 2022-09-28 Revised: 2022-11-03 Online: 2023-06-15 Published: 2023-06-06
  • About author: WEI Sen, born in 1999, postgraduate. His main research interests include big data systems and distributed systems. CHENG Dazhao, born in 1984, Ph.D., professor, is a member of China Computer Federation. His main research interests include big data and cloud computing.
  • Supported by:
    Zhejiang Lab Open Research Project (K2022PI0AB01) and Special Fund of Hubei Luojia Laboratory (220100016).

Abstract: As data volumes grow in the “big data era”, in-memory computing frameworks have developed rapidly. Apache Spark, the mainstream in-memory computing framework, caches intermediate results in memory, which greatly improves data processing performance. Meanwhile, non-volatile memory (NVM), with its fast read and write performance, shows great promise for in-memory computing, which makes building Spark's cache from a mix of DRAM and NVM attractive. This paper proposes a Spark cache system based on DRAM-NVM hybrid memory. It adopts the flat hybrid cache model as the design scheme, designs a dedicated data structure for the cache block management system, and presents the overall architecture of the hybrid cache system for Spark. In addition, to keep frequently accessed cache blocks in the DRAM cache, a hybrid cache management strategy based on the minimum reuse cost of cache blocks is proposed. The future reuse count of each RDD is first obtained from the DAG information; cache blocks with high future reuse counts are stored in the DRAM cache first, and the migration cost is taken into account whenever a cache block is migrated. Experiments show that the DRAM-NVM hybrid cache improves performance by 53.06% on average compared with the original cache system, and that the proposed strategy improves performance by 35.09% on average over the default cache strategy on the same hybrid memory. Moreover, using only 1/4 DRAM and 3/4 NVM as the cache, the hybrid system achieves 85.49% of the running-time performance of an all-DRAM cache.
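
The placement idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` type, the function names, the greedy placement rule, and the per-MB cost constants are all hypothetical and chosen only to make the example concrete. It counts future reuse from a toy DAG, fills DRAM with the most-reused blocks first, and promotes a block from NVM to DRAM only when the expected read savings exceed the copy cost.

```python
from dataclasses import dataclass

# Hypothetical per-MB access costs in arbitrary units (not from the paper).
DRAM_READ, DRAM_WRITE = 1.0, 1.0
NVM_READ = 4.0


@dataclass(frozen=True)
class Block:
    rdd: str      # id of the RDD this cache block belongs to
    size_mb: int  # block size in MB


def future_reuse(dag, done):
    """Count, for each RDD, how many not-yet-computed RDDs read it.

    `dag` maps an RDD id to the list of parent RDD ids it is computed from;
    `done` is the set of RDD ids that have already been computed.
    """
    counts = {}
    for child, parents in dag.items():
        if child in done:
            continue
        for parent in parents:
            counts[parent] = counts.get(parent, 0) + 1
    return counts


def place(blocks, reuse, dram_capacity_mb):
    """Greedy placement: blocks with the most future reuse go to DRAM first."""
    dram, nvm, free = [], [], dram_capacity_mb
    ordered = sorted(blocks, key=lambda b: reuse.get(b.rdd, 0), reverse=True)
    for block in ordered:
        if reuse.get(block.rdd, 0) > 0 and block.size_mb <= free:
            dram.append(block)
            free -= block.size_mb
        else:
            nvm.append(block)
    return dram, nvm


def worth_promoting(block, reuse):
    """Promote NVM -> DRAM only if the read savings exceed the copy cost."""
    saving = reuse.get(block.rdd, 0) * block.size_mb * (NVM_READ - DRAM_READ)
    cost = block.size_mb * (NVM_READ + DRAM_WRITE)
    return saving > cost


# Toy DAG: C is computed from A and B, D from C, E from A and C; B is done.
dag = {"B": ["A"], "C": ["A", "B"], "D": ["C"], "E": ["A", "C"]}
reuse = future_reuse(dag, done={"B"})
blocks = [Block("A", 64), Block("B", 64), Block("C", 64)]
dram, nvm = place(blocks, reuse, dram_capacity_mb=128)
```

With these assumed costs, a block must be reused at least twice before a promotion pays for itself, which is one simple way to express "the migration cost is considered" in code.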

Key words: Spark, Cache management strategy, NVM, Hybrid memory

CLC Number: 

  • TP311.13