Computer Science, 2017, Vol. 44, Issue (6): 31-35. doi: 10.11896/j.issn.1002-137X.2017.06.005

• 2016 National Academic Conference on Information Storage Technology •

Research on Memory Management and Cache Replacement Policies in Spark

MENG Hong-tao, YU Song-ping, LIU Fang and XIAO Nong

  1. College of Computer, National University of Defense Technology, Changsha 410072, China
  • Online: 2018-11-13 Published: 2018-11-13
  • Funding:
    This work was supported by the sub-project "Parallel Processing Systems and Research Based on In-Memory Computing" of the 863 Program project "Key Technologies and Systems of In-Memory Computing for Big Data".

Abstract: Spark is a big data processing framework based on the Map-Reduce model. It makes full use of cluster memory and thereby accelerates data processing. Spark divides memory into regions by function (Shuffle Memory, Storage Memory, and Unroll Memory), and each region has its own usage characteristics. The paper first tests and analyzes the usage characteristics of Shuffle Memory and Storage Memory. The RDD (Resilient Distributed Dataset) is the most important abstraction in Spark and can be cached in cluster memory; when memory is insufficient, Spark must evict some RDD partitions to make room for new ones. The paper then proposes a new cache replacement policy, DWRP (Distributed Weight Replacement Policy), which computes a weight for each RDD partition from its time stored in memory, its size, and its frequency of use, and then selects the partitions to evict according to the distribution characteristics of the RDD. Finally, the performance of several cache replacement policies is tested and analyzed.

Key words: Big data, Spark memory management, RDD cache, Cache replacement policies
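
The weight-based eviction idea summarized in the abstract can be illustrated with a small sketch. The abstract does not give the DWRP formula, so the `weight` function below is a hypothetical stand-in that merely combines the three named factors (time in memory, size, frequency of use); the `Partition` class and `select_victims` helper are likewise illustrative names and not part of Spark or of the paper. The sketch also does not model the distribution-aware selection step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Bookkeeping for one cached RDD partition (illustrative fields only)."""
    rdd_id: int
    index: int
    size_bytes: int
    cached_at: float  # time the partition was stored, in seconds
    use_count: int    # number of reads since it was cached


def weight(p: Partition, now: float) -> float:
    """Hypothetical weight, NOT the paper's formula.

    Assumption: frequent use raises the weight, while a larger size and a
    longer residency lower it, so low-weight partitions are evicted first.
    """
    age = max(now - p.cached_at, 1.0)  # clamp to avoid division by zero
    return p.use_count / (age * p.size_bytes)


def select_victims(cached, bytes_needed: int, now: float):
    """Pick the lowest-weight partitions until enough bytes would be freed."""
    victims, freed = [], 0
    for p in sorted(cached, key=lambda q: weight(q, now)):
        if freed >= bytes_needed:
            break
        victims.append(p)
        freed += p.size_bytes
    return victims


if __name__ == "__main__":
    cached = [
        Partition(rdd_id=1, index=0, size_bytes=100, cached_at=0.0, use_count=1),
        Partition(rdd_id=1, index=1, size_bytes=100, cached_at=0.0, use_count=10),
        Partition(rdd_id=2, index=0, size_bytes=100, cached_at=0.0, use_count=5),
    ]
    for victim in select_victims(cached, bytes_needed=150, now=10.0):
        print(victim.rdd_id, victim.index)
```

Under these assumptions a rarely used partition is chosen before a frequently used one of the same size and age; a real policy would also need the per-RDD distribution information that the abstract mentions.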

