Research on Memory Management and Cache Replacement Policies in Spark

doi:10.11896/j.issn.1002-137X.2017.06.005

Abstract

Spark is a big data processing framework based on MapReduce. It makes full use of cluster memory and thereby accelerates data processing. Spark divides memory into Shuffle Memory, Storage Memory and Unroll Memory according to function, and these memory regions have different characteristics. The features of Shuffle Memory and Storage Memory were tested and analyzed. The RDD (Resilient Distributed Dataset) is the most important abstraction in Spark and can be cached in cluster memory. When cluster memory is insufficient, Spark must select some RDD partitions to discard to make room for new ones. A new cache replacement policy called DWRP (Distributed Weight Replacement Policy) was proposed. DWRP computes a weight for every RDD partition based on how long it has been stored in memory, its size and its frequency of use, and then selects candidate RDD partitions to discard based on distribution features. Finally, the performance of different cache replacement policies was tested and analyzed.

Key words: Big data, Spark memory management, RDD cache, Cache replacement policies
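
The weight-based eviction idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the DWRP formula, so the `weight` function below is an assumed stand-in that only combines the three named inputs (time stored in memory, size, frequency of use). The `CachedPartition` class and `select_victims` helper are hypothetical names, and the distribution-based selection step is not modeled.

```python
from dataclasses import dataclass


@dataclass
class CachedPartition:
    """Bookkeeping for one cached RDD partition (hypothetical structure)."""
    rdd_id: int
    index: int
    size_bytes: int
    cached_at: float  # time the partition was stored, in seconds
    use_count: int    # number of times the partition has been read


def weight(p: CachedPartition, now: float) -> float:
    """Illustrative weight. ASSUMED formula, not the one from the paper.

    Frequently used partitions score higher; older and larger ones score lower.
    """
    age = max(now - p.cached_at, 1e-9)  # guard against division by zero
    return p.use_count / (age * p.size_bytes)


def select_victims(cache, bytes_needed, now):
    """Pick the lowest-weight partitions until bytes_needed is freed."""
    victims = []
    freed = 0
    for p in sorted(cache, key=lambda part: weight(part, now)):
        if freed >= bytes_needed:
            break
        victims.append(p)
        freed += p.size_bytes
    return victims
```

Under this assumed weighting, a partition that is old, large and rarely read is discarded first. A different combination of the same three inputs would change the eviction order, which is the kind of difference the paper's experiments compare across policies.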

MENG Hong-tao, YU Song-ping, LIU Fang and XIAO Nong. Research on Memory Management and Cache Replacement Policies in Spark[J]. Computer Science, 2017, 44(6): 31-35.
