Optimization of Spark RDD Based on Non-serialization Native Storage

doi:10.11896/j.issn.1002-137X.2019.05.022

Abstract

Abstract: More and more enterprises are adopting Spark as their big data computing framework. However, as the memory available on current servers grows, Spark does not adapt well to the new environment. Spark runs on the Java Virtual Machine (JVM). When heap memory is used heavily, the share of total job time that the JVM spends reclaiming memory to make room for new objects (garbage collection, GC) rises significantly, and job efficiency does not improve in proportion to the additional memory. Once off-heap (native) memory storage is used, the cost of serialization and deserialization replaces GC as the new bottleneck. This paper uses native storage to address the GC problem and speeds up jobs by reducing GC overhead. It also proposes a modified storage structure for Spark and improves the eviction mechanism and the caching of RDDs. Data are moved into native memory without being serialized, which keeps garbage collection overhead low and avoids the time spent on serialization. Experimental results show that, on a single-node server with large memory, the GC cost of the modified method is 5% to 30% of that incurred by Spark's on-heap storage. Meanwhile, serialization overhead decreases, throughput increases, and job running time is reduced by more than 8%.
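
The idea of keeping data in native memory without serializing it can be illustrated with a minimal Java sketch. This is not the authors' implementation, which modifies Spark's own storage structure; it only assumes that a block of fixed-width values (here `double`) is written directly into a direct `ByteBuffer`, whose contents typically reside outside the garbage-collected heap. The class name `OffHeapDoubleBlock` and its methods are hypothetical and chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only: a block of doubles kept in native memory.
// Values are written as raw 8-byte fields, so no Java object serialization
// (e.g. ObjectOutputStream) is involved, and the GC does not have to trace
// one heap object per element.
final class OffHeapDoubleBlock {
    private final ByteBuffer buffer; // direct buffer backed by native memory
    private final int length;        // number of stored values

    OffHeapDoubleBlock(double[] values) {
        this.length = values.length;
        // multiplyExact guards against int overflow for very large blocks.
        int bytes = Math.multiplyExact(values.length, Double.BYTES);
        this.buffer = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
        for (double v : values) {
            buffer.putDouble(v); // relative put: copies the raw 8-byte value
        }
        buffer.flip(); // switch from writing to reading
    }

    // Absolute read: does not change the buffer position.
    double get(int index) {
        return buffer.getDouble(index * Double.BYTES);
    }

    int length() {
        return length;
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    // Reads every value straight from native memory, with no deserialization step.
    double sum() {
        double total = 0.0;
        for (int i = 0; i < length; i++) {
            total += get(i);
        }
        return total;
    }
}
```

A caller would construct the block once from an on-heap array and then read values by index; the on-heap array can afterwards be released, leaving only the small wrapper object for the collector to track.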

Key words: Deserialization, Garbage collection, Native memory, Spark, Storage system

CLC Number: TP391

ZHAO Jun-xian, YU Jian. Optimization of Spark RDD Based on Non-serialization Native Storage[J]. Computer Science, 2019, 46(5): 143-149.
