Computer Science, 2019, Vol. 46, Issue (5): 143-149. doi: 10.11896/j.issn.1002-137X.2019.05.022
Topic: Database Technology
赵俊先, 喻剑
ZHAO Jun-xian, YU Jian
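Abstract: A growing number of enterprises use Spark as their big data computing framework, but Spark has not kept pace with the increasing memory available on current servers. Spark runs on the Java virtual machine (JVM). As more heap memory is used, the time the JVM spends reclaiming memory to make room for new objects (garbage collection, GC) accounts for a significantly larger share of total Spark job time, and job efficiency does not improve in proportion to the available memory. Switching to the off-heap (native) memory storage mode alleviates the GC overhead, but the cost of serializing cached data then becomes the new bottleneck. This paper uses native storage to address the GC problem and reduces serialization overhead to speed up jobs. It modifies Spark's storage structure, improves the RDD eviction mechanism and caching method, and places deserialized data in native memory, which lowers serialization overhead while keeping GC overhead low. Experimental results show that, compared with Spark's original on-heap storage, the non-serialized native storage method reduces GC time to 5%~30% on a single-node, large-memory server. Serialization overhead also drops significantly, throughput improves, and job time is shortened by more than 8%.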
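The idea of keeping cached records in native memory in a directly readable, non-serialized layout can be illustrated with a minimal Java sketch. It is not the paper's implementation: the class `OffHeapRecordStore`, its fixed-width (long key, double value) record layout, and all method names are hypothetical and chosen only for illustration. The sketch relies on the standard `java.nio.ByteBuffer.allocateDirect` call, whose contents live outside the garbage-collected heap, and reads fields at computed offsets, so no object deserialization step is needed on access.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration, not the paper's implementation: fixed-width
// (long key, double value) records cached in memory outside the Java heap.
final class OffHeapRecordStore {
    private static final int RECORD_BYTES = Long.BYTES + Double.BYTES; // 16 bytes per record

    private final ByteBuffer buffer; // direct buffer: contents are not on the GC-managed heap
    private final int capacity;      // maximum number of records
    private int count;               // number of records currently stored

    OffHeapRecordStore(int capacity) {
        if (capacity < 0) {
            throw new IllegalArgumentException("capacity must be >= 0");
        }
        this.capacity = capacity;
        this.buffer = ByteBuffer
                .allocateDirect(Math.multiplyExact(capacity, RECORD_BYTES))
                .order(ByteOrder.nativeOrder());
    }

    // Appends one record; returns false when full so a caller could evict or spill.
    boolean put(long key, double value) {
        if (count == capacity) {
            return false;
        }
        int offset = count * RECORD_BYTES;
        buffer.putLong(offset, key);
        buffer.putDouble(offset + Long.BYTES, value);
        count++;
        return true;
    }

    // Fields are read in place at a computed offset, with no deserialization step.
    long keyAt(int index) {
        return buffer.getLong(offsetOf(index));
    }

    double valueAt(int index) {
        return buffer.getDouble(offsetOf(index) + Long.BYTES);
    }

    // Scans all stored values without creating one heap object per record.
    double sumValues() {
        double sum = 0.0;
        for (int i = 0; i < count; i++) {
            sum += valueAt(i);
        }
        return sum;
    }

    int size() {
        return count;
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    private int offsetOf(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return index * RECORD_BYTES;
    }

    public static void main(String[] args) {
        OffHeapRecordStore store = new OffHeapRecordStore(3);
        store.put(1L, 1.5);
        store.put(2L, 2.5);
        System.out.println(store.size() + " records, sum = " + store.sumValues());
    }
}
```

A fixed-width layout is the simplest case in which native memory can be read without deserialization. Variable-length records would need an offset index or length prefixes, and how the paper handles general RDD elements is not stated in the abstract.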