Survey on Performance Optimization Technologies for Spark

doi:10.11896/j.issn.1002-137X.2018.07.002

Abstract

Abstract: With the arrival of the big data era, big data processing platforms have developed rapidly. Platforms such as Hadoop, Spark, and Storm have emerged, among which Apache Spark is the most prominent. As Spark has come into wide use in China and abroad, many of its performance problems remain unsolved. Because Spark's underlying execution mechanism is highly complex, users find it difficult to locate performance bottlenecks, let alone optimize further. To address these problems, this paper summarizes and analyzes existing Spark optimization techniques from five aspects: development principle optimization, memory optimization, configuration parameter optimization, scheduling optimization, and Shuffle process optimization. Finally, it summarizes the key open problems in current Spark optimization techniques and proposes the main directions for future research.

Key words: Configuration parameter optimization, Development principle optimization, Memory optimization, Scheduling optimization, Shuffle process optimization, Spark

CLC number: TP391

LIAO Hu-sheng, HUANG Shan-shan, XU Jun-gang, LIU Ren-feng. Survey on Performance Optimization Technologies for Spark[J]. Computer Science, 2018, 45(7): 7-15. https://doi.org/10.11896/j.issn.1002-137X.2018.07.002


