Computer Science (计算机科学), 2018, Vol. 45, Issue (7): 7-15. doi: 10.11896/j.issn.1002-137X.2018.07.002
LIAO Hu-sheng (廖湖声)1, HUANG Shan-shan (黄珊珊)1, XU Jun-gang (徐俊刚)2, LIU Ren-feng (刘仁峰)2
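Abstract: In recent years, with the arrival of the big data era, big data processing platforms have developed rapidly, producing platforms such as Hadoop, Spark, and Storm, among which Spark is the most prominent. Although Spark is now widely used in China and abroad, many of its performance problems remain unsolved. Because Spark's underlying execution mechanism is extremely complex, users find it hard to locate performance bottlenecks, let alone optimize further. To address these problems, this paper summarizes and analyzes current Spark optimization techniques, both domestic and international, from five aspects: development principle optimization, memory optimization, configuration parameter optimization, scheduling optimization, and Shuffle process optimization. Finally, it summarizes the new core problems facing current Spark optimization techniques and proposes the main directions for future research.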