Survey on Performance Optimization Technologies for Spark

doi:10.11896/j.issn.1002-137X.2018.07.002

Abstract

In recent years, with the advent of the big data era, big data processing platforms have developed rapidly. Many such platforms have appeared, including Hadoop, Spark, and Storm, among which Apache Spark is the most prominent. As Spark is widely applied at home and abroad, many performance problems remain to be solved. Because Spark's underlying implementation mechanism is very complex, it is difficult for ordinary users to find performance bottlenecks, let alone optimize further. To address these problems, this paper summarizes and analyzes performance optimization technologies for Spark from five aspects: development principle optimization, memory optimization, configuration parameter optimization, scheduling optimization, and shuffle process optimization. Finally, it summarizes the key problems of Spark optimization technologies and proposes future research issues.

Key words: Configuration parameter optimization, Development principle optimization, Memory optimization, Scheduling optimization, Shuffle process optimization, Spark

CLC Number: TP391

LIAO Hu-sheng, HUANG Shan-shan, XU Jun-gang, LIU Ren-feng. Survey on Performance Optimization Technologies for Spark[J]. Computer Science, 2018, 45(7): 7-15.
