Computer Science ›› 2019, Vol. 46 ›› Issue (3): 9-18. doi: 10.11896/j.issn.1002-137X.2019.03.002

• Survey •

Survey of Distributed Machine Learning Platforms and Algorithms

SHU Na1, LIU Bo1, LIN Wei-wei2, LI Peng-fei1

  1. School of Computer Science, South China Normal University, Guangzhou 510631, China
  2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China
  • Received: 2018-04-06  Revised: 2018-06-28  Online: 2019-03-15  Published: 2019-03-22
  • About the authors: SHU Na (born 1993), female, master's candidate, her main research interest is distributed machine learning; LIU Bo (born 1968), male, Ph.D, professor, his main research interests include cloud storage, cloud computing and big data; LIN Wei-wei (born 1980), male, Ph.D, professor, senior member of CCF, his main research interests include cloud computing and big data, E-mail: linww@scut.edu.cn (corresponding author); LI Peng-fei (born 1993), male, master's candidate, his main research interests include Docker container scheduling and cloud computing.
  • Supported by:
    National Natural Science Foundation of China (61772205), Science and Technology Planning Project of Guangdong Province (2017B010126002, 2017A010101008, 2017A010101014, 2017B090901061, 2016B090918021, 2016A010101018, 2016A010119171) and Science and Technology Planning Project of Nansha District, Guangzhou (2017GJ001).

Abstract: Distributed machine learning distributes tasks with large-scale data and computation across multiple machines. Its core idea is "divide and conquer", which effectively speeds up large-scale computation and reduces overhead. As one of the most important research areas of machine learning, distributed machine learning has attracted wide attention from researchers in many fields. In view of its research significance and practical value, this paper systematically surveys the mainstream distributed machine learning platforms Spark, MXNet, Petuum, TensorFlow and PyTorch, and summarizes, analyzes and compares their characteristics in depth from multiple perspectives. Next, it describes the distributed implementation of machine learning algorithms in terms of both data parallelism and model parallelism, and then reviews the distributed computing models of machine learning algorithms under three schemes: the bulk synchronous parallel (BSP) model, the asynchronous parallel model, and the delayed asynchronous parallel model. Finally, it discusses future research directions of distributed machine learning from five aspects: platform performance improvement, algorithm optimization, model communication methods, scalability of algorithms under large-scale computation, and fault tolerance of models in distributed environments.
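
To make the parallelization and synchronization schemes above concrete, the following is a minimal Python sketch written for this summary; it is not code from the paper or from any of the surveyed platforms, and all names in it (make_shard, bsp_train, asp_train, and the toy least-squares problem) are illustrative. It simulates data-parallel SGD with three workers, once under the bulk synchronous parallel (BSP) model and once under the asynchronous parallel (ASP) model using plain threads.

# A minimal, self-contained sketch (illustrative only): data-parallel SGD
# on a toy least-squares problem, once under BSP and once under ASP.
import random
import threading

DIM, WORKERS, STEPS, LR = 4, 3, 50, 0.1
TRUE_W = [1.0, -2.0, 0.5, 3.0]

def make_shard(n=64):
    # Each worker holds a private shard of the training data (data parallelism).
    shard = []
    for _ in range(n):
        x = [random.gauss(0, 1) for _ in range(DIM)]
        y = sum(wi * xi for wi, xi in zip(TRUE_W, x))
        shard.append((x, y))
    return shard

def gradient(w, shard):
    # Average gradient of the squared error over one worker's shard.
    g = [0.0] * DIM
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i in range(DIM):
            g[i] += 2.0 * err * x[i] / len(shard)
    return g

def bsp_train(shards):
    # Bulk synchronous parallel: every step ends at a barrier where all
    # workers' gradients are averaged before the shared model advances.
    w = [0.0] * DIM
    for _ in range(STEPS):
        grads = [gradient(w, s) for s in shards]  # conceptually in parallel
        avg = [sum(g[i] for g in grads) / len(grads) for i in range(DIM)]
        w = [wi - LR * gi for wi, gi in zip(w, avg)]  # the barrier
    return w

def asp_train(shards):
    # Asynchronous parallel: workers push updates whenever they finish a step,
    # with no barrier, so gradients may be computed on stale parameters.
    w = [0.0] * DIM
    lock = threading.Lock()

    def worker(shard):
        for _ in range(STEPS):
            g = gradient(w, shard)  # may read a stale snapshot of w
            with lock:              # atomic update, but no global barrier
                for i in range(DIM):
                    w[i] -= LR * g[i] / WORKERS

    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

if __name__ == "__main__":
    random.seed(0)
    shards = [make_shard() for _ in range(WORKERS)]
    print("BSP:", [round(v, 2) for v in bsp_train(shards)])
    print("ASP:", [round(v, 2) for v in asp_train(shards)])

The delayed asynchronous (stale synchronous) model sits between these two extremes: workers run ahead without a per-step barrier as in asp_train, but a staleness bound forces the fastest worker to wait once it is more than a fixed number of steps ahead of the slowest.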

Key words: Algorithm analysis, Big data, Distributed machine learning, Machine learning, Parallel computing

CLC number: TP301