Computer Science ›› 2019, Vol. 46 ›› Issue (3): 9-18. doi: 10.11896/j.issn.1002-137X.2019.03.002

• Review •

Survey of Distributed Machine Learning Platforms and Algorithms

SHU Na1, LIU Bo1, LIN Wei-wei2, LI Peng-fei1

  1. School of Computer, South China Normal University, Guangzhou 510631, China
  2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China
  • Received: 2018-04-06  Revised: 2018-06-28  Online: 2019-03-15  Published: 2019-03-22
  • About the authors: SHU Na (1993-), female, master's student; her main research interest is distributed machine learning. LIU Bo (1968-), male, Ph.D., professor; his main research interests include cloud storage, cloud computing and big data. LIN Wei-wei (1980-), male, Ph.D., professor, senior member of CCF, corresponding author (E-mail: linww@scut.edu.cn); his main research interests include cloud computing and big data. LI Peng-fei (1993-), male, master's student; his main research interests include Docker container scheduling and cloud computing.
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61772205), the Science and Technology Planning Projects of Guangdong Province (2017B010126002, 2017A010101008, 2017A010101014, 2017B090901061, 2016B090918021, 2016A010101018, 2016A010119171) and the Science and Technology Planning Project of Nansha District, Guangzhou (2017GJ001).

Abstract: Distributed machine learning deploys tasks with large-scale data and heavy computation across multiple machines. Its core idea is "divide and conquer", which effectively speeds up large-scale computation and reduces cost. As one of the most important research areas of machine learning, distributed machine learning has attracted wide attention from researchers in many fields. In view of its research significance and practical value, this paper systematically surveys the mainstream distributed machine learning platforms Spark, MXNet, Petuum, TensorFlow and PyTorch, and summarizes, analyzes and compares their characteristics from multiple perspectives. It then describes the distributed implementation of machine learning algorithms in terms of data parallelism and model parallelism, and reviews their distributed computing models under three schemes: the bulk synchronous parallel model, the asynchronous parallel model and the delayed (stale) asynchronous parallel model. Finally, it discusses future research directions of distributed machine learning from five aspects: platform performance improvement, algorithm optimization, model communication, scalability of algorithms under large-scale computation, and fault tolerance of models in distributed environments.

Key words: Big data, Distributed machine learning, Machine learning, Algorithm analysis, Parallel computing
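The data-parallel, bulk-synchronous scheme mentioned in the abstract can be made concrete with a minimal sketch (illustrative only, not code from any of the surveyed platforms): each worker computes a gradient on its own data shard, and a synchronization barrier averages all gradients before the shared model is updated.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of the mean-squared-error loss 0.5*||Xw - y||^2 / n on one shard."""
    return X.T @ (X @ w - y) / len(y)

def bsp_data_parallel_sgd(X, y, num_workers=4, rounds=100, lr=0.1):
    """Bulk synchronous parallel (BSP) data-parallel training:
    every round, each worker computes a gradient on its own data shard;
    a barrier then averages the gradients before the shared model is updated."""
    shards = list(zip(np.array_split(X, num_workers), np.array_split(y, num_workers)))
    w = np.zeros(X.shape[1])
    for _ in range(rounds):
        # In a real system the shard gradients are computed in parallel.
        grads = [local_gradient(w, Xi, yi) for Xi, yi in shards]
        w -= lr * np.mean(grads, axis=0)  # barrier: wait for all workers, then update
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
w = bsp_data_parallel_sgd(X, y)
```

Because each round averages the per-shard gradients, the update equals one step of full-batch gradient descent, so the recovered `w` converges to `true_w` on this noiseless toy problem.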

CLC number: TP301
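Among the three computing models the paper reviews, the delayed (stale) asynchronous parallel model is governed by a single bound: the fastest worker may run at most a fixed number of clocks ahead of the slowest. A hypothetical toy scheduler (an assumption for illustration, not taken from any surveyed system) makes the rule concrete:

```python
def ssp_can_advance(clocks, worker, staleness):
    """A worker may start its next iteration only if it is at most
    `staleness` clocks ahead of the slowest worker (the SSP bound)."""
    return clocks[worker] - min(clocks) <= staleness

def simulate(speeds, staleness, rounds=20):
    """Toy scheduler: each round, worker i attempts speeds[i] iterations,
    skipping any iteration that the staleness bound blocks."""
    clocks = [0] * len(speeds)
    for _ in range(rounds):
        for i, speed in enumerate(speeds):
            for _ in range(speed):
                if ssp_can_advance(clocks, i, staleness):
                    clocks[i] += 1
    return clocks
```

With `staleness=0` this degenerates into the bulk synchronous parallel model (lockstep clocks); with a very large bound it approaches the fully asynchronous model, trading consistency for throughput.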