Computer Science, 2019, Vol. 46, Issue (3): 9-18. doi: 10.11896/j.issn.1002-137X.2019.03.002

Surveys

Survey of Distributed Machine Learning Platforms and Algorithms

SHU Na¹, LIU Bo¹, LIN Wei-wei², LI Peng-fei¹

1. School of Computer, South China Normal University, Guangzhou 510631, China
2. School of Computer Science and Technology, South China University of Technology, Guangzhou 510640, China

Received: 2018-04-06; Revised: 2018-06-28; Online: 2019-03-15; Published: 2019-03-22

Abstract: Distributed machine learning spreads tasks that involve large-scale data and computation across multiple machines. Its core idea is “divide and conquer”, which effectively speeds up large-scale computation and reduces overhead. As one of the most important fields of machine learning, distributed machine learning has attracted wide attention from researchers in many fields. In view of its research significance and practical value, this paper summarizes the mainstream platforms Spark, MXNet, Petuum, TensorFlow and PyTorch, and analyzes their characteristics from different perspectives. It then explains in depth how machine learning algorithms are implemented under data parallelism and model parallelism, and reviews three distributed computing models: the bulk synchronous parallel model, the asynchronous parallel model and the delayed asynchronous parallel model. Finally, it discusses future work on distributed machine learning in five aspects: platform improvement, algorithm optimization, network communication, scalability of algorithms for large-scale data, and fault tolerance.
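As a rough illustration of data parallelism and of the difference between synchronous and delayed parameter reads summarized above, the following minimal Python sketch simulates data-parallel gradient descent on a toy one-parameter linear regression. It is not the paper's method: the function names (`partition`, `shard_gradient`, `train`), the fixed lag schedule and the toy data are hypothetical choices made for illustration. With `staleness=0` every worker reads the newest parameter before each update, as under the bulk synchronous parallel model; with `staleness>0`, worker k reads a copy up to `staleness` updates old, a simplified stand-in for bounded-delay schemes such as the stale synchronous parallel parameter server of Ho et al. (NIPS 2013). Fully asynchronous, unbounded delays are not modeled.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # one (x, y) training pair
Shard = List[Sample]          # the slice of data held by one worker


def partition(data: List[Sample], num_workers: int) -> List[Shard]:
    """Data parallelism: deal the training set round-robin into one shard per worker."""
    return [data[k::num_workers] for k in range(num_workers)]


def shard_gradient(w: float, shard: Shard) -> float:
    """Average gradient of the loss 0.5 * (w * x - y) ** 2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def train(data: List[Sample], num_workers: int = 4, staleness: int = 0,
          lr: float = 0.1, iters: int = 200) -> float:
    """Simulate data-parallel gradient descent for the model y = w * x.

    staleness=0: every worker reads the newest parameter (BSP-style barrier).
    staleness=s>0: worker k reads a parameter min(k, s) updates old, a
    hypothetical fixed lag schedule that keeps every read within s updates.
    """
    shards = partition(data, num_workers)
    history = [0.0]  # history[t] is the global parameter after t updates
    for t in range(iters):
        grads = []
        for k, shard in enumerate(shards):
            lag = min(k, staleness)            # how far behind worker k reads
            w_read = history[max(0, t - lag)]  # possibly stale parameter copy
            grads.append(shard_gradient(w_read, shard))
        # Aggregate the workers' gradients (e.g. at a parameter server) and update.
        history.append(history[t] - lr * sum(grads) / num_workers)
    return history[-1]
```

For example, with `data = [(i / 10, 3 * i / 10) for i in range(1, 21)]`, both `train(data, staleness=0)` and `train(data, staleness=3)` end very close to the true slope 3 in this toy setting. The small learning rate matters: in general, stale gradients combined with a large step size can make such updates oscillate or diverge, which is one motivation for capping the staleness.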

Key words: Big data, Distributed machine learning, Machine learning, Algorithm analysis, Parallel computing

CLC Number: TP301