Computer Science ›› 2024, Vol. 51 ›› Issue (12): 120-128. doi: 10.11896/jsjkx.231200128

• High Performance Computing •

Study on Distributed Training Optimization Based on Hybrid Parallel

XU Jinlong1,3, LI Pengfei2, LI Jianan2, CHEN Biaoyuan2, GAO Wei1, HAN Lin1

  1 National Supercomputing Center in Zhengzhou (Zhengzhou University), Zhengzhou 450000, China
    2 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, China
    3 Strategic Support Force Information Engineering University, Zhengzhou 450000, China
  • Received: 2023-12-19  Revised: 2024-04-29  Online: 2024-12-15  Published: 2024-12-10
  • Corresponding author: HAN Lin (strollerlin@163.com)
  • About author: (longkaizh@163.com)
  • Supported by:
    Major Science and Technology Project of Henan Province (221100210600)

Study on Distributed Training Optimization Based on Hybrid Parallel

XU Jinlong1,3, LI Pengfei2, LI Jianan2, CHEN Biaoyuan2, GAO Wei1, HAN Lin1   

  1 National Supercomputing Center in Zhengzhou (Zhengzhou University), Zhengzhou 450000, China
    2 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, China
    3 Strategic Support Force Information Engineering University, Zhengzhou 450000, China
  • Received: 2023-12-19  Revised: 2024-04-29  Online: 2024-12-15  Published: 2024-12-10
  • About author: XU Jinlong, born in 1985, Ph.D, master's supervisor. His main research interests include high-performance computing and parallel compilation.
    HAN Lin, born in 1978, Ph.D, associate professor, is a senior member of CCF (No.16416M). His main research interests include compiler optimization and high-performance computing.
  • Supported by:
    Major Science and Technology Project of Henan Province (221100210600).

Abstract: Training large neural networks is a hot topic in deep learning, and distributed training is one of the best ways to train large neural networks across multiple nodes. Distributed training typically involves three parallel methods: data parallelism, inter-layer parallelism, and intra-layer parallelism. However, existing frameworks can only partition a model manually for inter-layer parallelism, which increases the abstraction burden of model design; to address this, a node-constraint relationship search algorithm is proposed to partition the model automatically. In addition, in traditional data parallelism and inter-layer parallelism, computation and communication are strictly serialized because of the model's complex constraint relationships and the required communication operations; a synchronization optimization algorithm is therefore introduced to overlap computation with communication, effectively improving overall training efficiency. Experiments train GPT-2 models of different sizes as well as AlexNet, VGG16, and ResNet50. With the synchronization optimization algorithm, the training performance of GPT2-XL, GPT2-LARGE, and GPT2-MEDIUM improves by factors of 1.14, 1.18, and 1.23 respectively on 6 nodes, and that of AlexNet, VGG16, and ResNet50 improves by factors of 1.31, 1.14, and 1.03 respectively on 1 node. The results show that the synchronization optimization algorithm improves training efficiency in hybrid parallelism.
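To make the idea of automatic inter-layer partitioning concrete, the following Python sketch greedily packs the layers of a sequential model into pipeline stages balanced by parameter count. It is only an illustration of automatic model splitting under simplified assumptions; the function name split_into_stages and the parameter-count cost model are hypothetical and do not reproduce the node-constraint relationship search algorithm proposed in the paper.

import torch.nn as nn

def split_into_stages(model: nn.Sequential, num_stages: int):
    """Greedily pack consecutive layers into num_stages stages balanced by parameter count.
    Hypothetical illustration only, not the paper's partitioning algorithm."""
    layers = list(model.children())
    costs = [sum(p.numel() for p in layer.parameters()) for layer in layers]
    target = sum(costs) / num_stages          # ideal per-stage load
    stages, current, load = [], [], 0.0
    for layer, cost in zip(layers, costs):
        # start a new stage once the current one is "full", as long as stages remain
        if current and load + cost > target and len(stages) < num_stages - 1:
            stages.append(nn.Sequential(*current))
            current, load = [], 0.0
        current.append(layer)
        load += cost
    stages.append(nn.Sequential(*current))
    return stages

if __name__ == "__main__":
    toy = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 1024),
    )
    for i, stage in enumerate(split_into_stages(toy, num_stages=2)):
        # each stage would be placed on its own node in inter-layer parallelism
        print(f"stage {i}: {sum(p.numel() for p in stage.parameters())} parameters")

In a real system the cost model would also account for activation sizes and inter-stage communication volume, which is part of what an automatic search over node constraints has to balance.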

Key words: Distributed training, Hybrid parallelism, Automatic partitioning, Communication optimization, Gradient synchronization

Abstract: Large-scale neural network training is a hot topic in the field of deep learning, and distributed training stands out as one of the most effective methods for training large neural networks across multiple nodes. Distributed training typically involves three parallel methods: data parallelism, inter-layer parallelism, and intra-layer parallelism. However, in existing frameworks, manual model partitioning is required for inter-layer parallelism, which increases the abstract complexity of model design. To address this issue, we propose a node-constrained relationship search algorithm that automates the model partitioning process. Moreover, in traditional data parallelism and inter-layer parallelism, strict serialization limits the overlap of computation and communication due to complex model constraints and the need for communication operations. To overcome this challenge, we introduce a synchronous optimization algorithm, enabling the overlap of computation and communication and effectively enhancing the overall training efficiency. The experiments involve training GPT-2 models of different sizes, as well as AlexNet, VGG16, and ResNet50. Using the synchronous optimization algorithm under a 6-node configuration, the training performance of GPT2-XL, GPT2-LARGE, and GPT2-MEDIUM improves, achieving speed-ups of 1.14, 1.18, and 1.23, respectively. Under a 1-node configuration, performance enhancements are also observed for AlexNet, VGG16, and ResNet50, with speed-ups of 1.31, 1.14, and 1.03, respectively. The experimental results indicate that the synchronous optimization algorithm effectively enhances training efficiency in hybrid parallelism.
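The overlap of gradient communication with backward computation can be sketched as follows. This is a minimal, hypothetical illustration of the general pattern rather than the paper's synchronous optimization algorithm: as each parameter's gradient becomes available during the backward pass, an asynchronous all-reduce is launched, and the optimizer step only waits on the outstanding communication handles. The class name OverlappedGradSync and the helper names in the usage comments are invented for this sketch; it assumes an initialized torch.distributed process group and PyTorch 2.1+ for register_post_accumulate_grad_hook.

import torch
import torch.distributed as dist
import torch.nn as nn

class OverlappedGradSync:
    """Launch an asynchronous all-reduce for each gradient as soon as it is ready."""

    def __init__(self, model: nn.Module):
        self.handles = []
        self.world_size = dist.get_world_size()
        for p in model.parameters():
            if p.requires_grad:
                # Fires during backward, right after p.grad is fully accumulated,
                # so communication of early gradients overlaps with the rest of
                # the backward computation.
                p.register_post_accumulate_grad_hook(self._launch_allreduce)

    def _launch_allreduce(self, param: torch.Tensor):
        handle = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        self.handles.append((handle, param))

    def wait(self):
        # Block only once at the end of backward, then average the summed gradients.
        for handle, param in self.handles:
            handle.wait()
            param.grad.div_(self.world_size)
        self.handles.clear()

# Typical training-step usage (sketch; build_model and compute_loss are placeholders):
#   dist.init_process_group(backend="nccl")
#   model = build_model().cuda()
#   syncer = OverlappedGradSync(model)
#   optimizer.zero_grad()
#   loss = compute_loss(model, batch)
#   loss.backward()      # per-parameter all-reduces start during backward
#   syncer.wait()        # finish communication before the update
#   optimizer.step()

Production systems typically bucket several gradients per collective to amortize launch overhead; the per-parameter version above is kept deliberately simple to show where the overlap comes from.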

Key words: Distributed learning, Hybrid parallel, Automatic segmentation, Communication optimization, Gradient synchronization

CLC Number: TP391