Computer Science, 2024, Vol. 51, Issue 12: 120-128. DOI: 10.11896/jsjkx.231200128
XU Jinlong1,3, LI Pengfei2, LI Jianan2, CHEN Biaoyuan2, GAO Wei1, HAN Lin1
Abstract: Training large neural networks is a hot topic in deep learning, and distributed training is one of the best ways to train large neural networks across multiple nodes. Distributed training typically combines three parallelization strategies: data parallelism, inter-layer parallelism, and intra-layer parallelism. However, existing frameworks require the model to be partitioned manually for inter-layer parallelism, which increases the abstraction complexity of model design. To address this, a node constraint relation search algorithm is proposed to partition the model automatically. In addition, in conventional data parallelism and inter-layer parallelism, computation and communication are usually strictly serialized because of the model's complex constraint relations and the required communication operations. A synchronization optimization algorithm is therefore introduced to overlap computation with communication, effectively improving overall training efficiency. Experiments train GPT-2, AlexNet, VGG16, and ResNet50 models of different scales. With the synchronization optimization algorithm, the training performance of GPT2-XL, GPT2-LARGE, and GPT2-MEDIUM improves by 1.14x, 1.18x, and 1.23x respectively on 6 nodes, and that of AlexNet, VGG16, and ResNet50 improves by 1.31x, 1.14x, and 1.03x respectively on a single node. The results show that the synchronization optimization algorithm can improve training efficiency in hybrid parallelism.
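The abstract describes the synchronization optimization only at a high level. Purely as an illustration of the general idea of overlapping gradient communication with backward computation in data parallelism, the sketch below uses PyTorch's per-parameter gradient hooks together with asynchronous all-reduce; the names (OverlappedAllReduce, train_step) are hypothetical, and this is not the paper's actual algorithm.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

# Minimal sketch of computation/communication overlap in data parallelism:
# as soon as a parameter's gradient is produced during backward, a
# non-blocking all-reduce is launched for it, so communication for later
# layers proceeds while backward computation for earlier layers continues.

class OverlappedAllReduce:
    def __init__(self, model: nn.Module):
        self.world_size = dist.get_world_size()
        self.pending = []  # (async work handle, gradient tensor) pairs
        for p in model.parameters():
            if p.requires_grad:
                p.register_hook(self._make_hook())

    def _make_hook(self):
        def hook(grad):
            # Launch an asynchronous all-reduce; backward keeps running.
            work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
            self.pending.append((work, grad))
            return grad
        return hook

    def synchronize(self):
        # Wait for all outstanding communication, then turn sums into averages.
        for work, grad in self.pending:
            work.wait()
            grad.div_(self.world_size)
        self.pending.clear()


def train_step(model, reducer, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()        # async all-reduces are issued as gradients appear
    reducer.synchronize()  # ensure communication has finished before the update
    optimizer.step()
    return loss.item()
```

The sketch assumes a process group has already been created with dist.init_process_group and that the reducer is constructed once before training, so hooks are not registered repeatedly.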