计算机科学 (Computer Science) ›› 2024, Vol. 51 ›› Issue (12): 129-136. doi: 10.11896/jsjkx.231000110
钟震宇, 林勇良, 王昊天, 李东闻, 孙羽菲, 张玉志
ZHONG Zhenyu, LIN Yongliang, WANG Haotian, LI Dongwen, SUN Yufei, ZHANG Yuzhi
Abstract: Training large-scale neural networks typically exceeds the memory and compute capacity of a single node, so training must be distributed across multiple nodes. Existing distributed deep learning frameworks are designed mainly for specific hardware environments and do not adapt well to general-purpose computing devices. To support efficient training of large-scale deep neural networks, this work implements a general automatic pipeline-parallel distributed training framework. By combining a pipeline-based model parallelism strategy with an automatic neural network model partitioning algorithm, the framework automatically parallelizes and trains large-scale neural network models and their training data on general-purpose computer clusters, including China's new-generation supercomputers, significantly reducing the memory and compute pressure on each individual node. The framework requires no manual tuning and can automatically and efficiently deploy deep neural networks in multi-node distributed environments. It is applicable not only to high-performance computing clusters such as supercomputers but also to other general-purpose distributed computing environments, providing support for automated distributed training of large-scale neural networks.
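To make the automatic partitioning idea concrete, the following minimal Python sketch illustrates one way such a splitting step for pipeline parallelism could work: consecutive layers are grouped into stages whose estimated costs stay close to the per-stage average, so that no single node carries a disproportionate memory or compute load. This is not the paper's actual algorithm; the function name split_into_stages, the use of parameter counts as the cost metric, and all numbers are illustrative assumptions.

# Minimal sketch (assumption, not the paper's implementation) of automatic
# model splitting for pipeline parallelism: group consecutive layers into
# stages with roughly balanced estimated cost.

from typing import List


def split_into_stages(layer_costs: List[float], num_stages: int) -> List[List[int]]:
    """Greedily partition consecutive layers into num_stages balanced stages."""
    target = sum(layer_costs) / num_stages   # ideal cost per pipeline stage
    stages: List[List[int]] = []
    current: List[int] = []
    acc = 0.0
    for i, cost in enumerate(layer_costs):
        can_close = (
            bool(current)
            and len(stages) < num_stages - 1                          # at least one stage must remain
            and len(layer_costs) - i >= num_stages - len(stages) - 1  # one layer per remaining stage
        )
        # Start a new stage when adding this layer would overshoot the target
        # by more than keeping the current stage slightly underfull.
        if can_close and (acc + cost - target) > (target - acc):
            stages.append(current)
            current, acc = [], 0.0
        current.append(i)
        acc += cost
    stages.append(current)
    return stages


if __name__ == "__main__":
    # Hypothetical per-layer parameter counts of a 10-layer model,
    # split across 4 pipeline stages (e.g. 4 compute nodes).
    costs = [120, 80, 300, 300, 150, 150, 400, 100, 60, 40]
    for stage_id, layer_ids in enumerate(split_into_stages(costs, num_stages=4)):
        stage_cost = sum(costs[i] for i in layer_ids)
        print(f"stage {stage_id}: layers {layer_ids}, cost {stage_cost}")

Running the sketch prints which consecutive layers each of the four stages would own and the resulting per-stage cost; in a real framework each stage would then be placed on a separate node and connected by micro-batch pipelining, with the cost model refined to account for activation memory and communication.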