Automatic Pipeline Parallel Training Framework for General-purpose Computing Devices

Computer Science, 2024, Vol. 51, Issue (12): 129-136. doi: 10.11896/jsjkx.231000110

ZHONG Zhenyu, LIN Yongliang, WANG Haotian, LI Dongwen, SUN Yufei, ZHANG Yuzhi

College of Software, Nankai University, Tianjin 300350, China

Received: 2023-10-18 Revised: 2024-03-11 Online: 2024-12-15 Published: 2024-12-10
Corresponding author: SUN Yufei (yufei_sun@sina.com)
About author: ZHONG Zhenyu (zyzhong@mail.nankai.edu.cn), born in 1997, Ph.D. candidate. His main research interests include natural language processing, high performance computing and AIOps.
SUN Yufei, born in 1976, Ph.D., professor. Her main research interests include deep learning, heterogeneous computing, artificial intelligence, etc.
Supported by:
National Key Research and Development Program of China (2021YFB0300104).

Abstract

Abstract: Training a large-scale neural network often exceeds the memory and computing capacity of a single computing node, so training must be distributed across multiple nodes. Existing distributed deep learning frameworks are mainly designed for specific hardware environments and do not adapt well to the variety of general-purpose computing devices. To support efficient training of large-scale deep neural networks, this paper implements a general-purpose, automatic pipeline-parallel distributed training framework. The framework combines a model parallel strategy based on pipeline parallelism with an algorithm that automatically splits the neural network model. On general-purpose computer clusters, including China's new generation of supercomputers, it automatically parallelizes and trains large-scale neural network models and their training data, significantly reducing the memory and computing pressure on any single node. The framework requires no manual tuning and can automatically and efficiently deploy deep neural networks in multi-node distributed environments. It is suitable not only for supercomputers and other high-performance computing clusters but also for other general-purpose distributed computing environments, providing support for the automated distributed training of large-scale neural networks.

Key words: Pipeline parallelism, Deep neural network, Supercomputer, Message passing interface, Parallel computing
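
The abstract does not specify how the automatic model-splitting algorithm works. As an illustration only, the general idea of splitting a network into pipeline stages can be sketched as a balanced contiguous partition: given an estimated cost per layer, choose cut points so that the heaviest stage is as light as possible. The function name `split_layers`, the cost values, and the use of dynamic programming are all assumptions made for this sketch, not the authors' method.

```python
from typing import List


def split_layers(costs: List[float], num_stages: int) -> List[List[int]]:
    """Partition layer indices into contiguous stages, minimizing the max stage cost.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and the number of layers")

    # prefix[i] = total cost of layers 0..i-1, so any slice cost is a subtraction.
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    inf = float("inf")
    # best[k][i] = smallest achievable bottleneck when the first i layers form k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    # cut[k][i] = start index of the last stage in that optimal arrangement.
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                bottleneck = max(best[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[k][i]:
                    best[k][i] = bottleneck
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the stages.
    stages: List[List[int]] = []
    i = n
    for k in range(num_stages, 0, -1):
        j = cut[k][i]
        stages.append(list(range(j, i)))
        i = j
    stages.reverse()
    return stages


# Hypothetical per-layer costs (e.g. measured forward+backward time or memory).
layer_costs = [4.0, 2.0, 3.0, 1.0, 5.0, 2.0]
stages = split_layers(layer_costs, 3)
stage_costs = [sum(layer_costs[i] for i in stage) for stage in stages]
print(stages, stage_costs)
```

In a pipeline-parallel setting each returned stage would be placed on one node, and the heaviest stage bounds the throughput of the whole pipeline, which is why the sketch minimizes the maximum stage cost rather than, say, the variance. A real framework would also have to account for inter-stage communication and activation memory, which this sketch ignores.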

CLC Number: TP183

ZHONG Zhenyu, LIN Yongliang, WANG Haotian, LI Dongwen, SUN Yufei, ZHANG Yuzhi. Automatic Pipeline Parallel Training Framework for General-purpose Computing Devices[J]. Computer Science, 2024, 51(12): 129-136. https://doi.org/10.11896/jsjkx.231000110

