Computer Science ›› 2024, Vol. 51 ›› Issue (12): 129-136. doi: 10.11896/jsjkx.231000110

• High Performance Computing •

Automatic Pipeline Parallel Training Framework for General-purpose Computing Devices

ZHONG Zhenyu, LIN Yongliang, WANG Haotian, LI Dongwen, SUN Yufei, ZHANG Yuzhi   

  1. College of Software, Nankai University, Tianjin 300350, China
  • Received: 2023-10-18  Revised: 2024-03-11  Online: 2024-12-15  Published: 2024-12-10
  • About author: ZHONG Zhenyu, born in 1997, Ph.D. candidate. His main research interests include natural language processing, high performance computing and AIOps.
    SUN Yufei, born in 1976, Ph.D., professor. Her main research interests include deep learning, heterogeneous computing, artificial intelligence, etc.
  • Supported by:
    National Key Research and Development Program of China (2021YFB0300104).

Abstract: Training large-scale neural networks usually exceeds the memory and computing capacity of a single computing node, which makes distributed training across multiple nodes necessary. Existing distributed deep learning frameworks are mainly designed for specific hardware environments and cannot adapt effectively to the variety of general-purpose computing devices. To support efficient training of large-scale deep neural networks, this paper implements a general-purpose automatic pipeline parallel distributed training framework. The framework combines a model parallel strategy based on pipeline parallelism with an algorithm that automatically splits the neural network model, so that large-scale models and their training data are automatically parallelized and trained on general computer clusters, including the new generation of supercomputers in China, significantly reducing the memory and computing pressure on each computing node. The framework requires no manual tuning and can automatically and efficiently deploy deep neural networks to multi-node distributed environments. It is suitable not only for supercomputers and other high-performance computer clusters but also for other general distributed computing environments, providing support for the automatic distributed training of large-scale neural networks.
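To make the two ideas the abstract combines concrete, the sketch below (Python with mpi4py and PyTorch) shows a greedy split of a layer list into pipeline stages balanced by parameter count, followed by a micro-batched, forward-only pipeline that passes activations between stages over MPI. This is an illustration only, not the authors' implementation: the helper names partition_layers and pipeline_forward, the greedy splitting heuristic, the toy model, and the pickle-based mpi4py communication are assumptions made for the example, and backward propagation and gradient synchronization are omitted.

# Illustrative sketch only: greedy layer partitioning by parameter count plus a
# micro-batched, forward-only pipeline over MPI. Helper names and the toy model
# are hypothetical; this is not the framework described in the paper.
from mpi4py import MPI              # inter-stage message passing
import torch
import torch.nn as nn

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

def partition_layers(layers, n_stages):
    """Greedily assign consecutive layers to stages so that each stage holds a
    roughly equal share of the parameters (a simple stand-in for an automatic
    model-splitting algorithm)."""
    costs = [sum(p.numel() for p in layer.parameters()) or 1 for layer in layers]
    target = sum(costs) / n_stages
    stages, current, acc = [], [], 0.0
    for layer, cost in zip(layers, costs):
        current.append(layer)
        acc += cost
        if acc >= target and len(stages) < n_stages - 1:
            stages.append(nn.Sequential(*current))
            current, acc = [], 0.0
    stages.append(nn.Sequential(*current))
    while len(stages) < n_stages:       # pad with identity stages if needed
        stages.append(nn.Sequential())
    return stages

# A toy model written as a flat layer list so that it can be split across ranks.
layers = [nn.Linear(512, 512), nn.ReLU(),
          nn.Linear(512, 512), nn.ReLU(),
          nn.Linear(512, 10)]
stage = partition_layers(layers, world)[rank]   # each rank keeps only its stage

def pipeline_forward(micro_batches):
    """Rank 0 feeds micro-batches in; every other rank receives activations
    from its predecessor, applies its own stage, and forwards the result."""
    outputs = []
    for i, x in enumerate(micro_batches):
        if rank > 0:                                  # receive from previous stage
            x = comm.recv(source=rank - 1, tag=i)
        y = stage(x)
        if rank < world - 1:                          # send to next stage
            comm.send(y.detach(), dest=rank + 1, tag=i)
        else:
            outputs.append(y)                         # last stage keeps the output
    return outputs

if __name__ == "__main__":
    micro_batches = [torch.randn(8, 512) for _ in range(4)]
    out = pipeline_forward(micro_batches if rank == 0 else [None] * 4)
    if rank == world - 1:
        print("received", len(out), "output micro-batches")

Launched with, for example, mpirun -np 4 python pipeline_sketch.py, each rank holds only the parameters of its own stage, which is the property that relieves the memory pressure on a single node.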

Key words: Pipeline parallelism, Deep neural network, Supercomputer, Message passing interface, Parallel computing

CLC Number: TP183