Computer Science ›› 2024, Vol. 51 ›› Issue (12): 120-128. doi: 10.11896/jsjkx.231200128

• High Performance Computing •

Study on Distributed Training Optimization Based on Hybrid Parallel

XU Jinlong1,3, LI Pengfei2, LI Jianan2, CHEN Biaoyuan2, GAO Wei1, HAN Lin1   

    1 National Supercomputing Center in Zhengzhou (Zhengzhou University), Zhengzhou 450000, China
    2 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, China
    3 Strategic Support Force Information Engineering University, Zhengzhou 450000, China
  • Received: 2023-12-19  Revised: 2024-04-29  Online: 2024-12-15  Published: 2024-12-10
  • About author: XU Jinlong, born in 1985, Ph.D, master's supervisor. His main research interests include high-performance computing and parallel compilation.
    HAN Lin, born in 1978, Ph.D, associate professor, is a senior member of CCF (No.16416M). His main research interests include compiler optimization and high-performance computing.
  • Supported by:
    Major Science and Technology Project of Henan Province (221100210600).

Abstract: Large-scale neural network training is a hot topic in deep learning, and distributed training is one of the most effective ways to train large neural networks across multiple nodes. Distributed training typically combines three parallel methods: data parallelism, inter-layer parallelism, and intra-layer parallelism. However, existing frameworks require the model to be partitioned manually for inter-layer parallelism, which increases the abstraction complexity of model design. To address this issue, we propose a node-constrained relationship search algorithm that automates the model partitioning process. Moreover, in traditional data parallelism and inter-layer parallelism, complex model dependencies and the required communication operations enforce strict serialization, which prevents computation from overlapping with communication. To overcome this challenge, we introduce a synchronous optimization algorithm that enables computation and communication to overlap, effectively improving overall training efficiency. The experiments train GPT-2 models of different sizes as well as AlexNet, VGG16, and ResNet50. With the synchronous optimization algorithm under a 6-node configuration, the training performance of the GPT2-XL, GPT2-LARGE, and GPT2-MEDIUM models improves, with speed-ups of 1.14, 1.18, and 1.23, respectively. Under a 1-node configuration, AlexNet, VGG16, and ResNet50 also improve, with speed-ups of 1.31, 1.14, and 1.03, respectively. The experimental results indicate that the synchronous optimization algorithm effectively enhances training efficiency under hybrid parallelism.
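The computation-communication overlap described in the abstract follows a general pattern: gradient synchronization for a layer is launched asynchronously as soon as backward propagation produces that layer's gradients, so communication proceeds while backward continues on earlier layers. The sketch below illustrates this pattern with standard PyTorch collective primitives; it is a minimal illustration of the idea only, not the paper's implementation, and the helper names attach_overlap_hooks and finish_overlap are hypothetical.

    # Illustrative sketch (not the authors' code): overlap gradient all-reduce
    # with backward computation via per-parameter hooks. Assumes the default
    # process group is already initialized and PyTorch >= 2.1 for
    # register_post_accumulate_grad_hook.
    import torch.distributed as dist

    def attach_overlap_hooks(model):
        # Hypothetical helper: as soon as a parameter's gradient has been
        # accumulated during backward, launch a non-blocking all-reduce for it,
        # so the communication overlaps with the rest of the backward pass.
        pending = []

        def hook(param):
            work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
            pending.append((work, param))

        for p in model.parameters():
            if p.requires_grad:
                p.register_post_accumulate_grad_hook(hook)
        return pending

    def finish_overlap(pending, world_size):
        # Wait for the outstanding all-reduces, then average the summed gradients.
        for work, param in pending:
            work.wait()
            param.grad.div_(world_size)
        pending.clear()

    # Typical use in a training step:
    #   pending = attach_overlap_hooks(model)   # once, before training
    #   loss.backward()                         # all-reduces overlap with backward
    #   finish_overlap(pending, dist.get_world_size())
    #   optimizer.step()

In practice, production frameworks additionally bucket small gradients into larger tensors before reducing them, which amortizes per-message communication latency.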

Key words: Distributed learning, Hybrid parallel, Automatic segmentation, Communication optimization, Gradient synchronization

CLC Number: TP391