Computer Science ›› 2024, Vol. 51 ›› Issue (12): 120-128. doi: 10.11896/jsjkx.231200128
• High Performance Computing •
XU Jinlong1,3, LI Pengfei2, LI Jianan2, CHEN Biaoyuan2, GAO Wei1, HAN Lin1
[1]SHOEYBI M,PATWARY M,PURI R,et al.Megatron-LM:Training multi-billion parameter language models using model parallelism[J].arXiv:1909.08053,2019.
[2]AO Y,WU Z,YU D,et al.End-to-end adaptive distributed training on PaddlePaddle[J].arXiv:2112.02752,2021.
[3]RASLEY J,RAJBHANDARI S,RUWASE O,et al.DeepSpeed:System optimizations enable training deep learning models with over 100 billion parameters[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2020:3505-3506.
[4]SERGEEV A,DEL BALSO M.Horovod:fast and easy distributed deep learning in TensorFlow[J].arXiv:1802.05799,2018.
[5]GAN S,JIANG J,YUAN B,et al.BAGUA:scaling up distributed learning with system relaxations[J].Proceedings of the VLDB Endowment,2021,15(4):804-813.
[6]LI S,ZHAO Y,VARMA R,et al.PyTorch distributed:Experiences on accelerating data parallel training[J].Proceedings of the VLDB Endowment,2020,13(12):3005-3018.
[7]SINGH S,BHATELE A.AxoNN:An asynchronous,message-driven parallel framework for extreme-scale deep learning[C]//2022 IEEE International Parallel and Distributed Processing Symposium(IPDPS).IEEE,2022:606-616.
[8]SONG J,YIM J,JUNG J,et al.Optimus-CC:Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression[C]//Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems,Volume 2.2023:560-573.
[9]JEAUGEY S.NCCL 2.0[C]//GPU Technology Conference(GTC).2017.
[10]GABRIEL E,FAGG G E,BOSILCA G,et al.Open MPI:Goals,concept,and design of a next generation MPI implementation[C]//Recent Advances in Parallel Virtual Machine and Message Passing Interface:11th European PVM/MPI Users' Group Meeting,Budapest,Hungary,September 19-22,2004,Proceedings 11.Springer Berlin Heidelberg,2004:97-104.
[11]COATES A,HUVAL B,WANG T,et al.Deep learning with COTS HPC systems[C]//International Conference on Machine Learning.PMLR,2013:1337-1345.
[12]WANG E D,YAN R D,GUO Z H,et al.Review of distributed training systems and their optimization algorithms[J].Chinese Journal of Computers,2024,47(1):1-28.
[13]NARAYANAN D,HARLAP A,PHANISHAYEE A,et al.PipeDream:Generalized pipeline parallelism for DNN training[C]//Proceedings of the 27th ACM Symposium on Operating Systems Principles.2019:1-15.
[14]BROWN T,MANN B,RYDER N,et al.Language models are few-shot learners[J].Advances in Neural Information Processing Systems,2020,33:1877-1901.
[15]PATARASUK P,YUAN X.Bandwidth optimal all-reduce algorithms for clusters of workstations[J].Journal of Parallel and Distributed Computing,2009,69(2):117-124.
[16]ZHENG S,MENG Q,WANG T,et al.Asynchronous stochastic gradient descent with delay compensation[C]//International Conference on Machine Learning.PMLR,2017:4120-4129.
[17]ZHANG H,ZHENG Z,XU S,et al.Poseidon:An efficient communication architecture for distributed deep learning on GPU clusters[C]//2017 USENIX Annual Technical Conference.2017:181-193.
[18]REED J,DEVITO Z,HE H,et al.Torch.fx:Practical program capture and transformation for deep learning in Python[J].Proceedings of Machine Learning and Systems,2022,4:638-651.
[19]RADFORD A,WU J,CHILD R,et al.Language models are unsupervised multitask learners[J].OpenAI Blog,2019,1(8):9.
[20]WOLF T,DEBUT L,SANH V,et al.Transformers:State-of-the-art natural language processing[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing:System Demonstrations.2020:38-45.
[21]XU Y,DONG D,XU W,et al.SketchDLC:A sketch on distributed deep learning communication via trace capturing[J].ACM Transactions on Architecture and Code Optimization(TACO),2019,16(2):1-26.