Computer Science ›› 2026, Vol. 53 ›› Issue (3): 33-40. doi: 10.11896/jsjkx.250600073
周悦媛, 卢冠泽, 向佳为, 章家维, 邵恩, 何鑫
ZHOU Yueyuan, LU Guanze, XIANG Jiawei, ZHANG Jiawei, SHAO En, HE Xin
Abstract: As trade friction between China and the United States intensifies, the development of domestic accelerator chips has become increasingly urgent. The Hygon DCU adopts a CUDA-like architecture and, owing to its strong compatibility and cost-effectiveness, is a promising candidate for replacing high-end US chips in artificial intelligence workloads. On the Hygon DCU platform, however, the performance of GEMM kernels, a key operator in large language model (LLM) training, varies significantly. Addressing this phenomenon, this paper studies the effect of matrix transposition on the performance of GEMM kernels in the rocBLAS library and, by modifying PyTorch's linear-layer implementation, proposes two optimization methods for distributed LLM training: minimized transpose and adaptive transpose, which effectively reduce training time. Experimental results show that both methods significantly reduce the distributed training time of a variety of large language models (e.g., OPT-6.7B, LLaMA-7B, and Bloom-7B). Across 83 test cases, the adaptive-transpose optimization performs better in 72 of them, improving end-to-end Megatron-LM training time by up to 24.27% compared with the original PyTorch implementation.
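The abstract does not show the authors' implementation, so the following is only a minimal sketch of the adaptive-transpose idea as one might interpret it: per input shape, benchmark whether a linear layer runs faster with the weight in its default [out_features, in_features] layout (a transposed-operand GEMM) or with an explicitly pre-transposed [in_features, out_features] copy (a non-transposed GEMM), then cache that choice. The class name `AdaptiveTransposeLinear`, the per-shape cache, and the benchmarking policy are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of an "adaptive transpose" linear layer; NOT the authors' code.
# Assumption: the faster GEMM operand layout is chosen per input shape and cached.
import time
import torch
import torch.nn as nn


class AdaptiveTransposeLinear(nn.Module):
    """Linear layer that picks the faster GEMM operand layout per input shape."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        self._prefer_pretransposed = {}  # input shape -> True if pre-transposed layout wins

    def _benchmark(self, x):
        w = self.linear.weight                 # [out_features, in_features]
        w_t = w.t().contiguous()               # explicit copy, [in_features, out_features]

        def timed(fn, iters=10):
            # Warm up once, then time a short loop; synchronize on GPU devices
            # (torch.cuda also covers ROCm/DCU builds of PyTorch).
            fn()
            if x.is_cuda:
                torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iters):
                fn()
            if x.is_cuda:
                torch.cuda.synchronize()
            return time.perf_counter() - start

        t_default = timed(lambda: torch.matmul(x, w.t()))   # transposed-operand GEMM
        t_pretrans = timed(lambda: torch.matmul(x, w_t))    # plain GEMM on a transposed copy
        return t_pretrans < t_default

    def forward(self, x):
        key = tuple(x.shape)
        if key not in self._prefer_pretransposed:
            with torch.no_grad():
                self._prefer_pretransposed[key] = self._benchmark(x)
        if self._prefer_pretransposed[key]:
            y = torch.matmul(x, self.linear.weight.t().contiguous())
        else:
            y = torch.matmul(x, self.linear.weight.t())
        if self.linear.bias is not None:
            y = y + self.linear.bias
        return y
```

In a real training setup the pre-transposed copy would presumably be kept persistent rather than rebuilt every forward pass; it is recomputed here only to keep the sketch short and self-contained.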