Computer Science ›› 2026, Vol. 53 ›› Issue (3): 33-40. doi: 10.11896/jsjkx.250600073

• Intelligent Information Systems Based on AGI Technology •

  • Corresponding author: SHAO En (shaoen@ict.ac.cn)
  • First author: ZHOU Yueyuan (zhouyueyuan@ict.ac.cn)

Training System for Large Language Models Based on Adaptive Transpose on Hygon DCU

ZHOU Yueyuan, LU Guanze, XIANG Jiawei, ZHANG Jiawei, SHAO En, HE Xin   

  1. State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing 100190, China
    University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2025-06-11 Revised:2026-02-11 Online:2026-03-12
  • About author:ZHOU Yueyuan,born in 1995,postgraduate,engineer,is a member of CCF(No.F6586M).Her main research interest is system software for deep learning.
    SHAO En,born in 1988,Ph.D,senior engineer,master supervisor,is a senior member of CCF(No.51632S).His main research interests include computer system architecture,interconnection networks,heterogeneous resource scheduling,and programming models.
  • Supported by:
    Innovation Funding of ICT,CAS(E461030),National Key R&D Program of China(2021YFB0300202),Youth Innovation Promotion Association of Chinese Academy of Sciences(2021099) and Tianjin Science and Technology Plan(24ZXKJGX00060).


Abstract: With the intensification of trade frictions between China and the United States, the development of domestic accelerator chips in China has become increasingly urgent. The Hygon DCU, with its CUDA-like architecture, excellent compatibility, and cost-effectiveness, has emerged as a strong candidate to replace high-end American chips in the field of artificial intelligence. However, on the Hygon DCU platform, the performance of the GEMM kernel, a critical operator in large language model training, varies significantly. This paper investigates the impact of matrix transposition on the performance of GEMM kernels in the rocBLAS library and proposes two optimization methods, minimized transposition and adaptive transposition, to effectively reduce the training time of large language models. The study modifies the linear-layer implementation in PyTorch and applies these minimized-transpose and adaptive-transpose methods to the distributed training of large language models. Experimental results show that both optimization methods significantly reduce training time in the distributed training of various large-scale language models, such as OPT-6.7B, LLaMA-7B, and Bloom-7B. Across the 83 test cases, the adaptive transposition method performs best in 72 cases, with a maximum improvement of 24.27% in end-to-end training time compared to the original PyTorch-based Megatron-LM.

Key words: Large language model, Training system, Hygon DCU, Matrix transpose
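The adaptive-transpose idea summarized in the abstract, choosing per GEMM whether to use a transposed operand or a materialized pre-transposed copy so that the faster kernel variant is selected, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's PyTorch/rocBLAS implementation; the class name `AdaptiveLinear` and the one-shot timing heuristic are hypothetical and chosen only to make the idea concrete.

```python
import time
import numpy as np

class AdaptiveLinear:
    """Illustrative sketch of an adaptive-transpose linear layer.

    Keeps the weight both in its original (out_features, in_features)
    layout and as a pre-transposed contiguous copy, then decides once,
    by timing, which GEMM variant is faster on the current backend.
    """

    def __init__(self, weight):
        # (out, in) row-major, the layout torch.nn.Linear uses
        self.weight = np.ascontiguousarray(weight)
        # (in, out) pre-transposed contiguous copy
        self.weight_t = np.ascontiguousarray(weight.T)
        self.use_pretransposed = None  # decided lazily at first forward

    def _pick_layout(self, x, reps=3):
        def bench(fn):
            best = float("inf")
            for _ in range(reps):
                t0 = time.perf_counter()
                fn()
                best = min(best, time.perf_counter() - t0)
            return best

        # Variant 1: GEMM against a transposed view (a "T" operand)
        t_view = bench(lambda: x @ self.weight.T)
        # Variant 2: GEMM against the contiguous copy (an "N" operand)
        t_copy = bench(lambda: x @ self.weight_t)
        self.use_pretransposed = t_copy < t_view

    def forward(self, x):
        # x has shape (batch, in_features); output is (batch, out_features)
        if self.use_pretransposed is None:
            self._pick_layout(x)
        if self.use_pretransposed:
            return x @ self.weight_t
        return x @ self.weight.T
```

On real hardware the decision would presumably be made per (shape, transpose-mode) combination against the rocBLAS GEMM kernels rather than by host-side timing; the sketch only shows the layout-selection mechanism.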

CLC number:

  • TP315
[1]HYGON.DCU[EB/OL].https://www.hygon.cn/product/accelerator.
[2]NVIDIA Corporation.CUDA Toolkit - Free Tools and Training[EB/OL].https://developer.nvidia.com/cuda-toolkit.
[3]HYGON.PyTorch[EB/OL].https://das.sourcefind.cn:55011/portal/#/installation?id=04749079-6b33-11ef-b472-005056904552&type=frame.
[4]HYGON.DTK 24.04.3 Development Environment Manual[EB/OL].https://download.sourcefind.cn:65024/1/main/DTK-24.04.3/Document/DTK24.04.3开发环境使用手册.pdf.
[5]SHOEYBI M,PATWARY M,PURI R,et al.Megatron-LM:Training Multi-Billion Parameter Language Models Using GPU Model Parallelism[J].arXiv:1909.08053,2019.
[6]RAJBHANDARI S,RASLEY J,RUWASE O,et al.ZeRO:Memory optimizations Toward Training Trillion Parameter Models[J].arXiv:1910.02054,2019.
[7]BAINES M,BHOSALE S,CAGGIANO V,et al.Fairscale:A general purpose modular PyTorch library for high performance and large scale training[EB/OL].https://github.com/facebookresearch/fairscale.
[8]LIU Z,OGUZ B,ZHAO C,et al.Llm-qat:Data-free quantization aware training for large language models[C]//Findings of the Association for Computational Linguistics:ACL 2024.2024:467-484.
[9]DU D,ZHANG Y,CAO S,et al.Bitdistiller:Unleashing the potential of sub-4-bit llms via self-distillation[J].arXiv:2402.10631,2024.
[10]YAO Z,WU X,LI C,et al.Zeroquant-v2:Exploring post-training quantization in llms from comprehensive study to low rank compensation[J].arXiv:2303.08302,2023.
[11]FRANKLE J,CARBIN M.The lottery ticket hypothesis:Finding sparse,trainable neural networks[J].arXiv:1803.03635,2018.
[12]FRANTAR E,ALISTARH D.Sparsegpt:Massive language models can be accurately pruned in one-shot[C]//International Conference on Machine Learning.PMLR,2023:10323-10337.
[13]HOOPER C,KIM S,MOHAMMADZADEH H,et al.Kvquant:Towards 10 million context length llm inference with kv cache quantization[J].Advances in Neural Information Processing Systems,2024,37:1270-1303.
[14]YUE Y,YUAN Z,DUANMU H,et al.Wkvquant:Quantizing weight and key/value cache for large language models gains more[J].arXiv:2402.12065,2024.
[15]KIM B K,KIM G,KIM T H,et al.Shortened llama:A simple depth pruning for large language models[J].arXiv:2402.02834,2024.
[16]PENG B,QUESNELLE J,ROLNICK D,et al.A Preliminary Report on DisTrO[EB/OL].https://assets.ctfassets.net/jdtwqhzvc2n1/pxVc7MpQSQS7sNGcY7Unk/0328a24ade597df12cb0cde094a9af5d/A_Preliminary_Report_on_DisTrO.pdf.
[17]WANG S,WEI J,SABNE A,et al.Overlap communication with dependent computation via decomposition in large deep learning models[C]//Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems.2022:93-106.
[18]HUANG Y,CHENG Y,BAPNA A,et al.Gpipe:Easy scaling with micro-batch pipeline parallelism[C]//Advances in Neural Information Processing Systems.2019.
[19]ZHANG J,SHAO E,WANG L,et al.AsymFB:Accelerating LLM Training Through Asymmetric Model Parallelism[C]//IFIP International Conference on Network and Parallel Computing.Singapore:Springer,2024:16-27.
[20]QI P,WAN X,HUANG G,et al.Zero bubble pipeline parallelism[J].arXiv:2401.10241,2024.
[21]CHEN T,XU B,ZHANG C,et al.Training deep nets with sublinear memory cost[J].arXiv:1604.06174,2016.
[22]KIRISAME M,LYUBOMIRSKY S,HAAN A,et al.Dynamic tensor rematerialization[J].arXiv:2006.09616,2020.
[23]ANDOORVEEDU M,ZHU Z,ZHENG B,et al.Tempo:Accelerating transformer-based model training through memory footprint reduction[J].Advances in Neural Information Processing Systems,2022,35:12267-12282.
[24]ZHANG S,ROLLER S,GOYAL N,et al.Opt:Open pre-trained transformer language models[J].arXiv:2205.01068,2022.
[25]TOUVRON H,LAVRIL T,IZACARD G,et al.Llama:Openand efficient foundation language models[J].arXiv:2302.13971,2023.
[26]WORKSHOP B S,SCAO T L,FAN A,et al.Bloom:A 176b-parameter open-access multilingual language model[J].arXiv:2211.05100,2022.