Computer Science ›› 2026, Vol. 53 ›› Issue (3): 33-40.doi: 10.11896/jsjkx.250600073

• Intelligent Information System Based on AGI Technology •

Training System for Large Language Models Based on Adaptive Transpose on Hygon DCU

ZHOU Yueyuan, LU Guanze, XIANG Jiawei, ZHANG Jiawei, SHAO En, HE Xin   

  1. State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing 100190, China
    University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2025-06-11  Revised: 2026-02-11  Published: 2026-03-12
  • About author: ZHOU Yueyuan, born in 1995, postgraduate, engineer, is a member of CCF (No.F6586M). Her main research interest is system software for deep learning.
    SHAO En, born in 1988, Ph.D, senior engineer, master supervisor, is a senior member of CCF (No.51632S). His main research interests include computer system architecture, interconnection networks, heterogeneous resource scheduling, and programming models.
  • Supported by:
    Innovation Funding of ICT, CAS (E461030), National Key R&D Program of China (2021YFB0300202), Youth Innovation Promotion Association of Chinese Academy of Sciences (2021099) and Tianjin Science and Technology Plan (24ZXKJGX00060).

Abstract: With the intensification of trade frictions between China and the United States, the development of domestic accelerator chips in China has become increasingly urgent. The Hygon DCU, with its CUDA-like architecture, good compatibility, and cost-effectiveness, has emerged as a strong candidate to replace high-end American chips in the field of artificial intelligence. However, on the Hygon DCU platform, the performance of the GEMM kernel, a critical operator in large language model training, varies significantly. This paper investigates the impact of matrix transposition on the performance of GEMM kernels in the rocBLAS library and proposes two optimization methods, transpose minimization and adaptive transposition, which effectively reduce the training time of large language models. The study modifies the implementation of the linear layer in PyTorch and applies both methods to the distributed training of large language models. Experimental results show that the two optimizations significantly reduce training time in the distributed training of various large language models, such as OPT-6.7B, LLaMA-7B, and Bloom-7B. In 72 of the 83 test cases, the adaptive transposition method performs best, with a maximum improvement of 24.27% in end-to-end training time over the original PyTorch-based Megatron-LM.

Key words: Large language model, Training system, Hygon DCU, Matrix transpose

CLC Number: TP315
[1]HYGON.DCU[EB/OL].https://www.hygon.cn/product/accelerator.
[2]NVIDIA Corporation.CUDA Toolkit - Free Tools and Training[EB/OL].https://developer.nvidia.com/cuda-toolkit.
[3]HYGON.PyTorch[EB/OL].https://das.sourcefind.cn:55011/portal/#/installation?id=04749079-6b33-11ef-b472-005056904552&type=frame.
[4]HYGON.PyTorch[EB/OL].https://download.sourcefind.cn:65024/1/main/DTK-24.04.3/Document/DTK24.04.3开发环境使用手册.pdf.
[5]SHOEYBI M,PATWARY M,PURI R,et al.Megatron-LM:Training Multi-Billion Parameter Language Models Using GPU Model Parallelism[J].arXiv:1909.08053,2019.
[6]RAJBHANDARI S,RASLEY J,RUWASE O,et al.ZeRO:Memory optimizations Toward Training Trillion Parameter Models[J].arXiv:1910.02054,2019.
[7]BAINES M,BHOSALE S,CAGGIANO V,et al.Fairscale:A general purpose modular PyTorch library for high performance and large scale training[EB/OL].https://github.com/facebookresearch/fairscale.
[8]LIU Z,OGUZ B,ZHAO C,et al.Llm-qat:Data-free quantization aware training for large language models[C]//Findings of the Association for Computational Linguistics:ACL 2024.2024:467-484.
[9]DU D,ZHANG Y,CAO S,et al.Bitdistiller:Unleashing the potential of sub-4-bit llms via self-distillation[J].arXiv:2402.10631,2024.
[10]YAO Z,WU X,LI C,et al.Zeroquant-v2:Exploring post-training quantization in llms from comprehensive study to low rank compensation[J].arXiv:2303.08302,2023.
[11]FRANKLE J,CARBIN M.The lottery ticket hypothesis:Finding sparse,trainable neural networks[J].arXiv:1803.03635,2018.
[12]FRANTAR E,ALISTARH D.Sparsegpt:Massive language models can be accurately pruned in one-shot[C]//International Conference on Machine Learning.PMLR,2023:10323-10337.
[13]HOOPER C,KIM S,MOHAMMADZADEH H,et al.Kvquant:Towards 10 million context length llm inference with kv cache quantization[J].Advances in Neural Information Processing Systems,2024,37:1270-1303.
[14]YUE Y,YUAN Z,DUANMU H,et al.Wkvquant:Quantizing weight and key/value cache for large language models gains more[J].arXiv:2402.12065,2024.
[15]KIM B K,KIM G,KIM T H,et al.Shortened llama:A simple depth pruning for large language models[J].arXiv:2402.02834,2024.
[16]PENG B,QUESNELLE J,ROLNICK D,et al.A Preliminary Report on DisTrO[EB/OL].https://assets.ctfassets.net/jdtwqhzvc2n1/pxVc7MpQSQS7sNGcY7Unk/0328a24ade597df12cb0cde094a9af5d/A_Preliminary_Report_on_DisTrO.pdf.
[17]WANG S,WEI J,SABNE A,et al.Overlap communication with dependent computation via decomposition in large deep learning models[C]//Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems.2022:93-106.
[18]HUANG Y,CHENG Y,BAPNA A,et al.GPipe:Easy scaling with micro-batch pipeline parallelism[C]//Proceedings of Computer Vision and Pattern Recognition.2019.
[19]ZHANG J,SHAO E,WANG L,et al.AsymFB:Accelerating LLM Training Through Asymmetric Model Parallelism[C]//IFIP International Conference on Network and Parallel Computing.Singapore:Springer,2024:16-27.
[20]QI P,WAN X,HUANG G,et al.Zero bubble pipeline parallelism[J].arXiv:2401.10241,2024.
[21]CHEN T,XU B,ZHANG C,et al.Training deep nets with sublinear memory cost[J].arXiv:1604.06174,2016.
[22]KIRISAME M,LYUBOMIRSKY S,HAAN A,et al.Dynamic tensor rematerialization[J].arXiv:2006.09616,2020.
[23]ANDOORVEEDU M,ZHU Z,ZHENG B,et al.Tempo:Accelerating transformer-based model training through memory footprint reduction[J].Advances in Neural Information Processing Systems,2022,35:12267-12282.
[24]ZHANG S,ROLLER S,GOYAL N,et al.Opt:Open pre-trained transformer language models[J].arXiv:2205.01068,2022.
[25]TOUVRON H,LAVRIL T,IZACARD G,et al.Llama:Open and efficient foundation language models[J].arXiv:2302.13971,2023.
[26]WORKSHOP B S,SCAO T L,FAN A,et al.Bloom:A 176b-parameter open-access multilingual language model[J].arXiv:2211.05100,2022.