Computer Science ›› 2025, Vol. 52 ›› Issue (4): 291-300. doi: 10.11896/jsjkx.241100030

• High Performance Computing •


Input-aware Generalized Matrix-Vector Product Algorithm for Adaptive Performance Optimization on Hygon DCU

LI Qing1,2, JIA Haipeng2, ZHANG Yunquan2, ZHANG Sijia1   

  1. School of Information Engineering,Dalian Ocean University,Dalian,Liaoning 116023,China
    2. Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
  • Received:2024-11-05 Revised:2025-01-09 Online:2025-04-15 Published:2025-04-14
  • Corresponding author:JIA Haipeng(jiahaipeng@ict.ac.cn)
  • About author:LI Qing(l2454885722@163.com),born in 1999,postgraduate,is a member of CCF(No.P7721G).His main research interests include parallel computing and high performance computing.
    JIA Haipeng,born in 1983,Ph.D,senior engineer,is a member of CCF(No.31889M).His main research interests include parallel computing and heterogeneous computing.
  • Supported by:
    National Key Research and Development Program of China(2023YFB3001701) and National Natural Science Foundation of China(62372432).


Abstract: GEMV (generalized matrix-vector multiplication) is a core routine of the BLAS (basic linear algebra subprograms) library and is widely used in computer science, engineering computation, and mathematical computation. As the domestically produced Hygon DCU continues to iterate and upgrade, it has gained certain competitive advantages over traditional GPU vendors; at the same time, as the application domains of GEMV keep expanding, its input characteristics are becoming increasingly diverse. Under these conditions, no single optimization method can deliver high GEMV performance for every input on a GPU computing platform. Therefore, on top of traditional optimization techniques such as memory access optimization, instruction reordering, parallel reduction, shared memory, and thread layout, this paper proposes an input-aware adaptive performance optimization method that automatically selects the kernel implementation according to the size and shape of the input matrix, significantly improving GEMV performance on the Hygon DCU. Experimental results show that on the Hygon DCU Z100SM, the overall performance of the input-aware GEMV algorithm is clearly superior to that of the corresponding routines in the rocBLAS library, achieving a maximum speedup of 3.020 3 times over rocBLAS across different matrix input sizes.
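The abstract describes the shape-based kernel selection only at a high level. As a rough, self-contained C++ sketch of what such an input-aware dispatch layer could look like: the shape classes, thresholds, and names (ShapeClass, classify, gemv_reference, gemv_input_aware) are all hypothetical illustrations, not the authors' implementation, and every branch here falls back to a scalar reference loop where the real library would launch a differently tuned DCU kernel.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// GEMV computes y = alpha * A * x + beta * y for an m x n matrix A
// (row-major here). This scalar loop only fixes the semantics; the
// paper's kernels run on the Hygon DCU and use shared memory,
// parallel reduction, instruction reordering, etc.
void gemv_reference(std::size_t m, std::size_t n, float alpha,
                    const float* A, const float* x, float beta, float* y) {
    for (std::size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < n; ++j)
            acc += A[i * n + j] * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}

// Hypothetical shape classes; the thresholds are placeholders, not the
// paper's tuned values.
enum class ShapeClass { SmallSquare, TallSkinny, ShortWide, LargeGeneral };

ShapeClass classify(std::size_t m, std::size_t n) {
    if (m <= 1024 && n <= 1024) return ShapeClass::SmallSquare;
    if (m >= 8 * n)             return ShapeClass::TallSkinny;
    if (n >= 8 * m)             return ShapeClass::ShortWide;
    return ShapeClass::LargeGeneral;
}

// Input-aware dispatch: each branch would launch a differently tuned
// DCU kernel (one thread per row, one wavefront per row with a shared
// memory reduction, 2D tiling, ...). Here every branch falls back to
// the scalar reference so the sketch stays runnable on any host.
void gemv_input_aware(std::size_t m, std::size_t n, float alpha,
                      const float* A, const float* x, float beta, float* y) {
    switch (classify(m, n)) {
        case ShapeClass::TallSkinny:   // e.g. one thread per row
        case ShapeClass::ShortWide:    // e.g. wavefront reduction per row
        case ShapeClass::SmallSquare:  // e.g. single work-group kernel
        case ShapeClass::LargeGeneral: // e.g. 2D tiling + shared memory
            gemv_reference(m, n, alpha, A, x, beta, y);
            break;
    }
}

int main() {
    const std::size_t m = 4, n = 3;
    std::vector<float> A = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    std::vector<float> x = {1, 1, 1}, y(m, 0.0f);
    gemv_input_aware(m, n, 1.0f, A.data(), x.data(), 0.0f, y.data());
    for (float v : y) std::printf("%g\n", v);  // prints 6 15 24 33
    return 0;
}
```

The design point the sketch tries to convey is that the kernel choice is a pure function of the input shape (m, n), so a single GEMV entry point can route each call to the variant that suits it instead of relying on one kernel for all inputs.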

Key words: Generalized matrix-vector multiplication, DCU, BLAS library, Adaptive tuning, Performance optimization

CLC number: TP311