Computer Science (计算机科学) ›› 2025, Vol. 52 ›› Issue (4): 291-300. doi: 10.11896/jsjkx.241100030
LI Qing1,2, JIA Haipeng2, ZHANG Yunquan2, ZHANG Sijia1
Abstract: GEMV (general matrix-vector multiplication) is a core routine of the BLAS (Basic Linear Algebra Subprograms) library and is widely used in computer science, engineering, and mathematical computing. With the continuous iteration of the domestically produced Hygon DCU, it has gained certain competitive advantages over traditional GPU vendors; at the same time, as GEMV's application domains keep expanding, its input characteristics are becoming increasingly diverse. Against this background, no single optimization method can deliver high GEMV performance on a GPU platform across all inputs. Therefore, building on traditional optimization techniques such as memory-access optimization, instruction reordering, parallel reduction, shared memory, and thread layout, this paper proposes an input-aware adaptive performance optimization method that automatically adjusts the kernel implementation according to the size and shape of the input matrix to achieve the best performance, significantly improving GEMV performance on the Hygon DCU. Experimental results show that on the Hygon DCU Z100SM, the input-aware GEMV algorithm clearly outperforms the corresponding routines in the rocBLAS library, reaching up to 3.020 3 times the performance of the rocBLAS counterpart across different matrix input sizes.
CLC number:
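The input-aware kernel selection described in the abstract can be sketched as follows. This is a minimal host-side illustration only: the shape threshold and the two kernel variants (row-wise reduction for wide matrices, column accumulation for tall ones) are hypothetical stand-ins for the paper's tuned DCU kernels, not its actual dispatch rules.

```python
def gemv_wide(A, x):
    # Wide matrices (n >> m): on a GPU each thread group would reduce one
    # long row in parallel; simulated here as a plain per-row dot product.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def gemv_tall(A, x):
    # Tall matrices (m >> n): many short rows are processed concurrently;
    # simulated by accumulating scaled columns into the result vector.
    m, n = len(A), len(A[0])
    y = [0.0] * m
    for j in range(n):
        xj = x[j]
        for i in range(m):
            y[i] += A[i][j] * xj
    return y

def gemv_input_aware(A, x, aspect_threshold=4):
    # Hypothetical dispatch: choose a kernel variant from the matrix shape.
    m, n = len(A), len(A[0])
    if n >= aspect_threshold * m:
        return gemv_wide(A, x)
    if m >= aspect_threshold * n:
        return gemv_tall(A, x)
    return gemv_wide(A, x)  # near-square input: default variant
```

In the real library each branch would launch a differently tuned DCU kernel (different thread layout, reduction strategy, and shared-memory usage); the point of the sketch is only that the dispatch decision is driven by the input shape rather than fixed in advance.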