Computer Science ›› 2025, Vol. 52 ›› Issue (4): 291-300.doi: 10.11896/jsjkx.241100030

• High Performance Computing •

Input-aware Generalized Matrix-Vector Product Algorithm for Adaptive Performance Optimization of Hygon DCU

LI Qing1,2, JIA Haipeng2, ZHANG Yunquan2, ZHANG Sijia1   

  1. School of Information Engineering,Dalian Ocean University,Dalian,Liaoning 116023,China
  2. Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
  • Received:2024-11-05 Revised:2025-01-09 Online:2025-04-15 Published:2025-04-14
  • About author:LI Qing,born in 1999,postgraduate,is a member of CCF(No.P7721G).His main research interests include parallel computing and high performance computing.
    JIA Haipeng,born in 1983,Ph.D,senior engineer,is a member of CCF(No.31889M).His main research interests include parallel computing and heterogeneous computing.
  • Supported by:
    National Key Research and Development Program of China(2023YFB3001701) and National Natural Science Foundation of China(62372432).

Abstract: GEMV(general matrix-vector multiplication) is a core routine of the BLAS(basic linear algebra subprograms) library and is widely used in computer science,engineering computation,and mathematical computation.With the continuous iteration of the domestic Hygon DCU,it has become competitive with the products of traditional GPU vendors.Meanwhile,as the application domains of GEMV keep expanding,its input characteristics have become increasingly diverse,so no single optimization method can deliver high performance for GEMV across all inputs on GPU computing platforms.Therefore,building on traditional optimization techniques such as memory-access optimization,instruction reordering,parallel reduction,shared memory,and thread scheduling,this paper proposes an input-aware adaptive performance optimization method that automatically selects the computation-kernel implementation according to the size and shape of the input matrix,so as to achieve optimal performance and significantly improve the performance of GEMV on the Hygon DCU.Experimental results show that the input-aware general matrix-vector multiplication algorithm implemented in this paper clearly outperforms the corresponding routines in the rocBLAS library on the Hygon DCU Z100SM,achieving a maximum speedup of 3.020 3 times over rocBLAS across different matrix input sizes.

Key words: Generalized matrix-vector multiplication, DCU, Basic linear algebra subprograms(BLAS) library, Adaptive tuning, Performance optimization

CLC Number: 

  • TP311