Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230300200-8. DOI: 10.11896/jsjkx.230300200

• Computer Software & Architecture •

Time Cost Model and Optimal Configuration Method for GPU Parallel Computation of Matrix Multiplication

LEI Chao1,2, LIU Jiang1,2, SONG Jiawen3   

  1 Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
    2 Chongqing School, University of Chinese Academy of Sciences, Chongqing 400714, China
    3 Research Institute of Aerospace Technology, Central South University, Changsha 410017, China
  • Published: 2024-06-06
  • About author: LEI Chao, born in 1997, postgraduate. His main research interests include parallel algorithms and iterative linear system solvers.
    LIU Jiang, born in 1979, Ph.D, associate professor, is a member of CCF (No.50519M). His main research interests include computability theory, formal methods and computer algorithms.
  • Supported by:
    National Key R&D Program of China (2018YFC0116704), Science and Technology Service Network Initiative (KFJ-STS-QYZD-2021-01-001), Central South University Project (Research on Parallel and Accelerated Solution of Large-scale Sparse Linear Equations) and Research on Key Techniques of Radar Data Assimilation and Objective Correction of Numerical Forecast (E190600801).

Abstract: Horizontal matrix and vertical matrix multiplication (HVM) is one of the fundamental calculations in scientific computing and engineering, as it largely determines the computational efficiency of higher-level algorithms. GPU parallel computing has become one of the mainstream parallel computing methods, and its underlying design makes it highly suitable for large-scale multiplication calculations. Numerous studies have focused on designing matrix structures and optimizing matrix multiplication using GPU parallel computing frameworks. However, GPU parallel algorithms and optimization methods specifically targeting HVM have been lacking. Furthermore, the configuration of GPU kernel functions directly affects computational efficiency, yet studies on the optimal configuration of kernel functions have been extremely limited, typically requiring researchers to set the configuration heuristically based on the specific computational characteristics of the algorithm. This paper designs a parallel HVM algorithm, PHVM, based on the GPU's thread and memory model. Numerical experiments show that when the horizontal dimension of the left matrix is much larger than its vertical dimension, PHVM significantly outperforms the general matrix multiplication in the NVIDIA cuBLAS library. Furthermore, this paper establishes a theoretical model, based on GPU hardware parameters, for the optimal kernel function configuration of PHVM. The experimental results indicate that this model accurately reflects how the PHVM runtime varies with the kernel function configuration (grid size and thread block size).
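
As a concrete illustration of the computation HVM denotes (the paper's PHVM kernel itself is not reproduced here), the following minimal CUDA sketch computes C = A·B for a horizontal left matrix A (m rows, n columns, n ≫ m) and a vertical right matrix B (n rows, m columns): each thread block reduces a grid-stride slice of the long inner dimension in shared memory and accumulates one partial sum per block into C with an atomic add. The kernel name, row-major layout, all-ones test data, and SM-count-based launch heuristic are illustrative assumptions, not details taken from the paper.

    // Hypothetical HVM sketch (not the paper's PHVM implementation):
    // computes C = A * B, where A is m x n row-major (horizontal, n >> m)
    // and B is n x m row-major (vertical), so C is a small m x m matrix.
    // blockDim.x must be a power of two for the tree reduction below.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void hvmKernel(const float *A, const float *B, float *C,
                              int m, long long n) {
        extern __shared__ float partial[];   // blockDim.x floats
        int i = blockIdx.y / m;              // row of C this block contributes to
        int j = blockIdx.y % m;              // column of C

        // Grid-stride loop over the long inner dimension n.
        float sum = 0.0f;
        for (long long l = blockIdx.x * (long long)blockDim.x + threadIdx.x;
             l < n; l += (long long)gridDim.x * blockDim.x)
            sum += A[i * n + l] * B[l * m + j];

        // Block-level tree reduction in shared memory, then one atomic
        // accumulation per block into the corresponding entry of C.
        partial[threadIdx.x] = sum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) atomicAdd(&C[blockIdx.y], partial[0]);
    }

    int main() {
        const int m = 4;                 // short (vertical) dimension
        const long long n = 1LL << 22;   // long (horizontal) dimension, n >> m

        std::vector<float> hA(m * n, 1.0f), hB(n * m, 1.0f);  // all-ones test data
        float *dA, *dB, *dC;
        cudaMalloc(&dA, m * n * sizeof(float));
        cudaMalloc(&dB, n * m * sizeof(float));
        cudaMalloc(&dC, m * m * sizeof(float));
        cudaMemcpy(dA, hA.data(), m * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), n * m * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemset(dC, 0, m * m * sizeof(float));  // kernel accumulates into C

        // Kernel configuration: grid size and block size are the two knobs the
        // paper's cost model optimizes; the values below are only a heuristic.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int block = 256;                          // threads per block (power of two)
        int grid = prop.multiProcessorCount * 4;  // a few blocks per SM
        hvmKernel<<<dim3(grid, m * m), block, block * sizeof(float)>>>(dA, dB, dC, m, n);
        cudaDeviceSynchronize();

        float hC[m * m];
        cudaMemcpy(hC, dC, m * m * sizeof(float), cudaMemcpyDeviceToHost);
        printf("C[0][0] = %.0f (expected %lld)\n", hC[0], n);  // all-ones data: C[i][j] = n

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }

The launch parameters in the sketch make the paper's configuration question concrete: the grid size and thread block size jointly determine how the length-n reduction is partitioned across streaming multiprocessors, which is precisely the trade-off the proposed time cost model captures; the multiProcessorCount-based choice above is only a placeholder heuristic, not the model's optimum.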

Key words: Matrix multiplication, GPU, CUDA, Kernel function configuration

CLC Number: TP391.9