Computer Science, 2024, Vol. 51, Issue (6A): 230300200-8. DOI: 10.11896/jsjkx.230300200

• Computer Software & Architecture •

Time Cost Model and Optimal Configuration Method for GPU Parallel Computation of Matrix Multiplication

LEI Chao1,2, LIU Jiang1,2, SONG Jiawen3

  1 Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
    2 Chongqing School, University of Chinese Academy of Sciences, Chongqing 400714, China
    3 Research Institute of Aerospace Technology, Central South University, Changsha 410017, China
  • Published: 2024-06-06
  • Corresponding author: LIU Jiang (liujiang@cigit.ac.cn)
  • About author: LEI Chao (leichao@cigit.ac.cn)
  • Supported by:
    National Key R&D Program of China (2018YFC0116704); Regional Key Project of the Science and Technology Service Network Initiative of the Chinese Academy of Sciences (KFJ-STS-QYZD-2021-01-001); Central South University Project (Research on Parallel and Accelerated Solution of Large-scale Sparse Linear Equations); Research on Key Techniques of Radar Data Assimilation and Objective Correction of Numerical Forecast (E190600801)

Time Cost Model and Optimal Configuration Method for GPU Parallel Computation of Matrix Multiplication

LEI Chao1,2, LIU Jiang1,2, SONG Jiawen3   

  1 Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
    2 Chongqing School, University of Chinese Academy of Sciences, Chongqing 400714, China
    3 Research Institute of Aerospace Technology, Central South University, Changsha 410017, China
  • Published: 2024-06-06
  • About author: LEI Chao, born in 1997, postgraduate. His main research interests include parallel algorithms and iterative linear system solvers.
    LIU Jiang, born in 1979, Ph.D., associate professor, is a member of CCF (No.50519M). His main research interests include computability theory, formal methods and computer algorithms.
  • Supported by:
    National Key R&D Program of China (2018YFC0116704), Science and Technology Service Network Initiative (KFJ-STS-QYZD-2021-01-001), Central South University Project (Research on Parallel and Accelerated Solution of Large-scale Sparse Linear Equations) and Research on Key Techniques of Radar Data Assimilation and Objective Correction of Numerical Forecast (E190600801).

Abstract: Multiplying a horizontal matrix by a vertical matrix is one of the fundamental computations in scientific computing and engineering, and it largely determines the efficiency of the overall algorithm. GPU parallel computing is one of today's mainstream parallel computing approaches, and its underlying design makes the GPU particularly well suited to large-scale matrix computation. Many studies have designed and optimized matrix multiplication for particular matrix structures on GPU parallel computing frameworks, but no GPU parallel algorithm or optimization has yet targeted the product of a horizontal matrix and a vertical matrix. Moreover, the GPU kernel configuration directly affects computational efficiency, yet research on optimal kernel configuration is extremely limited, and researchers usually have to set it heuristically according to the computational characteristics of the specific algorithm. Based on the GPU thread and memory model, a parallel horizontal-by-vertical matrix multiplication algorithm, PHVM, is designed. Numerical experiments show that when the horizontal dimension of the left matrix is much larger than its vertical dimension, PHVM significantly outperforms the general matrix multiplication in the NVIDIA cuBLAS library. Furthermore, based on GPU hardware parameters, a theoretical model for the kernel configuration that minimizes the running time of PHVM is established. Numerical experiments show that the model accurately describes how the running time of PHVM varies with the kernel configuration (grid size and thread block size), and the theoretically optimal kernel configuration given by the model agrees with the actual optimal configuration.
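To make the computation concrete, the following minimal CUDA sketch shows one possible way (not necessarily the paper's PHVM kernel, whose details are given in the full text) to parallelize the product C = A·B of a horizontal matrix A (m×k, k ≫ m) and a vertical matrix B (k×n, k ≫ n): the long reduction over the shared dimension k is split across thread blocks, each block reduces its partial sums in shared memory, and one atomic add per block accumulates into the small m×n output. The kernel name, row-major layout and single-precision types are assumptions made for this illustration.

#include <cuda_runtime.h>

// Each block in the y-direction owns one entry of the small output C (m x n);
// blocks in the x-direction split the long reduction over the shared dimension k.
// blockDim.x is assumed to be a power of two for the tree reduction.
__global__ void hvmKernel(const float* A, const float* B, float* C,
                          int m, int k, int n)
{
    extern __shared__ float partial[];      // one float per thread
    int out = blockIdx.y;                   // index of the C entry handled by this block
    int i = out / n;                        // row of A
    int j = out % n;                        // column of B

    // Grid-stride loop over the shared dimension k (row-major layouts).
    float acc = 0.0f;
    for (int p = blockIdx.x * blockDim.x + threadIdx.x; p < k;
         p += gridDim.x * blockDim.x)
        acc += A[i * k + p] * B[p * n + j];

    // Block-level tree reduction of the per-thread partial sums.
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic add per block folds this block's contribution into C.
    if (threadIdx.x == 0)
        atomicAdd(&C[out], partial[0]);
}

A launch such as hvmKernel<<<dim3(gridX, m * n), blockSize, blockSize * sizeof(float)>>>(dA, dB, dC, m, k, n), with dC zero-initialized (e.g. by cudaMemset) and blockSize a power of two, computes C; the choice of gridX (grid size) and blockSize (thread block size) is exactly the kernel configuration whose effect on running time the paper models.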

Key words: Matrix multiplication, GPU, CUDA, Kernel function configuration

Abstract: Horizontal matrix & vertical matrix multiplication (HVM) is one of the fundamental calculations in scientific computing and engineering, as it largely affects the computational efficiency of higher-level algorithms. GPU parallel computing has become one of the mainstream parallel computing methods, and its underlying design makes it highly suitable for large-scale matrix calculations. Numerous studies have focused on designing matrix structures and optimizing matrix multiplication using GPU parallel computing frameworks. However, there has been a lack of GPU parallel algorithms and optimization methods specifically targeting HVM. Furthermore, the configuration of GPU kernel functions directly affects computational efficiency, but studies on the optimal configuration of kernel functions have been extremely limited, typically requiring researchers to set them heuristically based on the specific computational characteristics of the algorithm. This paper designs a parallel HVM algorithm, PHVM, based on the GPU's thread and memory model. Numerical experimental results show that when the horizontal dimension of the left matrix is much larger than the vertical dimension, PHVM significantly outperforms the general matrix multiplication in the NVIDIA cuBLAS library. Furthermore, this paper establishes a theoretical model, based on GPU hardware parameters, for the kernel function configuration that minimizes PHVM runtime. The numerical experimental results indicate that this theoretical model accurately reflects how the PHVM runtime changes with the kernel function configuration (grid size and thread block size), and the theoretically optimal configuration predicted by the model matches the measured optimal configuration.
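To make the kernel-configuration question concrete, the following hypothetical host-side sweep (illustrative only, not the paper's theoretical model) times a kernel over a range of grid and block sizes with CUDA events; an analytical time cost model of the kind described above would be validated against exactly this sort of measured runtime-versus-configuration data. The routine assumes the hvmKernel sketch shown after the Chinese abstract and already-initialized device buffers dA (m×k), dB (k×n) and dC (m×n); the configuration ranges are arbitrary choices for the illustration.

#include <cstdio>
#include <cuda_runtime.h>

// Forward declaration of the illustrative kernel sketched earlier.
__global__ void hvmKernel(const float*, const float*, float*, int, int, int);

// Measure kernel run time for each (thread block size, grid size) pair.
void sweepConfigurations(const float* dA, const float* dB, float* dC,
                         int m, int k, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int block = 64; block <= 1024; block *= 2) {      // threads per block
        for (int gridX = 1; gridX <= 256; gridX *= 2) {    // blocks along k
            cudaMemset(dC, 0, m * n * sizeof(float));      // kernel accumulates into dC
            cudaEventRecord(start);
            hvmKernel<<<dim3(gridX, m * n), block, block * sizeof(float)>>>(
                dA, dB, dC, m, k, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("block=%4d  gridX=%3d  time=%.3f ms\n", block, gridX, ms);
        }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

The empirically best (gridX, block) pair found by such a sweep is what a theoretical optimal configuration derived from hardware parameters would be compared against.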

Key words: Matrix multiplication, GPU, CUDA, Kernel function configuration

CLC Number:

  • TP391.9