Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230300200-8. doi: 10.11896/jsjkx.230300200
雷超1,2, 刘江1,2, 宋佳文3
LEI Chao1,2, LIU Jiang1,2, SONG Jiawen3
Abstract: Multiplying a horizontal matrix by a vertical matrix is one of the fundamental computations in scientific computing and engineering, and it largely determines the efficiency of the overall algorithm. GPU parallel computing is one of today's mainstream parallel computing paradigms, and the underlying design of the GPU makes it well suited to large-scale matrix computation. Many studies have designed and optimized matrix multiplication for particular matrix structures on GPU parallel computing frameworks, but no GPU parallel algorithm or optimization has yet targeted horizontal-matrix-by-vertical-matrix multiplication. Moreover, the kernel launch configuration directly affects computational efficiency, yet research on optimal kernel configuration remains very limited, and researchers usually set it heuristically according to the computational characteristics of the specific algorithm. Based on the GPU thread and memory model, this paper designs a parallel horizontal-by-vertical matrix multiplication algorithm, PHVM. Numerical experiments show that when the horizontal dimension of the left matrix is far larger than its vertical dimension, PHVM significantly outperforms the general matrix multiplication in the NVIDIA cuBLAS library. Furthermore, based on GPU hardware parameters, a theoretical model is established that optimizes the kernel configuration with respect to PHVM's running time. Numerical experiments show that this theoretical model accurately describes how PHVM's running time varies with the kernel configuration (grid size and thread block size), and the theoretically optimal kernel configuration derived from the model matches the configuration that is optimal in practice.
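The problem shape described above can be illustrated with a minimal NumPy sketch (the actual PHVM algorithm is a CUDA kernel; the chunk count and variable names here are illustrative assumptions, not the paper's implementation). When A is n×N with N ≫ n and B is N×n, the product C = A·B is a tiny n×n matrix, and the work is dominated by length-N dot products. Splitting the long dimension into chunks and summing the partial products mirrors the two-stage reduction structure that a GPU grid of thread blocks would perform:

```python
import numpy as np

# Shape targeted by PHVM: A is "horizontal" (n x N, N >> n),
# B is "vertical" (N x n), so C = A @ B is only n x n.
n, N = 4, 1 << 16
rng = np.random.default_rng(0)
A = rng.standard_normal((n, N))
B = rng.standard_normal((N, n))

# Mimic a grid of thread blocks: split the long shared dimension into
# chunks, compute one small n x n partial product per chunk, then
# reduce the partials. num_blocks stands in for the grid size and is
# purely illustrative.
num_blocks = 256
A_chunks = np.array_split(A, num_blocks, axis=1)
B_chunks = np.array_split(B, num_blocks, axis=0)
C = sum(a @ b for a, b in zip(A_chunks, B_chunks))
```

The chunked result agrees with a single full multiplication (`np.allclose(C, A @ B)`); on a GPU, the analogous choice of how many partial products to form is exactly the kernel-configuration question the paper's theoretical model addresses.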