Computer Science ›› 2025, Vol. 52 ›› Issue (5): 25-40. doi: 10.11896/jsjkx.240500052
DAI Hanwen, CHEN Changbo
Abstract: Many practical applications require efficiently computing a large number of products of small matrices with differing dimensions; for example, graph classification based on graph neural networks needs to multiply many adjacency matrices with node feature matrices. Existing methods cannot efficiently compute such batched small matrix multiplications with varying dimensions (variable-size batched multiplication, for short) across different hardware platforms. To address this problem, an efficient cross-platform algorithm, BVSM, is proposed on top of the deep learning compiler TVM. By crafting optimization templates specialized for small matrices and applying techniques such as tensorized batching and grouped padding, BVSM enables TVM to perform variable-size batched small matrix multiplication efficiently. Experiments on datasets from real graph classification tasks show that, on CPUs, BVSM achieves more than a twofold average speedup over auto-scheduled and auto-tuned TVM (AnsorTVM), reaching on average 95% of the performance of Intel MKL's variable-size batched matrix multiplication and at best 1.27 times its performance; on GPUs, BVSM achieves an average speedup of 62.05x over AnsorTVM, 28.82x over cuBLAS, and 6.59x over MAGMA's variable-size batched matrix multiplication.
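To make the grouped-padding idea concrete, the following is a minimal sketch in Python, not the paper's implementation; the grouping heuristic, group size, and function name are assumptions for illustration only. Matrices with nearby dimensions are gathered into groups and zero-padded to a common shape, so each group becomes one fixed-shape batched multiplication, which is the kind of uniform workload a compiler such as TVM can optimize well.

import numpy as np

def grouped_padded_batch_matmul(As, Bs, group_size=8):
    # Sketch of grouped padding for variable-size batched matmul.
    # BVSM itself relies on TVM-generated kernels; np.matmul stands in
    # here for a fixed-shape batched GEMM kernel.
    order = sorted(range(len(As)), key=lambda i: As[i].shape)  # put similar sizes together
    results = [None] * len(As)
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        # Common padded shape for this group.
        m = max(As[i].shape[0] for i in idx)
        k = max(As[i].shape[1] for i in idx)
        n = max(Bs[i].shape[1] for i in idx)
        A_pad = np.zeros((len(idx), m, k), dtype=np.float32)
        B_pad = np.zeros((len(idx), k, n), dtype=np.float32)
        for j, i in enumerate(idx):
            A_pad[j, :As[i].shape[0], :As[i].shape[1]] = As[i]
            B_pad[j, :Bs[i].shape[0], :Bs[i].shape[1]] = Bs[i]
        # One fixed-shape batched GEMM per group; zero padding leaves
        # the valid part of each product unchanged.
        C_pad = np.matmul(A_pad, B_pad)
        for j, i in enumerate(idx):
            results[i] = C_pad[j, :As[i].shape[0], :Bs[i].shape[1]]
    return results

Sorting by shape before grouping keeps matrices of similar size in the same group, so the padding overhead per group stays small; in the graph classification setting above, As and Bs would hold the per-graph adjacency and node feature matrices.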