Computer Science ›› 2025, Vol. 52 ›› Issue (5): 25-40.doi: 10.11896/jsjkx.240500052

• High Performance Computing •

Accelerating Batched Matrix Multiplication for Variable Small Sizes Based on TVM and Applications

DAI Hanwen, CHEN Changbo   

  1. Chongqing Institute of Green and Intelligent Technology,Chinese Academy of Sciences,Chongqing 400714,China
  2. Chongqing School,University of Chinese Academy of Sciences,Chongqing 400714,China
  • Received:2024-05-13 Revised:2024-11-29 Online:2025-05-15 Published:2025-05-12
  • About author:DAI Hanwen,born in 1998,postgraduate.His main research interests include high performance computing and machine learning.
    CHEN Changbo,born in 1981,Ph.D,professor,is a member of CCF(No.64614M).His main research interests include computer algebra,high performance computing and applied machine learning.
  • Supported by:
    National Key R&D Program of China(2023YFA1009402,2020YFA0712300), Chongqing Talents Plan Youth Top-Notch Project(2021000263) and Chongqing Academician-led Science and Technology Innovation Guidance Projects(cstc2021yszx-jcyjX0004,2022YSZX-JCX0011CSTB,CSTB2023YSZX-JCX0008).

Abstract: Many practical applications require the efficient computation of a large number of small matrix products of varying dimensions. For instance, in graph classification tasks based on graph neural networks, many adjacency matrices must be multiplied with node feature matrices. To address the inability of existing methods to efficiently handle batched matrix multiplication of variable small sizes across different hardware platforms, this paper introduces BVSM, an efficient cross-platform algorithm built on the deep learning compiler TVM. BVSM enables TVM to perform batched matrix multiplication of small and variable sizes efficiently by utilizing customized optimization templates for small matrices, employing tensorization for batching, and applying grouped padding. Experiments on real graph classification datasets demonstrate that, on the CPU, BVSM achieves on average over a 2x speedup compared with auto-scheduled and auto-tuned TVM (AnsorTVM), reaches on average 95% of the performance of Intel MKL's variable-sized batched matrix multiplication, and outperforms it by up to 1.27 times. On the GPU, BVSM achieves on average over a 62.05x speedup compared with AnsorTVM, over a 28.82x speedup compared with cuBLAS, and over a 6.59x speedup compared with the variable-sized batched matrix multiplication of MAGMA.
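To make the grouped-padding idea concrete, below is a minimal NumPy sketch (not BVSM's actual code; all names are illustrative). It buckets variable-sized operand pairs by their padded shape, zero-pads each bucket to a common size, and runs one fixed-size batched multiplication per bucket; in BVSM the per-bucket kernel would be a tuned, tensorized TVM template rather than einsum.

```python
# Hypothetical sketch of grouped padding for variable-sized batched matmul.
import numpy as np
from collections import defaultdict

def grouped_padded_bmm(As, Bs, bucket=4):
    """Compute As[i] @ Bs[i] for matrices of variable small sizes.

    Each dimension is rounded up to a multiple of `bucket`, so matrices that
    land in the same (M, K, N) bucket share one fixed-size batched matmul.
    """
    up = lambda x: -(-x // bucket) * bucket  # round up to a bucket multiple
    groups = defaultdict(list)
    for i, (A, B) in enumerate(zip(As, Bs)):
        m, k = A.shape
        k2, n = B.shape
        assert k == k2, "inner dimensions must match"
        groups[(up(m), up(k), up(n))].append(i)

    out = [None] * len(As)
    for (M, K, N), idx in groups.items():
        # Zero-pad every member of the group to the common shape (M, K)/(K, N).
        Ap = np.zeros((len(idx), M, K))
        Bp = np.zeros((len(idx), K, N))
        for j, i in enumerate(idx):
            Ap[j, :As[i].shape[0], :As[i].shape[1]] = As[i]
            Bp[j, :Bs[i].shape[0], :Bs[i].shape[1]] = Bs[i]
        # One fixed-size batched matmul per group; a tuned TVM kernel would
        # replace this einsum in the real pipeline.
        Cp = np.einsum('bmk,bkn->bmn', Ap, Bp)
        # Zero padding does not disturb the valid top-left block, so each
        # true product can be sliced back out.
        for j, i in enumerate(idx):
            out[i] = Cp[j, :As[i].shape[0], :Bs[i].shape[1]]
    return out

# Example: three products with different small shapes.
As = [np.random.rand(3, 5), np.random.rand(4, 4), np.random.rand(7, 2)]
Bs = [np.random.rand(5, 6), np.random.rand(4, 4), np.random.rand(2, 3)]
Cs = grouped_padded_bmm(As, Bs)
assert all(np.allclose(C, A @ B) for C, A, B in zip(Cs, As, Bs))
```

Grouping bounds the number of distinct kernel shapes that must be generated and tuned, at the cost of a small amount of redundant work on the zero padding.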

Key words: TVM, batched matrix multiplication, matrix multiplication for variable sizes

CLC Number: TP302
[1]WU S,ZHAI Y,LIU J,et al.Anatomy of high performance GEMM with online fault tolerance on GPUs[C]//Proceedings of the 37th International Conference on Supercomputing.2023:360-372.
[2]XIA T,FU G L,QU S R,et al.Optimization of parallel computation on sparse matrix vector multiplication with high predictability[J].Journal of Computer Research and Development,2023,60(9):1973-1987.
[3]HWANG R,KANG M,LEE J,et al.GROW:A row-stationary sparse-dense GEMM accelerator for memory-efficient graph convolutional neural networks[C]//Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture(HPCA).IEEE,2023:42-55.
[4]REGGIANI E,PAPPALARDO A,DOBLAS M,et al.Mix-GEMM:An efficient HW-SW architecture for mixed-precision quantized deep neural networks inference on edge devices[C]//Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture(HPCA).IEEE,2023:1085-1098.
[5]HU Y,YE Z,WANG M,et al.FeatGraph:A flexible and efficient backend for graph neural network systems[C]//Proceedings of the SC20:International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE,2020:1-13.
[6]LI M,LIU Y,LIU X,et al.The deep learning compiler:A comprehensive survey[J].IEEE Transactions on Parallel and Distributed Systems,2021,32(3):708-721.
[7]ZENG J,KOUM Y,ZHENG X Y,et al.TVMT:A high performance neural network training compiler based on TVM[J].SCIENTIA SINICA Informationis,2023,53(12):2458-2471.
[8]CHEN T,MOREAU T,JIANG Z,et al.TVM:An automated end-to-end optimizing compiler for deep learning[C]//Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation(OSDI 18).2018:578-594.
[9]ZHENG L,JIA C,SUN M,et al.Ansor:Generating high performance tensor programs for deep learning[C]//Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation(OSDI 20).2020:863-879.
[10]VALERO L P,MARTINEZ P I,MATEO S,et al.Variable batched DGEMM[C]//Proceedings of the 2018 26th Euromicro International Conference on Parallel,Distributed and Network-based Processing.IEEE,2018:363-367.
[12]DONG T,HAIDAR A,LUSZCZEK P,et al.Magma batched:A batched BLAS approach for small matrix factorizations and applications on GPUs[R].Knoxville:Innovative Computing Laboratory,University of Tennessee,2016.
[13]Introducing batch GEMM operations[EB/OL].https://www.intel.cn/content/www/cn/zh/developer/articles/technical/introducing-batch-gemm-operations.html.
[14]Matrix Algebra for GPU and Multicore Architectures[EB/OL].https://icl.utk.edu/projectsfiles/magma/.
[15]ZHANG S,TONG H,XU J,et al.Graph convolutional networks:A comprehensive review[J].Computational Social Networks,2019,6(1):1-23.
[16]XIAO G Q,LI X Q,CHEN Y D,et al.A review of large scale graph neural networks[J].Chinese Journal of Computers,2024,47(1):148-171.
[17]LIU L,CHEN C B.Band sparse matrix multiplication and efficient GPU implementation[J].Journal of Computer Applications,2023,43(12):3856-3867.
[18]MORRIS C,KRIEGE N M,BAUSE F,et al.TUDataset:A collection of benchmark datasets for learning with graphs[J].arXiv:2007.08663,2020.
[19]YANG Z,LU L,WANG R.A batched GEMM optimization framework for deep learning[J].The Journal of Supercomputing,2022,78(11):13393-13408.
[20]HUANG C,JIANG H,QUAN Z,et al.Design and implementation of batch matrix multiplication for deep learning[J].Chinese Journal of Computers,2022,45(2):225-239.
[21]LEDOUX L,CASAS M.A generator of numerically tailored and high throughput accelerators for batched GEMMs[C]//Proceedings of the 2022 IEEE 30th Annual International Symposium on Field Programmable Custom Computing Machines(FCCM).IEEE,2022:1-10.
[22]HUANG R,ZHAO Y,YU T,et al.Batched LU factorization with fast row interchanges for small matrices on GPUs[C]//Proceedings of the 2022 IEEE 24th International Conference on High Performance Computing & Communications;8th International Conference on Data Science & Systems;20th International Conference on Smart City;8th International Conference on Dependability in Sensor,Cloud & Big Data Systems & Application(HPCC/DSS/Smart-City/DependSys).IEEE,2022:259-266.
[23]ABDELFATTAH A,TOMOV S,DONGARRA J.Matrix multiplication on batches of small matrices in half and half-complex precisions[J].Journal of Parallel and Distributed Computing,2020,145:188-201.
[24]ABDELFATTAH A,TOMOV S,DONGARRA J.Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs[C]//Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium(IPDPS).IEEE,2019:111-122.
[25]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs[C]//Proceedings of the International Conference on Supercomputing.2017:1-10.
[26]LI X,LIANG Y,YAN S,et al.A coordinated tiling and batching framework for efficient GEMM on GPUs[C]//Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming.2019:229-241.
[27]WANG R,YANG Z,XU H,et al.A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution[J].The Journal of Supercomputing,2022,78(2):1741-1758.
[28]ZHANG Y,WANG Y,MO Z,et al.Accelerating small matrix multiplications by adaptive batching strategy on GPU[C]//Proceedings of the 2022 IEEE 24th International Conference on High Performance Computing & Communications;8th International Conference on Data Science & Systems;20th International Conference on Smart City;8th International Conference on Dependability in Sensor,Cloud & Big Data Systems & Application(HPCC/DSS/Smart-City/DependSys).IEEE,2022:882-887.
[29]HUANG G,DAI G,WANG Y,et al.GE-SpMM:General-purpose sparse matrix-matrix multiplication on GPUs for graph neural networks[C]//Proceedings of the SC20:International Conference for High Performance Computing,Networking,Storage and Analysis.IEEE,2020:1-12.
[30]SHAN M,GUREVIN D,NYE J,et al.MergePath-SpMM:Parallel sparse matrix-matrix algorithm for graph neural network acceleration[C]//Proceedings of the 2023 IEEE International Symposium on Performance Analysis of Systems and Software(ISPASS).IEEE,2023:145-156.
[31]NAGASAKA Y,NUKADA A,KOJIMA R,et al.Batched sparse matrix multiplication for accelerating graph convolutional networks[C]//Proceedings of the 2019 19th IEEE/ACM International Symposium on Cluster,Cloud and Grid Computing(CCGRID).IEEE,2019:231-240.
[32]ZHENG B,JIANG Z,YU C H,et al.DietCode:Automatic optimization for dynamic tensor programs[J].Proceedings of Machine Learning and Systems,2022,4(1):848-863.
[33]YE Z,LAI R,SHAO J,et al.SparseTIR:Composable abstractions for sparse compilation in deep learning[C]//Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems,Volume 3.2023:660-678.
[34]WU W,SHI X,HE L,et al.TurboMGNN:Improving concurrent GNN training tasks on GPU with fine-grained kernel fusion[J].IEEE Transactions on Parallel and Distributed Systems,2023,34(6):1968-1981.
[35]TAUZ L,DOLECEK L.Variable coded batch matrix multiplication[J].IEEE Journal on Selected Areas in Information Theory,2022,3(2):306-320.