Computer Science ›› 2025, Vol. 52 ›› Issue (5): 25-40. doi: 10.11896/jsjkx.240500052

• High Performance Computing •


Accelerating Batched Matrix Multiplication for Variable Small Sizes Based on TVM and Applications

DAI Hanwen, CHEN Changbo   

  1. Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
    2. Chongqing School, University of Chinese Academy of Sciences, Chongqing 400714, China
  • Received: 2024-05-13  Revised: 2024-11-29  Online: 2025-05-15  Published: 2025-05-12
  • Corresponding author: CHEN Changbo (chenchangbo@cigit.ac.cn)
  • About author: DAI Hanwen, born in 1998, postgraduate (18944511402@163.com). His main research interests include high performance computing and machine learning.
    CHEN Changbo, born in 1981, Ph.D., professor, is a member of CCF (No. 64614M). His main research interests include computer algebra, high performance computing and applied machine learning.
  • Supported by:
    National Key R&D Program of China (2023YFA1009402, 2020YFA0712300), Chongqing Talents Plan Youth Top-Notch Project (2021000263) and Chongqing Academician-led Science and Technology Innovation Guidance Projects (cstc2021yszx-jcyjX0004, 2022YSZX-JCX0011CSTB, CSTB2023YSZX-JCX0008).


Abstract: Many practical applications require the efficient computation of a large number of small matrix products of varying dimensions. For instance, graph classification based on graph neural networks needs to multiply many adjacency matrices with node feature matrices. Existing methods cannot efficiently compute such variable-size batched small matrix multiplication across different hardware platforms. To address this, this paper proposes BVSM, an efficient cross-platform algorithm built on the deep learning compiler TVM. BVSM enables TVM to perform batched matrix multiplication for small and variable sizes efficiently by customizing optimization templates for small matrices, employing tensorized batching, and applying grouped padding. Experiments on real graph classification datasets show that, on the CPU, BVSM achieves on average more than a two times speedup over auto-scheduled and auto-tuned TVM (AnsorTVM), reaches on average 95% of the performance of Intel MKL's variable-size batched matrix multiplication, and outperforms it by up to 1.27 times. On the GPU, BVSM achieves average speedups of 62.05 times over AnsorTVM, 28.82 times over cuBLAS, and 6.59 times over MAGMA's variable-size batched matrix multiplication.
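
To make the grouped padding idea concrete, the sketch below illustrates, in plain NumPy, one plausible way to organize variable-size batched small matrix multiplication: matrix pairs are bucketed by their dimensions rounded up to a common granularity, each bucket is zero-padded to a uniform shape, a single batched multiplication is issued per bucket, and the products are sliced back to their true sizes. This is only an illustration of the general technique under assumed details (the helper batched_matmul_grouped_padding and the rounding-based bucketing rule are hypothetical); it is not the BVSM implementation, which additionally relies on TVM-specific optimization templates and tensorized batching to generate the per-group kernels.

import numpy as np
from collections import defaultdict

def batched_matmul_grouped_padding(As, Bs, pad_to=4):
    """Compute As[i] @ Bs[i] for pairs of small matrices of varying sizes.

    Pairs are grouped by their dimensions rounded up to a multiple of
    `pad_to`, each group is zero-padded to a common shape, and one batched
    multiplication is issued per group. Illustrative sketch only.
    """
    groups = defaultdict(list)
    for idx, (A, B) in enumerate(zip(As, Bs)):
        m, k = A.shape
        k2, n = B.shape
        assert k == k2, "inner dimensions must match"
        # Bucket key: dimensions rounded up to the padding granularity.
        key = tuple(-(-d // pad_to) * pad_to for d in (m, k, n))
        groups[key].append(idx)

    results = [None] * len(As)
    for (M, K, N), indices in groups.items():
        # Zero-pad every matrix in the group to the shared padded shape.
        A_batch = np.zeros((len(indices), M, K))
        B_batch = np.zeros((len(indices), K, N))
        for slot, idx in enumerate(indices):
            m, k = As[idx].shape
            _, n = Bs[idx].shape
            A_batch[slot, :m, :k] = As[idx]
            B_batch[slot, :k, :n] = Bs[idx]
        # One batched multiplication per group (np.matmul maps over the batch axis).
        C_batch = np.matmul(A_batch, B_batch)
        # Slice each product back to its true, unpadded size.
        for slot, idx in enumerate(indices):
            m, _ = As[idx].shape
            _, n = Bs[idx].shape
            results[idx] = C_batch[slot, :m, :n]
    return results

# Example: variable-size adjacency-like matrices times feature matrices.
rng = np.random.default_rng(0)
sizes = [(3, 3, 5), (7, 7, 5), (4, 4, 5), (6, 6, 5)]
As = [rng.random((m, k)) for m, k, _ in sizes]
Bs = [rng.random((k, n)) for _, k, n in sizes]
Cs = batched_matmul_grouped_padding(As, Bs)
assert all(np.allclose(C, A @ B) for C, A, B in zip(Cs, As, Bs))

Grouping before padding keeps the wasted computation bounded by the padding granularity within each bucket, which is the trade-off that grouped padding targets when batching matrices of many different sizes.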

Key words: TVM, Batched matrix multiplication, Matrix multiplication for variable sizes

CLC Number: TP302