Computer Science ›› 2024, Vol. 51 ›› Issue (6): 52-60. doi: 10.11896/jsjkx.230800049

• Computer Software •

An Automatic Tensorization Method for TPU Coarse-grained Instructions

LIU Lei1, ZHOU Zhide1, LIU Xingxiang2, CHE Haoyang3, YAO Lei3, JIANG He1

  1 School of Software Engineering, Dalian University of Technology, Dalian, Liaoning 116620, China
    2 Sangfor Technologies Inc., Shenzhen, Guangdong 518000, China
    3 Zhejiang Zeekr Intelligent Technology Co., Ltd., Ningbo, Zhejiang 315800, China
  • Received: 2023-08-08  Revised: 2024-01-21  Online: 2024-06-15  Published: 2024-06-05
  • Corresponding author: ZHOU Zhide (cszide@gmail.com)
  • About author: (22117006@mail.dlut.edu.cn)
  • Supported by:
    Key Program of the National Natural Science Foundation of China (62032004), CCF-Sangfor Fuxi Fund (2022003), China Postdoctoral Science Foundation (2023M730472) and National Natural Science Foundation of China (62302077)

Automatic Tensorization for TPU Coarse-grained Instructions

LIU Lei1, ZHOU Zhide1, LIU Xingxiang2, CHE Haoyang3, YAO Lei3, JIANG He1   

  1 School of Software Engineering, Dalian University of Technology, Dalian, Liaoning 116620, China
    2 Sangfor Technologies Inc., Shenzhen, Guangdong 518000, China
    3 Zhejiang Zeekr Intelligent Technology Co., Ltd., Ningbo, Zhejiang 315800, China
  • Received: 2023-08-08  Revised: 2024-01-21  Online: 2024-06-15  Published: 2024-06-05
  • About author: LIU Lei, born in 2000, postgraduate. His main research interests include deep learning compilers.
    ZHOU Zhide, born in 1990, Ph.D, research associate, is a member of CCF (No.94505M). His main research interests include intelligent software engineering, software testing, and deep learning compilers.
  • Supported by:
    Key Program of the National Natural Science Foundation of China (62032004), CCF-SANGFOR Foundation (2022003), China Postdoctoral Science Foundation (2023M730472) and National Natural Science Foundation of China (62302077).

Abstract: Tensorization is the process of accelerating tensor computations by invoking hardware-specific instructions. TPUs support a variety of coarse-grained instructions that can express operators at the neural-network level and impose no explicit limits on computation scale. For coarse-grained instructions, existing tensorization methods require large amounts of handwritten IR matching fragments and struggle to implement flexible instruction parallelism in the form of double buffering (ping-pong buffers), which makes them hard to extend to TPU scenarios. To this end, an automatic tensorization method for TPU coarse-grained instructions, Tir2TPU, is proposed. First, Tir2TPU performs instruction replacement on the computation program based on an analysis of the TensorIR abstract syntax tree. Second, a parallel model that simulates hardware behavior is designed to achieve instruction parallelism optimization. Finally, a program schedule space based on TPU hardware characteristics is constructed to enable fast auto-tuning. Experiments evaluate performance on five operators commonly used in machine learning models, such as matrix multiplication. The results show that operators automatically generated and optimized by Tir2TPU achieve up to 3.1× and an average of 1.78× speedup over the TPU's own compiler, and reach on average 90% of hand-optimized performance.
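To make the Block-analysis step concrete, the following is a minimal TVMScript sketch (assuming a TVM build with TensorIR; the 128x128 shapes and the block name "C" are illustrative, not taken from the paper) of the kind of computation block such an analysis inspects. The "SSR" remap declares vi and vj as spatial iterators and vk as a reduction iterator; this iterator-binding information, together with the buffer access pattern, is what lets a tensorizer decide that the block fits a coarse-grained matrix instruction and replace it while traversing the AST.

    from tvm.script import tir as T

    @T.prim_func
    def matmul(A: T.Buffer((128, 128), "float32"),
               B: T.Buffer((128, 128), "float32"),
               C: T.Buffer((128, 128), "float32")) -> None:
        # One reduction block: vi, vj are spatial ("S") axes, vk is a reduction ("R") axis.
        for i, j, k in T.grid(128, 128, 128):
            with T.block("C"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float32(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]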

Key words: Machine-learning compiler, Tensor accelerator, Tensorization, Instruction parallelism, Operator optimization

Abstract: Tensorization refers to the process of calling specific hardware instructions to accelerate tensor programs. TPU supports various coarse-grained instructions for computation and memory transactions without clear constraints on the input scale. How to use these instructions to automatically generate tensorized programs has become an important topic. However, existing tensorization methods require a large number of handwritten matching fragments for coarse-grained instructions and do not support flexible instruction parallelism optimizations such as ping-pong buffering, which makes them inefficient to scale to TPU scenarios. To this end, this paper proposes Tir2TPU, an automatic tensorization method for TPU coarse-grained instructions. Firstly, Tir2TPU extracts the iterator binding information of the Block structure and automatically performs instruction replacement while traversing TensorIR's abstract syntax tree. Secondly, it utilizes a parallel model that simulates hardware behavior to generate a parallel instruction flow. Finally, Tir2TPU constructs a hardware-centric schedule space based on TPU features, which greatly accelerates the auto-tuning process. The performance of Tir2TPU is evaluated on 5 operators commonly used in machine learning models. Experimental results show that Tir2TPU can achieve up to 3.1× and an average of 1.78× speedup compared to the TPU's own compiler, and delivers on average 90% of the performance of manually optimized operators.
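As a rough, self-contained illustration of the double-buffering (ping-pong buffer) idea behind such a parallel model, the sketch below is hypothetical plain Python rather than the paper's implementation, and the cycle counts are invented. It estimates how overlapping the load of the next tile with compute on the current tile shortens the instruction flow: with a single buffer every tile pays load + compute cycles, while with two buffers the steady-state cost per tile is only max(load, compute).

    # Hypothetical cycle model for a two-stage LOAD -> COMPUTE pipeline on a tensor accelerator.
    def total_cycles(num_tiles: int, load: int, compute: int, ping_pong: bool) -> int:
        if not ping_pong:
            # Single buffer: each tile's load and compute must serialize.
            return num_tiles * (load + compute)
        # Ping-pong buffers: the first load fills one buffer, the last compute drains
        # the other, and every load in between overlaps with the previous compute.
        return load + (num_tiles - 1) * max(load, compute) + compute

    if __name__ == "__main__":
        serial = total_cycles(8, load=100, compute=120, ping_pong=False)
        overlapped = total_cycles(8, load=100, compute=120, ping_pong=True)
        print(f"serial: {serial} cycles, ping-pong: {overlapped} cycles, "
              f"speedup: {serial / overlapped:.2f}x")

A scheduler built on such a model can decide, per operator and tile size, whether the doubled on-chip buffer footprint of the parallel instruction flow is worth the overlap it buys.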

Key words: Machine-learning compiler, Tensor accelerator, Tensorization, Instruction parallelism, Operator optimization

CLC Number: TP311