Computer Science ›› 2024, Vol. 51 ›› Issue (6): 52-60. doi: 10.11896/jsjkx.230800049

• Computer Software •

Automatic Tensorization for TPU Coarse-grained Instructions

LIU Lei1, ZHOU Zhide1, LIU Xingxiang2, CHE Haoyang3, YAO Lei3, JIANG He1   

    1 School of Software Engineering,Dalian University of Technology,Dalian,Liaoning 116620,China
    2 Sangfor Technologies Inc.,Shenzhen,Guangdong 518000,China
    3 Zhejiang Zeekr Intelligent Technology Co.,Ltd.,Ningbo,Zhejiang 315800,China
  • Received:2023-08-08 Revised:2024-01-21 Online:2024-06-15 Published:2024-06-05
  • About author:LIU Lei,born in 2000,postgraduate.His main research interests include deep learning compilers and related topics.
    ZHOU Zhide,born in 1990,Ph.D.,research associate,is a member of CCF(No.94505M).His main research interests include intelligent software engineering,software testing,and deep learning compilers.
  • Supported by:
    Key Program of the National Natural Science Foundation of China(62032004),CCF-SANGFOR Foundation(2022003),China Postdoctoral Science Foundation(2023M730472) and National Natural Science Foundation of China(62302077).

Abstract: Tensorization refers to the process of calling specific hardware instructions to accelerate tensor programs. The TPU supports various coarse-grained instructions for computation and memory transactions without clear constraints on input scale, and how to use these instructions to automatically generate tensorized programs has become an important topic. However, existing tensorization methods require a large number of handwritten matching fragments for coarse-grained instructions and do not support flexible instruction-parallelism optimizations such as ping-pong buffering, which makes them inefficient to scale to TPU scenarios. To this end, this paper proposes Tir2TPU, an automatic tensorization method for TPU coarse-grained instructions. Firstly, Tir2TPU extracts the iterator binding information of each Block structure and automatically performs instruction replacement while traversing TensorIR's abstract syntax tree. Secondly, it utilizes a parallel model that simulates hardware behavior to generate a parallel instruction flow. Finally, Tir2TPU incorporates a hardware-centric schedule space based on TPU features, which greatly accelerates the auto-tuning process. The performance of Tir2TPU is evaluated on 5 operators commonly used in machine learning models. Experimental results show that Tir2TPU achieves up to 3× and an average of 1.78× speedup over the TPU's compiler, and consistently delivers 90% of the performance of manually optimized operators.
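The sketch below is a minimal, illustrative TVMScript example (not code from Tir2TPU or the TPU toolchain; the buffer shapes, block name, and intrinsic name are assumptions) showing the kind of TensorIR Block whose iterator bindings an automatic tensorizer inspects before substituting a coarse-grained matrix instruction.

import tvm
from tvm.script import tir as T

@T.prim_func
def matmul(A: T.Buffer((128, 128), "float32"),
           B: T.Buffer((128, 128), "float32"),
           C: T.Buffer((128, 128), "float32")):
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            # Iterator bindings: "S" marks spatial axes, "R" a reduction axis.
            # This binding information is what drives instruction replacement.
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

sch = tvm.tir.Schedule(matmul)
block = sch.get_block("C")
i, j, k = sch.get_loops(block)
# With stock TVM one would register a TPU intrinsic by hand and call, e.g.,
# sch.tensorize(i, "tpu_matmul_intrin") (hypothetical intrinsic name); an
# automatic approach instead performs this replacement while walking the
# TensorIR abstract syntax tree.

Once the loop nest is mapped to a matrix instruction, the remaining memory-transaction instructions can be overlapped with computation via ping-pong (double) buffering, which is the instruction-parallelism aspect discussed above.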

Key words: Machine-learning compiler, Tensor accelerator, Tensorization, Instruction parallelism, Operator optimization

CLC Number: TP311