Computer Science ›› 2023, Vol. 50 ›› Issue (2): 374-383. doi: 10.11896/jsjkx.220300147

• Interdiscipline & Frontier •

Tensor Instruction Generation Optimization Fusing with Loop Partitioning

LIANG Jiali, HUA Baojian, SU Shaobo   

  1. School of Software Engineering, University of Science and Technology of China, Hefei 230000, China
  • Received: 2022-03-15  Revised: 2022-07-04  Online: 2023-02-15  Published: 2023-02-22
  • Supported by:
    Graduate Education Innovation Program of USTC (2020YCJC41, 2021YCJC34)

Abstract: A tensor compiler compiles an operator's tensor algorithm and schedule into code for the target hardware. To accelerate tensor operations, domain-specific processors for deep learning adopt specialized architectures with dedicated instructions, supporting multi-core parallelism, multi-level dedicated memory hierarchies, and tensor computation. On top of such hardware sits a tensor instruction set tightly coupled to the hardware's characteristics. In such a complex architecture, the use of tensor instructions is subject to many constraints and limitations, which raises the following problems and challenges. First, the conditional branches introduced by loop tiling for computing task division or data chunking increase the difficulty of pattern matching. Second, tensor instructions carry hardware constraints such as alignment and data layout requirements. To address these problems and challenges, this paper proposes a tensor instruction generation optimization algorithm based on loop partitioning. By partitioning the loop iteration interval, the algorithm eliminates the conditional branches introduced by task division or data segmentation. Instruction and hardware constraints are resolved by zero padding, equivalent-instruction substitution, and additional computation, and tensor instructions are then generated by pattern matching. This work extends the open-source deep learning compiler TVM 0.7 and implements a prototype compiler that supports tensor instruction generation for a DianNao-architecture machine learning accelerator. To evaluate the effectiveness of the algorithm, the operator performance and development efficiency of element-wise binary tensor operators, in-place unary tensor operators, and convolution operators are measured on a DianNao-architecture machine learning accelerator hardware platform. Experimental results show that the three classes of operators achieve an average speedup of 125.00% and a maximum speedup of 194.00%, and development efficiency improves by up to 7 times.
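To make the two core ideas of the abstract concrete, the following is a minimal, hypothetical Python/NumPy sketch, not the paper's TVM implementation; TILE and tensor_add are assumed stand-ins for the accelerator's instruction width and a fixed-width tensor instruction. It shows how partitioning the iteration interval removes the per-iteration bounds check that loop tiling would otherwise introduce, and how zero padding lets the loop tail still map onto the fixed-width instruction.

import numpy as np

TILE = 8  # assumed width of the hardware tensor instruction (hypothetical)

def tensor_add(dst, a, b):
    # Stand-in for a fixed-width tensor instruction: consumes exactly TILE elements.
    dst[:] = a + b

def add_partitioned(a, b):
    n = a.shape[0]
    out = np.empty_like(a)
    full = (n // TILE) * TILE
    # Full-tile region: no bounds check inside the loop body, so every
    # iteration pattern-matches the tensor instruction directly.
    for i in range(0, full, TILE):
        tensor_add(out[i:i + TILE], a[i:i + TILE], b[i:i + TILE])
    # Tail region: zero-pad the operands up to TILE, issue the same
    # instruction, then copy back only the valid prefix.
    if full < n:
        rem = n - full
        pa = np.zeros(TILE, a.dtype)
        pb = np.zeros(TILE, b.dtype)
        pa[:rem], pb[:rem] = a[full:], b[full:]
        pout = np.empty(TILE, a.dtype)
        tensor_add(pout, pa, pb)
        out[full:] = pout[:rem]
    return out

print(add_partitioned(np.arange(13.0), np.ones(13)))  # [ 1.  2. ... 13.]

Zero padding is only safe for operators whose results on the valid lanes are unaffected by the padded lanes, as with the element-wise addition above; the abstract's other remedies, equivalent-instruction substitution and additional computation, cover cases where padding alone does not suffice.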

Key words: Deep learning, Tensor compiler, Domain-specific processor, Tensorization, Loop partition

CLC Number: TP311