Computer Science ›› 2023, Vol. 50 ›› Issue (2): 374-383. doi: 10.11896/jsjkx.220300147

• Interdisciplinary & Frontier •

Tensor Instruction Generation Optimization Fusing with Loop Partitioning

LIANG Jiali, HUA Baojian, SU Shaobo

  1. School of Software Engineering, University of Science and Technology of China, Hefei 230000, China
  • Received: 2022-03-15  Revised: 2022-07-04  Online: 2023-02-15  Published: 2023-02-22
  • Corresponding author: HUA Baojian (bjhua@ustc.edu.cn)
  • About the first author: LIANG Jiali (liangjl@mail.ustc.edu.cn)
  • Supported by:
    Graduate Education Innovation Program of the University of Science and Technology of China (2020YCJC41, 2021YCJC34)


Abstract: A tensor compiler compiles the tensor description and computation schedule of an operator into code for the target hardware. To accelerate tensor operations, domain-specific processors for deep learning are designed as specialized architectures with dedicated instructions, supporting multi-core parallelism, multi-level dedicated memory hierarchies, and tensor computation; on top of the hardware sits a tensor instruction set closely tied to the hardware's characteristics. On such a complex architecture, the use of tensor instructions is subject to many constraints and limitations, which raises the following problems and challenges. First, the conditional branches introduced by loop segmentation, such as computing-task division or data tiling, increase the difficulty of pattern matching. Second, tensor instructions carry hardware constraints such as alignment and data layout. To address these problems and challenges, a tensor instruction generation optimization algorithm fused with loop partitioning is proposed. The algorithm eliminates the conditional branches introduced by task division or data tiling by partitioning loop intervals; it resolves instruction and hardware constraints by zero padding, equivalent instruction substitution, and extra computation; and it generates tensor instructions by pattern matching. The open-source deep learning compiler TVM (version 0.7) is studied and extended to implement a prototype compiler system supporting tensor instruction generation for the DianNao-architecture machine learning accelerator. To evaluate the effectiveness of the algorithm, the performance and development efficiency of three classes of operators, namely element-wise binary tensor operators, in-place unary tensor operators, and convolution operators, are tested on a DianNao-architecture machine learning accelerator hardware platform. Experimental results show that the average speedup of the three classes of operators is 125.00%, the maximum speedup is 194.00%, and development efficiency improves by up to 7 times.

Key words: Deep learning, Tensor compiler, Domain-specific processor, Tensorization, Loop partition
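The loop-partitioning idea in the abstract, splitting an iteration space into a branch-free main region of full tiles plus a short remainder, can be illustrated with a minimal sketch. This is plain Python with illustrative names, not the paper's TVM-based implementation; it only shows why removing the tail-guard branch makes the loop body matchable as a tensor instruction:

```python
def tiled_with_branch(a, b, out, tile=4):
    """Naive tiling: every iteration guards against the ragged tail."""
    n = len(a)
    ntiles = -(-n // tile)               # ceiling division
    for t in range(ntiles):
        for i in range(t * tile, (t + 1) * tile):
            if i < n:                    # this branch blocks pattern matching
                out[i] = a[i] + b[i]

def tiled_partitioned(a, b, out, tile=4):
    """Loop partitioning: a branch-free main region plus an epilogue."""
    n = len(a)
    main = (n // tile) * tile            # largest multiple of the tile size
    for t in range(0, main, tile):       # full tiles only: the inner loop body
        for i in range(t, t + tile):     # is branch-free, so a pattern matcher
            out[i] = a[i] + b[i]         # can map it to one tensor-add instruction
    for i in range(main, n):             # remainder handled separately
        out[i] = a[i] + b[i]
```

In the main region the inner loop has exactly the shape of a fixed-length vector add, so instruction selection by pattern matching becomes a direct rewrite; only the short epilogue still needs scalar code or one of the padding strategies described above.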

CLC number: TP311
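The zero-padding strategy for instruction alignment constraints mentioned in the abstract can likewise be sketched. In this illustrative Python fragment, `VECTOR_WIDTH` and `tensor_add` are hypothetical stand-ins for the accelerator's real alignment rule and tensor instruction; padding with zeros is safe because zero is the identity of addition:

```python
VECTOR_WIDTH = 8  # hypothetical constraint: operand length must be a multiple of this

def tensor_add(x, y):
    """Stand-in for a hardware tensor-add instruction with an alignment constraint."""
    assert len(x) == len(y) and len(x) % VECTOR_WIDTH == 0
    return [u + v for u, v in zip(x, y)]

def add_with_zero_padding(a, b):
    """Pad both operands with zeros so the constrained instruction applies,
    then drop the padded lanes from the result."""
    n = len(a)
    pad = (-n) % VECTOR_WIDTH            # lanes needed to reach alignment
    res = tensor_add(a + [0] * pad, b + [0] * pad)
    return res[:n]                       # padded lanes contribute nothing
```

For operators without a padding identity, the abstract's other two strategies apply instead: substituting an equivalent instruction sequence, or adding extra computation whose effect is discarded.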