Computer Science ›› 2025, Vol. 52 ›› Issue (5): 58-66. doi: 10.11896/jsjkx.240100018

• High Performance Computing •

Operator Fusion Optimization for Deep Learning Compiler TVM

GAO Wei1, WANG Lei1,2, LI Jianan1,2, LI Shuailong1,2, HAN Lin1   

  1 National Supercomputing Center in Zhengzhou(Zhengzhou University),Zhengzhou 450001,China
    2 School of Computer and Artificial Intelligence,Zhengzhou University,Zhengzhou 450001,China
  • Received:2024-01-02 Revised:2024-06-20 Online:2025-05-15 Published:2025-05-12
  • Corresponding author: HAN Lin(strollerlin@163.com)
  • About author:GAO Wei,born in 1985,Ph.D,associate professor.His main research interest is high performance computing.
    HAN Lin,born in 1978,Ph.D,professor.His main research interests include high performance computing,advanced compilation,parallel programming and optimization.
  • Supported by:
    Henan Province Major Science and Technology Project “Domestic Advanced Computing Platform Innovation Ecology and Application Research”(221100210600).

Abstract: Operator fusion is a compiler optimization technique in deep learning compilers that merges multiple operators into a single,larger operator,effectively reducing the cost of computation and memory access.In the operator fusion scheme of the deep learning compiler TVM,operators are categorized by their functional characteristics,fusion rules are devised accordingly,and a greedy algorithm carries out the fusion.This scheme has two problems.Firstly,the fusion rules built on the functional classification are not sufficiently general,so fusion opportunities are missed and coarser-grained fusion cannot be achieved.Secondly,the greedy algorithm cannot reach the optimal fusion solution.To address these issues,TVM is improved as follows:operators are reclassified by the mapping type between their inputs and outputs,and a set of general fusion rules is designed to enlarge the granularity of operator fusion;a dynamic-programming-based search algorithm for fusion schemes and an operator fusion cost evaluation model are proposed,and the search space is pruned so that an optimized fusion scheme can be found within reasonable time.To evaluate the effectiveness of the proposed scheme,the fusion ratio and inference latency of deep learning models such as VGG-16,EfficientNet-B0,MobileNet-V1 and YOLO-V4 are measured on CPU and DCU platforms.Experimental results show that,compared with TVM's original fusion scheme,the proposed scheme improves the fusion ratio by 27% on average and achieves an average speedup of 1.75 in inference latency.
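TVM's stock scheme classifies operators by functional pattern (elementwise, broadcast, injective, reduction, opaque, etc.) and fuses greedily along those patterns; the paper instead classifies by the mapping between an operator's input and output elements. The abstract does not spell out that taxonomy, so the Python sketch below is only an illustration of the idea: the mapping types, the OP_MAPPING table and the can_fuse rule are all hypothetical stand-ins, not the paper's actual rule set.

```python
# Minimal sketch (hypothetical names throughout) of classifying operators by
# input/output mapping type and testing a producer-consumer pair against a
# generic fusion rule.
from enum import Enum, auto

class MappingType(Enum):
    ONE_TO_ONE = auto()    # elementwise: each output element depends on one input element
    ONE_TO_MANY = auto()   # broadcast-like: one input element feeds many outputs
    MANY_TO_ONE = auto()   # reductions: many input elements fold into one output
    MANY_TO_MANY = auto()  # convolution/matmul-like operators
    OPAQUE = auto()        # mapping cannot be analyzed

# Hypothetical table assigning mapping types to a few common operators.
OP_MAPPING = {
    "add": MappingType.ONE_TO_ONE,
    "relu": MappingType.ONE_TO_ONE,
    "broadcast_to": MappingType.ONE_TO_MANY,
    "sum": MappingType.MANY_TO_ONE,
    "conv2d": MappingType.MANY_TO_MANY,
}

def can_fuse(producer: MappingType, consumer: MappingType) -> bool:
    """One plausible generic rule, invented for illustration: fuse any
    producer into its consumer unless a side is opaque, or a data-expanding
    producer (ONE_TO_MANY) would feed a reduction (MANY_TO_ONE) and thereby
    duplicate work inside the fused kernel."""
    if MappingType.OPAQUE in (producer, consumer):
        return False
    if producer == MappingType.ONE_TO_MANY and consumer == MappingType.MANY_TO_ONE:
        return False
    return True

# Example: conv2d followed by relu is fusible under this rule.
print(can_fuse(OP_MAPPING["conv2d"], OP_MAPPING["relu"]))  # True
```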

Key words: Deep learning compiler, TVM, Operator fusion, Fusion rule, Dynamic programming
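The dynamic-programming search described in the abstract can likewise be illustrated on the simplest possible case, a linear chain of operators. This sketch is an assumption-laden stand-in: the per-operator profiles, the weighted compute-plus-memory-traffic cost model and the memoized recursion are invented for illustration, whereas the paper works on general computation graphs with a more detailed cost evaluation model and explicit search-space pruning.

```python
# Sketch of a dynamic-programming search for a fusion scheme over a linear
# chain of operators; memoization acts as a simple form of pruning.
from functools import lru_cache

# Hypothetical per-operator profile: (name, compute_cost, bytes_in, bytes_out).
ops = [
    ("conv2d", 100.0, 64.0, 64.0),
    ("bias_add", 5.0, 64.0, 64.0),
    ("relu", 5.0, 64.0, 64.0),
    ("sum", 20.0, 64.0, 1.0),
]

def group_cost(i: int, j: int) -> float:
    """Cost of fusing ops[i:j] into one kernel: all compute is kept, but only
    the group's boundary tensors touch memory; intermediates stay on chip.
    (A real model would also consult fusion rules and launch overheads.)"""
    compute = sum(c for _, c, _, _ in ops[i:j])
    mem = ops[i][2] + ops[j - 1][3]  # bytes read at entry + written at exit
    return compute + 0.5 * mem       # 0.5: assumed relative cost per byte moved

@lru_cache(maxsize=None)
def best(i: int) -> float:
    """Minimum total cost of executing ops[i:], choosing where each fused
    group ends; memoization avoids re-exploring shared suffixes."""
    if i == len(ops):
        return 0.0
    return min(group_cost(i, j) + best(j) for j in range(i + 1, len(ops) + 1))

print(f"optimal fused cost: {best(0):.1f}")
```

Memoizing best(i) already prunes re-exploration of shared suffixes; on a general DAG the state would instead describe a partition frontier, which is where the additional pruning the paper proposes becomes essential for keeping search times reasonable.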

CLC number: TP314