Computer Science ›› 2025, Vol. 52 ›› Issue (5): 58-66. doi: 10.11896/jsjkx.240100018

• High Performance Computing •

Operator Fusion Optimization for Deep Learning Compiler TVM

GAO Wei1, WANG Lei1,2, LI Jianan1,2, LI Shuailong1,2, HAN Lin1   

  1. National Supercomputing Center in Zhengzhou(Zhengzhou University),Zhengzhou 450001,China
    2. School of Computer and Artificial Intelligence,Zhengzhou University,Zhengzhou 450001,China
  • Received:2024-01-02 Revised:2024-06-20 Online:2025-05-15 Published:2025-05-12
  • About author:GAO Wei,born in 1985,Ph.D,associate professor.His main research interest is high performance computing.
    HAN Lin,born in 1978,Ph.D,professor.His main research interests include high performance computing,advanced compilation,parallel programming and optimization.
  • Supported by:
    Henan Province Major Science and Technology Project “Domestic Advanced Computing Platform Innovation Ecology and Application Research”(221100210600).

Abstract: Operator fusion is an optimization technique employed by deep learning compilers to combine multiple operators into a single, larger operator, which effectively reduces computation cost and memory access. In the operator fusion scheme of the deep learning compiler TVM, operators are classified by their functional characteristics, fusion rules are devised accordingly, and fusion is carried out with a greedy algorithm. This scheme has two problems. First, the fusion rules derived from the functional classification are not sufficiently general, so fusion opportunities are missed and fusion granularity is limited. Second, the greedy algorithm does not yield an optimal fusion solution. To address these issues, TVM is improved by introducing an operator classification method based on input/output mapping types and by designing a more complete set of fusion rules that enlarge the granularity of operator fusion. In addition, a dynamic-programming-based search algorithm for fusion schemes and a cost evaluation model are proposed; the cost model prunes the search space so that an optimal solution can be found within a reasonable time. To evaluate the effectiveness of the improved fusion scheme, experiments are conducted with popular deep learning models, namely VGG-16, EfficientNet-B0, MobileNet-V1 and YOLO-V4, on both CPU and DCU platforms. Experimental results show that, compared with the original TVM fusion scheme, the improved scheme raises the fusion ratio of deep learning models by 27% on average, and the average inference latency speedup ratio is 1.75.
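
To make the dynamic-programming search described in the abstract concrete, the following is a minimal, self-contained Python sketch under simplified assumptions: a linear chain of operators, a classification by input/output mapping type, a hand-written fusion-rule table, and a toy cost model with a fixed per-kernel launch overhead. All names here (OpKind, FUSIBLE, group_cost, best_fusion) are hypothetical illustrations, not the paper's or TVM's actual implementation.

from enum import Enum, auto

class OpKind(Enum):
    """Operator classes by input/output mapping type (illustrative)."""
    ELEMWISE = auto()    # one-to-one mapping, e.g. relu, add
    BROADCAST = auto()   # one-to-many mapping, e.g. broadcast add
    REDUCTION = auto()   # many-to-one mapping, e.g. sum, softmax
    COMPLEX = auto()     # complex mapping, e.g. conv2d, dense

# Hypothetical fusion-rule table: which adjacent kinds may share a kernel.
FUSIBLE = {
    (OpKind.ELEMWISE, OpKind.ELEMWISE),
    (OpKind.ELEMWISE, OpKind.BROADCAST),
    (OpKind.BROADCAST, OpKind.ELEMWISE),
    (OpKind.COMPLEX, OpKind.ELEMWISE),
    (OpKind.ELEMWISE, OpKind.REDUCTION),
}

def can_fuse(group):
    """A fusion group is legal if every adjacent pair obeys the rules."""
    return all((a, b) in FUSIBLE for a, b in zip(group, group[1:]))

def group_cost(group):
    """Toy cost model: a fixed kernel-launch overhead per fused group plus
    a unit compute cost per operator; a real model would estimate memory
    traffic and compute time on the target hardware."""
    LAUNCH_OVERHEAD = 5.0
    return LAUNCH_OVERHEAD + 1.0 * len(group)

def best_fusion(ops):
    """Dynamic programming over a linear operator chain:
    best[i] = min over j of best[j-1] + cost(ops[j-1:i]),
    taken over every legal fusion group ending at operator i."""
    n = len(ops)
    best = [float("inf")] * (n + 1)
    best[0] = 0.0
    split = [0] * (n + 1)          # split[i]: start index of the last group
    for i in range(1, n + 1):
        for j in range(i, 0, -1):  # grow the candidate group backwards
            group = ops[j - 1:i]
            if not can_fuse(group):
                break              # prune: longer groups ending at i stay illegal
            cost = best[j - 1] + group_cost(group)
            if cost < best[i]:
                best[i], split[i] = cost, j - 1
    groups, i = [], n              # recover the chosen grouping
    while i > 0:
        groups.append(ops[split[i]:i])
        i = split[i]
    return best[n], list(reversed(groups))

if __name__ == "__main__":
    chain = [OpKind.COMPLEX, OpKind.ELEMWISE, OpKind.ELEMWISE,
             OpKind.REDUCTION, OpKind.COMPLEX, OpKind.ELEMWISE]
    total, groups = best_fusion(chain)
    print("total cost:", total)
    print("fusion groups:", [[k.name for k in g] for g in groups])

In this toy setting the break in the inner loop plays the role of search-space pruning: once a candidate group ending at operator i violates a fusion rule, every longer group ending at i contains the same illegal pair, so those candidates are skipped without evaluation.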

Key words: Deep learning compiler, TVM, Operator fusion, Fusion rule, Dynamic programming

CLC Number: TP314

[1]ABADI M,BARHAM P,CHEN J,et al.TensorFlow:a system for large-scale machine learning[C]//12th USENIX Symposium on Operating Systems Design and Implementation(OSDI 16).Savannah,GA,USA,Berkeley,CA,USA:USENIX Association,2016:265-283.
[2]PASZKE A,GROSS S,MASSA F,et al.PyTorch:an imperative style,high-performance deep learning library[J].Advances in Neural Information Processing Systems,2019,32:8026-8037.
[3]LI M Z,LIU Y,LIU X Y,et al.The deep learning compiler:a comprehensive survey[J].IEEE Transactions on Parallel and Distributed Systems,2021,32(3):708-727.
[4]ZHANG H B,XING M J,WU Y J,et al.Compiler technologies in deep learning co-design:a survey[J/OL].https://spj.science.org/doi/10.34133/icomputing.0040.
[5]LATTNER C,AMINI M,BONDHUGULA U,et al.MLIR:a compiler infrastructure for the end of Moore's law[J].arXiv:2002.11054,2020.
[6]CHEN T Q,MOREAU T,JIANG Z H,et al.TVM:an automated end-to-end optimizing compiler for deep learning[C]//13th USENIX Symposium on Operating Systems Design and Implementation(OSDI 18).Carlsbad,CA,USA,Berkeley,CA,USA:USENIX Association,2018:578-594.
[7]ROESCH J,LYUBOMIRSKY S,WEBER L,et al.Relay:a new IR for machine learning frameworks[C]//Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages(MAPL 2018).Philadelphia,PA,USA,New York,NY,USA:Association for Computing Machinery,2018:58-68.
[8]FENG S Y,HOU B H,JIN H Y,et al.TensorIR:an abstraction for automatic tensorized program optimization[C]//Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems,Volume 2(ASPLOS 2023).Vancouver,BC,Canada,New York,NY,USA:Association for Computing Machinery,2023:804-817.
[9]RAGAN-KELLEY J,BARNES C,ADAMS A,et al.Halide:a language and compiler for optimizing parallelism,locality,and recomputation in image processing pipelines[C]//Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation(PLDI'13).Seattle,WA,USA,New York,NY,USA:Association for Computing Machinery,2013:519-530.
[10]SNIDER D,LIANG R F.Operator fusion in XLA:analysis and evaluation[J].arXiv:2301.13062,2023.
[11]JIA Z H,PADON O,THOMAS J,et al.TASO:optimizing deep learning computation with automatic generation of graph substitutions[C]//Proceedings of the 27th ACM Symposium on Operating Systems Principles(SOSP'19).Huntsville,Ontario,Canada,New York,NY,USA:Association for Computing Machinery,2019:47-62.
[12]NIU W,GUAN J X,WANG Y Z,et al.DNNFusion:accelerating deep neural networks execution with advanced operator fusion[C]//Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation(PLDI 2021).Virtual,Canada,New York,NY,USA:Association for Computing Machinery,2021:883-898.
[13]JIA Z H,THOMAS J,WARSZAWSKI T,et al.Optimizing DNN computation with relaxed graph substitutions[J].Proceedings of Machine Learning and Systems,2019,1:27-39.
[14]FANG J Z,SHEN Y Y,WANG Y,et al.Optimizing DNN computation graph using graph substitutions[J].Proceedings of the VLDB Endowment,2020,13(12):2734-2746.
[15]CAI X Y,WANG Y,ZHANG L.Optimus:an operator fusion framework for deep neural networks[J].ACM Transactions on Embedded Computing Systems,2022,22(1):1-26.
[16]ZHENG Z,YANG X D,ZHAO P Z,et al.AStitch:enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems(ASPLOS 2022).Lausanne,Switzerland,New York,NY,USA:Association for Computing Machinery,2022:359-373.
[17]ZHAO J,GAO X,XIA R J,et al.Apollo:automatic partition-based operator fusion through layer by layer optimization[J].Proceedings of Machine Learning and Systems,2022,4:1-19.
[18]MA L X,XIE Z Q,YANG Z,et al.RAMMER:enabling holistic deep learning compiler optimizations with rTasks[C]//14th USENIX Symposium on Operating Systems Design and Implementation(OSDI 20).Berkeley,CA,USA:USENIX Association,2020:881-897.