Computer Science ›› 2025, Vol. 52 ›› Issue (5): 58-66. doi: 10.11896/jsjkx.240100018

• High Performance Computing •

Operator Fusion Optimization for Deep Learning Compiler TVM

GAO Wei1, WANG Lei1,2, LI Jianan1,2, LI Shuailong1,2, HAN Lin1   

  1. National Supercomputing Center in Zhengzhou(Zhengzhou University),Zhengzhou 450001,China
    2. School of Computer and Artificial Intelligence,Zhengzhou University,Zhengzhou 450001,China
  • Received:2024-01-02 Revised:2024-06-20 Online:2025-05-15 Published:2025-05-12
  • About author:GAO Wei,born in 1985,Ph.D,associate professor.His main research interest is high performance computing.
    HAN Lin,born in 1978,Ph.D,professor.His main research interests include high performance computing,advanced compilation,parallel programming and optimization.
  • Supported by:
    Henan Province Major Science and Technology Project “Domestic Advanced Computing Platform Innovation Ecology and Application Research”(221100210600).

Abstract: Operator fusion is an optimization technique employed by deep learning compilers to combine multiple operators into a single, larger operator, which effectively reduces computation cost and memory access. In the operator fusion scheme of the deep learning compiler TVM, operators are classified by their functional characteristics, fusion rules are devised accordingly, and fusion is carried out with a greedy algorithm. This scheme has two problems. First, the fusion rules derived from the functional classification are not sufficiently general, so fusion opportunities are missed and fusion granularity is limited. Second, the greedy algorithm does not yield an optimal fusion solution. To address these issues, TVM is improved by introducing an operator classification method based on input/output mapping types and by designing a more complete set of fusion rules that enlarge the granularity of operator fusion. In addition, a dynamic-programming-based search algorithm for fusion schemes and a cost evaluation model are proposed; the cost model prunes the search space so that an optimal solution can be found within a reasonable time. To evaluate the effectiveness of the improved fusion scheme, experiments are conducted with popular deep learning models, namely VGG-16, EfficientNet-B0, MobileNet-V1 and YOLO-V4, on both CPU and DCU platforms. Experimental results show that, compared with the original TVM fusion scheme, the improved scheme raises the fusion ratio of deep learning models by 27% on average, and the average inference latency speedup ratio is 1.75.
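
To make the dynamic-programming search described in the abstract concrete, the following is a minimal, self-contained Python sketch under simplified assumptions: a linear chain of operators, a classification by input/output mapping type, a hand-written fusion-rule table, and a toy cost model with a fixed per-kernel launch overhead. All names here (OpKind, FUSIBLE, group_cost, best_fusion) are hypothetical illustrations, not the paper's or TVM's actual implementation.

from enum import Enum, auto

class OpKind(Enum):
    """Operator classes by input/output mapping type (illustrative)."""
    ELEMWISE = auto()    # one-to-one mapping, e.g. relu, add
    BROADCAST = auto()   # one-to-many mapping, e.g. broadcast add
    REDUCTION = auto()   # many-to-one mapping, e.g. sum, softmax
    COMPLEX = auto()     # complex mapping, e.g. conv2d, dense

# Hypothetical fusion-rule table: which adjacent kinds may share a kernel.
FUSIBLE = {
    (OpKind.ELEMWISE, OpKind.ELEMWISE),
    (OpKind.ELEMWISE, OpKind.BROADCAST),
    (OpKind.BROADCAST, OpKind.ELEMWISE),
    (OpKind.COMPLEX, OpKind.ELEMWISE),
    (OpKind.ELEMWISE, OpKind.REDUCTION),
}

def can_fuse(group):
    """A fusion group is legal if every adjacent pair obeys the rules."""
    return all((a, b) in FUSIBLE for a, b in zip(group, group[1:]))

def group_cost(group):
    """Toy cost model: a fixed kernel-launch overhead per fused group plus
    a unit compute cost per operator; a real model would estimate memory
    traffic and compute time on the target hardware."""
    LAUNCH_OVERHEAD = 5.0
    return LAUNCH_OVERHEAD + 1.0 * len(group)

def best_fusion(ops):
    """Dynamic programming over a linear operator chain:
    best[i] = min over j of best[j-1] + cost(ops[j-1:i]),
    taken over every legal fusion group ending at operator i."""
    n = len(ops)
    best = [float("inf")] * (n + 1)
    best[0] = 0.0
    split = [0] * (n + 1)          # split[i]: start index of the last group
    for i in range(1, n + 1):
        for j in range(i, 0, -1):  # grow the candidate group backwards
            group = ops[j - 1:i]
            if not can_fuse(group):
                break              # prune: longer groups ending at i stay illegal
            cost = best[j - 1] + group_cost(group)
            if cost < best[i]:
                best[i], split[i] = cost, j - 1
    groups, i = [], n              # recover the chosen grouping
    while i > 0:
        groups.append(ops[split[i]:i])
        i = split[i]
    return best[n], list(reversed(groups))

if __name__ == "__main__":
    chain = [OpKind.COMPLEX, OpKind.ELEMWISE, OpKind.ELEMWISE,
             OpKind.REDUCTION, OpKind.COMPLEX, OpKind.ELEMWISE]
    total, groups = best_fusion(chain)
    print("total cost:", total)
    print("fusion groups:", [[k.name for k in g] for g in groups])

In this toy setting the break in the inner loop plays the role of search-space pruning: once a candidate group ending at operator i violates a fusion rule, every longer group ending at i contains the same illegal pair, so those candidates are skipped without evaluation.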

Key words: Deep learning compiler, TVM, Operator fusion, Fusion rule, Dynamic programming

CLC Number: TP314

[1]ABADI M,BARHAM P,CHEN J,et al.TensorFlow:a system for large-scale machine learning[C]//12th USENIX Symposium on Operating Systems Design and Implementation(OSDI 16).Savannah,GA,USA,Berkeley,CA,USA:USENIX Association,2016:265-283.
[2]PASZKE A,GROSS S,MASSA F,et al.PyTorch:an imperative style,high-performance deep learning library[J].Advances in Neural Information Processing Systems,2019,32:8026-8037.
[3]LI M Z,LIU Y,LIU X Y,et al.The deep learning compiler:a comprehensive survey[J].IEEE Transactions on Parallel and Distributed Systems,2021,32(3):708-727.
[4]ZHANG H B,XING M J,WU Y J,et al.Compiler technologies in deep learning co-design:a survey[J/OL].https://spj.science.org/doi/10.34133/icomputing.0040.
[5]LATTNER C,AMINI M,BONDHUGULA U,et al.MLIR:a compiler infrastructure for the end of Moore's law[J].arXiv:2002.11054,2020.
[6]CHEN T Q,MOREAU T,JIANG Z H,et al.TVM:an automated end-to-end optimizing compiler for deep learning[C]//13th USENIX Symposium on Operating Systems Design and Implementation(OSDI 18).Carlsbad,CA,USA,Berkeley,CA,USA:USENIX Association,2018:578-594.
[7]ROESCH J,LYUBOMIRSKY S,WEBER L,et al.Relay:a new IR for machine learning frameworks[C]//Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages(MAPL 2018).Philadelphia,PA,USA,New York,NY,USA:Association for Computing Machinery,2018:58-68.
[8]FENG S Y,HOU B H,JIN H Y,et al.TensorIR:an abstraction for automatic tensorized program optimization[C]//Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems,Volume 2(ASPLOS 2023).Vancouver,BC,Canada,New York,NY,USA:Association for Computing Machinery,2023:804-817.
[9]RAGAN-KELLEY J,BARNES C,ADAMS A,et al.Halide:a language and compiler for optimizing parallelism,locality,and recomputation in image processing pipelines[C]//Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation(PLDI'13).Seattle,WA,USA,New York,NY,USA:Association for Computing Machinery,2013:519-530.
[10]SNIDER D,LIANG R F.Operator fusion in XLA:analysis and evaluation[J].arXiv:2301.13062,2023.
[11]JIA Z H,PADON O,THOMAS J,et al.TASO:optimizing deep learning computation with automatic generation of graph substitutions[C]//Proceedings of the 27th ACM Symposium on Operating Systems Principles(SOSP'19).Huntsville,Ontario,Canada,New York,NY,USA:Association for Computing Machinery,2019:47-62.
[12]NIU W,GUAN J X,WANG Y Z,et al.DNNFusion:accelerating deep neural networks execution with advanced operator fusion[C]//Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation(PLDI 2021).Virtual,Canada,New York,NY,USA:Association for Computing Machinery,2021:883-898.
[13]JIA Z H,THOMAS J,WARSZAWSKI T,et al.Optimizing DNN computation with relaxed graph substitutions[J].Proceedings of Machine Learning and Systems,2019,1:27-39.
[14]FANG J Z,SHEN Y Y,WANG Y,et al.Optimizing DNN computation graph using graph substitutions[J].Proceedings of the VLDB Endowment,2020,13(12):2734-2746.
[15]CAI X Y,WANG Y,ZHANG L.Optimus:an operator fusion framework for deep neural networks[J].ACM Transactions on Embedded Computing Systems,2022,22(1):1-26.
[16]ZHENG Z,YANG X D,ZHAO P Z,et al.AStitch:enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures[C]//Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems(ASPLOS 2022).Lausanne,Switzerland,New York,NY,USA:Association for Computing Machinery,2022:359-373.
[17]ZHAO J,GAO X,XIA R J,et al.Apollo:automatic partition-based operator fusion through layer by layer optimization[J].Proceedings of Machine Learning and Systems,2022,4:1-19.
[18]MA L X,XIE Z Q,YANG Z,et al.RAMMER:enabling holistic deep learning compiler optimizations with rTasks[C]//14th USENIX Symposium on Operating Systems Design and Implementation(OSDI 20).Berkeley,CA,USA:USENIX Association,2020:881-897.