Computer Science (计算机科学) ›› 2025, Vol. 52 ›› Issue (5): 58-66. doi: 10.11896/jsjkx.240100018
GAO Wei1, WANG Lei1,2, LI Jianan1,2, LI Shuailong1,2, HAN Lin1
Abstract: Operator fusion is a compiler optimization technique in deep learning compilers that merges multiple operators into a single larger operator, effectively reducing the cost of computation and memory access. The operator fusion scheme of the deep learning compiler TVM classifies operators by their functional characteristics, designs fusion rules on top of that classification, and then applies a greedy algorithm to perform fusion. This scheme has two problems. First, the fusion rules derived from the function-based operator classification are not general enough, so fusion opportunities are missed and larger-granularity fusion cannot be achieved. Second, the greedy fusion algorithm cannot reach the optimal fusion solution. To address these problems, this paper improves TVM as follows: it proposes an operator classification based on the mapping type between operator inputs and outputs, and designs general fusion rules to enlarge the granularity of operator fusion; it further proposes a dynamic-programming-based search algorithm for fusion plans together with a cost-evaluation model for operator fusion, and prunes the search space so that the algorithm can find an optimized fusion plan within a reasonable time. To evaluate the effectiveness of the fusion scheme, the fusion ratio and inference latency of deep learning models such as VGG-16, EfficientNet-B0, MobileNet-V1, and YOLO-V4 were tested on CPU and DCU platforms. Experimental results show that, compared with TVM's original fusion scheme, the proposed scheme improves the fusion ratio by 27% on average and achieves an average speedup of 1.75× in inference latency.