Computer Science ›› 2026, Vol. 53 ›› Issue (6): 128-136. doi: 10.11896/jsjkx.250600137

• High Performance Computing •

Study on Compilation Technology of Neural Network Accelerator Based on RISC-V Instruction Extension

WANG Yipin1, CAI Chenghuan1, XU Jiabin2, ZHOU Xuegong3, ZHANG Fengzhe3, CAO Wei3, ZHANG Fan3, YU Xinsheng4   

  1 School of Software Engineering, Fudan University, Shanghai 200433, China
  2 School of Computer Science and Technology, Fudan University, Shanghai 200433, China
  3 Institute of Big Data, Fudan University, Shanghai 200433, China
  4 The 32nd Research Institute, China Electronics Technology Group Corporation (CETC), Shanghai 201899, China
  • Received: 2025-06-20 Revised: 2025-09-26 Online: 2026-06-15 Published: 2026-06-09
  • About author: WANG Yipin, born in 2000, postgraduate. His main research interest is domain-specific hardware-software co-design.
    ZHOU Xuegong, Ph.D, assistant researcher. His main research interests include mimic intelligent computing, reconfigurable computing and EDA algorithms.
  • Supported by:
    National Key R&D Program of China (2022YFB4500903).

Abstract: With the rapid advancement of artificial intelligence, neural network accelerators built on RISC-V instruction set extensions have become a research hotspot. Deep learning compilers are critical for deploying neural network models efficiently on hardware platforms. However, adapting the tuning rules of existing compilers to the characteristics of a particular piece of hardware places very high demands on developers. In addition, existing compilers lack compilation support for dedicated transposition hardware units and cannot achieve efficient transposition by restructuring the dataflow, so the performance potential of such hardware is not fully exploited. To address these issues, this paper first designs an MLIR-based compiler toolchain targeting RISC-V neural network accelerators, enabling end-to-end model deployment. Second, it introduces hardware-specific dialects within the MLIR framework, which allow the compiler to optimize tiled matrix multiplication by taking into account the accelerator buffer size, systolic array dimensions, double-buffering strategy, and dataflow pattern. Third, it implements efficient matrix transposition by adjusting the dataflow pattern to match the hardware architecture, eliminating the need for large-scale data exchange. Finally, it implements convolution operator fusion and fusion of matrix multiplication with vector computation, based on the characteristics of the systolic array. Experimental results show an average speedup of 14.55× after transposition optimization, an average speedup of 41.59% after operator fusion, and an overall average speedup of 5.13% for BERT model execution.
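
To make the tiling and transposition ideas in the abstract concrete, the following is a minimal Python sketch and not the paper's compiler. It assumes a simplified scratchpad model in which one A tile and one B tile (each t × t elements) must be resident at once, doubled when double buffering is enabled, and in which the tile edge must be a multiple of the systolic array dimension. The names `pick_tile_edge`, `tiled_matmul`, `capacity_elems`, and `array_dim` are hypothetical and do not come from the paper.

```python
import math


def pick_tile_edge(capacity_elems, array_dim, double_buffer=True):
    """Pick the largest square tile edge that fits the assumed scratchpad model.

    Assumption: two operand tiles of t*t elements are resident at once,
    and double buffering keeps a second copy of each.
    """
    copies = 2 if double_buffer else 1
    # Largest t with 2 * copies * t * t <= capacity_elems.
    max_t = math.isqrt(capacity_elems // (2 * copies))
    # Round down to a multiple of the systolic array dimension.
    t = (max_t // array_dim) * array_dim
    if t == 0:
        raise ValueError("scratchpad too small for one array-sized tile")
    return t


def tiled_matmul(a, b, t, transpose_a=False):
    """Compute A @ B (or A^T @ B) tile by tile on nested lists."""
    m = len(a[0]) if transpose_a else len(a)
    k = len(a) if transpose_a else len(a[0])
    n = len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, t):
        for j0 in range(0, n, t):
            for k0 in range(0, k, t):
                for i in range(i0, min(i0 + t, m)):
                    for j in range(j0, min(j0 + t, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + t, k)):
                            # The transpose is expressed by swapping the read
                            # indices; A^T is never written out to memory.
                            a_val = a[kk][i] if transpose_a else a[i][kk]
                            acc += a_val * b[kk][j]
                        c[i][j] = acc
    return c
```

In this sketch, `transpose_a=True` changes only the order in which elements of A are read, which is a software analogue of reconfiguring the dataflow instead of exchanging data in bulk. A real accelerator backend would make the equivalent choice when it generates load instructions, and its capacity model would depend on the actual hardware.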

Key words: Deep learning compiler, RISC-V, Accelerator, MLIR, Compilation optimization

CLC Number: 

  • TP314