Computer Science ›› 2025, Vol. 52 ›› Issue (12): 1-8.doi: 10.11896/jsjkx.250600014

• Computer Software & Architecture •

Neural Network Acceleration Architecture Based on RISC-V Instruction Set Extension

CAI Chenghuan1, WANG Yipin1, XU Jiabin2, ZHANG Fengzhe3, ZHOU Xuegong3, CAO Wei3, ZHANG Fan3, YU Xinsheng4   

1 School of Software Engineering, Fudan University, Shanghai 200433, China
    2 School of Computer Science and Technology, Fudan University, Shanghai 200433, China
    3 Institute of Big Data, Fudan University, Shanghai 200433, China
    4 The 32nd Research Institute, China Electronics Technology Group Corporation(CETC), Shanghai 201899, China
  • Received: 2025-06-03  Revised: 2025-09-04  Published: 2025-12-09
  • About author: CAI Chenghuan, born in 1999, postgraduate. His main research interest is domain-specific hardware-software co-design.
    ZHANG Fengzhe, born in 1982, Ph.D, associate professor, Ph.D supervisor, is a member of CCF (No. 21012M). His main research interests include computer architecture and system software, and brain-inspired computing.
  • Supported by:
    This work was supported by the National Key R&D Program of China (2022YFB4500903).

Abstract: To address the shortcomings of current RISC-V-based neural network accelerators in accelerating the matrix computations and nonlinear operations of Transformer-based models, a neural network acceleration architecture based on RISC-V instruction set extension, named Taurus, is proposed. The architecture introduces matrix instruction extensions tailored to the characteristics of Transformer models and employs a systolic array to perform matrix multiply-accumulate operations. To accelerate nonlinear computations, vector instruction extensions are added, along with specialized vector units that efficiently compute operations such as LayerNorm and Softmax. To keep the data supply balanced, memory access instruction extensions are optimized to provide sufficient data throughput to the matrix and vector computation units. The instruction set extensions adopt a scalar register expansion approach, embedding operand information directly into scalar registers; this enlarges the addressing space and reduces the number of instructions required for large-scale data computations. The Taurus architecture is simulated cycle-accurately on the gem5 platform. Compared with the open-source accelerator Gemmini, Taurus achieves an 80% improvement in systolic array utilization during general matrix multiplication. For inference on ResNet50 and BERT, Taurus delivers 1.3× and 31.3× speedups respectively over Gemmini, and 1,467× and 4,513× performance improvements respectively over the baseline RISC-V.
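
The following C sketch is a minimal functional illustration of the two computation patterns the abstract describes: a tiled matrix multiply-accumulate, which a systolic array evaluates by streaming operands through a grid of processing elements but whose result matches this reference loop nest, and a numerically stable Softmax of the kind the specialized vector units target. The tile size, names, and signatures here are hypothetical illustrations, not the paper's actual Taurus ISA.

    /* Functional sketch only: models what a Taurus-style matrix instruction
     * and vector instruction would compute. Tile size, names, and signatures
     * are hypothetical, not the paper's ISA. Compile with: cc sketch.c -lm */
    #include <math.h>
    #include <stdio.h>

    #define TILE 4                /* hypothetical systolic-array dimension */

    /* Tile multiply-accumulate C += A*B: the computation a systolic array
     * performs in hardware, written as a plain reference loop nest. */
    static void tile_mma(const float A[TILE][TILE], const float B[TILE][TILE],
                         float C[TILE][TILE]) {
        for (int i = 0; i < TILE; i++)
            for (int j = 0; j < TILE; j++)
                for (int k = 0; k < TILE; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    /* Numerically stable Softmax over one row: subtract the row maximum,
     * exponentiate, then normalize. This is the kind of nonlinear
     * operation the abstract assigns to the vector units. */
    static void softmax_row(float *x, int n) {
        float max = x[0], sum = 0.0f;
        for (int i = 1; i < n; i++) if (x[i] > max) max = x[i];
        for (int i = 0; i < n; i++) { x[i] = expf(x[i] - max); sum += x[i]; }
        for (int i = 0; i < n; i++) x[i] /= sum;
    }

    int main(void) {
        float A[TILE][TILE] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
        float B[TILE][TILE] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
        float C[TILE][TILE] = {{0}};
        tile_mma(A, B, C);                 /* B is identity, so C equals A */
        float row[TILE] = {1.0f, 2.0f, 3.0f, 4.0f};
        softmax_row(row, TILE);
        printf("C[3][3]=%.1f softmax[3]=%.3f\n", C[3][3], row[3]);
        return 0;
    }

Under the scalar register expansion approach the abstract describes, a descriptor such as a tile's base address and dimensions would be packed into scalar registers, so a single extended instruction could trigger an entire tile_mma-sized computation rather than a long sequence of scalar loads and multiplies; this is the source of the reduced instruction count reported for large-scale data computations.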

Key words: Neural networks, Matrix computation, Nonlinear computation, Instruction set extension

CLC Number: TP302
[1]YOU H,SUN Z,SHI H,et al.ViTCoD:Vision transformer acceleration via dedicated algorithm and accelerator co-design[C]//2023 IEEE International Symposium on High-Performance Computer Architecture(HPCA).IEEE,2023:273-286.
[2]WANG T,GONG L,WANG C,et al.ViA:A novel vision-transformer accelerator based on FPGA[J].IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,2022,41(11):4088-4099.
[3]JOUPPI N P,YOUNG C,PATIL N,et al.In-datacenter performance analysis of a tensor processing unit[C]//Proceedings of the 44th Annual International Symposium on Computer Architecture.2017:1-12.
[4]LIANG X Y.Ascend AI Processor Architecture and Programming:In-Depth Understanding of CANN Technology Principles and Applications [M].Beijing:Tsinghua University Press,2019.
[5]LIU M Y,LI C H,LIN C Y,et al.Matrix Accelerator Designed for Vision Transformer[C]//2024 IEEE International Conference on Consumer Electronics-Asia(ICCE-Asia).IEEE,2024:1-2.
[6]KIM S,HOOPER C,WATTANAWONG T,et al.Full stack optimization of transformer inference:a survey[J].arXiv:2302.14017,2023.
[7]CUI E,LI T,WEI Q.RISC-V instruction set architecture extensions:A survey[J].IEEE Access,2023,11:24696-24711.
[8]CAMMARATA D,PEROTTI M,BERTULETTI M,et al.Quadrilatero:A RISC-V programmable matrix coprocessor for low-power edge applications[J].arXiv:2504.07565,2025.
[9]PUROHIT Y,PAREEK D,SAVANI V.Development of a System on Chip(SoC) for Matrix Multiplication Utilizing RISC-V and Vector Processor[C]//International Conference on Sustainable and Innovative Solutions for Current Challenges in Engineering & Technology.Springer,2025:1-12.
[10]JIAO Q,HU W,LIU F,et al.RISC-VTF:RISC-V based extended instruction set for transformer[C]//2021 IEEE International Conference on Systems,Man,and Cybernetics(SMC).IEEE,2021:1565-1570.
[11]BUTKO A,GARIBOTTI R,OST L,et al.Accuracy evaluation of gem5 simulator system[C]//7th International Workshop on Reconfigurable and Communication-centric Systems-on-chip(ReCoSoC).IEEE,2012:1-7.
[12]LOWE-POWER J,AHMAD A M,AKRAM A,et al.The gem5 simulator:Version 20.0+[J].arXiv:2007.03152,2020.
[13]SHAO Y S,XI S L,SRINIVASAN V,et al.Co-designing accelerators and SoC interfaces using gem5-Aladdin[C]//2016 49th Annual IEEE/ACM International Symposium on Microarchitecture(MICRO).IEEE,2016:1-12.
[14]ROGERS S,SLYCORD J,BAHARANI M,et al.gem5-salam:A system architecture for llvm-based accelerator modeling[C]//2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture(MICRO).IEEE,2020:471-482.
[15]VIEIRA J,ROMA N,FALCAO G,et al.Gem5-accel:A pre-RTL simulation toolchain for accelerator architecture validation[J].IEEE Computer Architecture Letters,2023,23(1):1-4.
[16]FEIST T.Vivado design suite[Z].White Paper,2012:24.
[17]GENC H,KIM S,AMID A,et al.Gemmini:Enabling systematic deep-learning architecture evaluation via full-stack integration[C]//2021 58th ACM/IEEE Design Automation Conference(DAC).IEEE,2021:769-774.
[18]CAVALCANTE M,SCHUIKI F,ZARUBA F,et al.Ara:A 1-GHz+ scalable and energy-efficient RISC-V vector processor with multiprecision floating-point support in 22-nm FD-SOI[J].IEEE Transactions on Very Large Scale Integration(VLSI) Systems,2019,28(2):530-543.
[19]GAURAV T,BHATT A,PAREKH R.Design and Implementation of low power RISC-V ISA based coprocessor design for Matrix multiplication[C]//2021 Second International Conference on Electronics and Sustainable Communication Systems(ICESC).IEEE,2021:189-195.
[20]TAI H Y.Enhanced RISC-V Matrix Extension Architecture[D].Taiwan:National Yang Ming Chiao Tung University,2023.
[21]YI X,ANTONIO R,DUMOULIN J,et al.OpenGeMM:A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling[J].arXiv:2411.09543,2024.
[22]Working draft of the proposed RISC-V V vector extension[EB/OL].https://github.com/riscv/riscv-v-spec.
[23]CHEN C,XIANG X,LIU C,et al.Xuantie-910:A commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension:Industrial product[C]//2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture(ISCA).IEEE,2020:52-64.
[24]KIRAN D C,GURUNARAYANAN S,MISRA J P,et al.Register allocation for fine grain threads on multicore processor[J].Journal of King Saud University-Computer and Information Sciences,2017,29(1):85-92.
[25]PALA D.Design and programming of a coprocessor for a RISC-V architecture[D].Torino:Politecnico di Torino,2017.
[26]WATERMAN A,LEE Y,PATTERSON D A,et al.The RISC-V instruction set manual,volume I:User-level ISA,version 2.0:Tech.Rep.:UCB/EECS-2014-54[R].EECS Department,University of California,Berkeley,2014:4.
[27]SZE V,CHEN Y H,YANG T J,et al.Efficient processing of deep neural networks:A tutorial and survey[J].Proceedings of the IEEE,2017,105(12):2295-2329.
[28]CAPRA M,BUSSOLINO B,MARCHISIO A,et al.Hardware and software optimizations for accelerating deep neural networks:Survey of current trends,challenges,and the road ahead[J].IEEE Access,2020,8:225134-225180.
[29]THOMAS D,MOORBY P.The Verilog® hardware description language[M].Springer Science & Business Media,2008.