Computer Science ›› 2025, Vol. 52 ›› Issue (12): 1-8. doi: 10.11896/jsjkx.250600014

• Computer Software & Architecture •

Neural Network Acceleration Architecture Based on RISC-V Instruction Set Extension

CAI Chenghuan1, WANG Yipin1, XU Jiabin2, ZHANG Fengzhe3, ZHOU Xuegong3, CAO Wei3, ZHANG Fan3, YU Xinsheng4

  1 School of Software Engineering, Fudan University, Shanghai 200433, China
  2 School of Computer Science and Technology, Fudan University, Shanghai 200433, China
  3 Institute of Big Data, Fudan University, Shanghai 200433, China
  4 The 32nd Research Institute, China Electronics Technology Group Corporation (CETC), Shanghai 201899, China
  • Received: 2025-06-03 Revised: 2025-09-04 Online: 2025-12-09
  • Corresponding author: ZHANG Fengzhe (fzzhang@fudan.edu.cn)
  • About author: (18034065482@163.com)
    CAI Chenghuan, born in 1999, postgraduate. His main research interest is domain-specific hardware-software co-design.
    ZHANG Fengzhe, born in 1982, Ph.D., associate professor, Ph.D. supervisor, is a member of CCF (No. 21012M). His main research interests include computer architecture and system software, and brain-inspired computing.
  • Supported by:
    This work was supported by the National Key R&D Program of China (2022YFB4500903).

Abstract: Current RISC-V-based neural network accelerators provide insufficient acceleration for the matrix computations and nonlinear operations in Transformer-based models. To address this, a neural network acceleration architecture based on RISC-V instruction set extension, named Taurus, is proposed. The architecture adds matrix instruction extensions tailored to the characteristics of Transformer models and uses a systolic array for matrix multiply-accumulate operations. To accelerate nonlinear computations, it adds vector instruction extensions together with specialized vector units that compute LayerNorm and Softmax. To keep data supply balanced, the memory access instruction extensions are optimized so that the matrix and vector computation units receive sufficient data. The instruction extensions use a scalar-register-based approach: operand data information is stored in registers, which enlarges the addressing space and reduces the number of instructions generated for large-scale data computations. Taurus is simulated cycle-accurately on the Gem5 platform. Compared with the open-source accelerator Gemmini, Taurus improves systolic array utilization by 80% on general matrix multiplication. For ResNet50 and BERT inference, Taurus achieves 1.3× and 31.3× speedups over Gemmini, respectively, and 1,467× and 4,513× speedups over the RISC-V baseline, respectively.

Key words: Neural networks, Matrix computation, Nonlinear computation, Instruction set extension
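For readers less familiar with the two nonlinear operations named in the abstract, the following is a minimal reference sketch of what Softmax and LayerNorm compute over a single vector. It is plain Python for illustration only and says nothing about how the Taurus vector units implement them in hardware; the function names and the `gamma`, `beta`, and `eps` parameters are assumptions introduced here, not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """LayerNorm over one vector: normalize, then scale by gamma and shift by beta."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # population variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (x - mean) * inv_std + beta for x in xs]
```

Both operations need a reduction over the whole vector (a maximum and a sum for Softmax, a mean and a variance for LayerNorm) before any output element can be produced, which is one reason they map poorly onto a unit built only for multiply-accumulate.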

CLC Number: TP302