Computer Science ›› 2025, Vol. 52 ›› Issue (5): 83-90. doi: 10.11896/jsjkx.241200074

• High Performance Computing •

Research on LLM Vector Dot Product Acceleration Based on RISC-V Matrix Instruction Set Extension

CHEN Xuhao1,2,4, HU Sipeng1, LIU Hongchao1,3,4, LIU Boran4,5, TANG Dan1,4, ZHAO Di4,5

  1 Beijing Institute of Open Source Chip, Beijing 100080, China
  2 School of Information Science and Technology, ShanghaiTech University, Shanghai 210210, China
  3 Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450003, China
  4 State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  5 University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-12-10 Revised: 2025-02-14 Online: 2025-05-15 Published: 2025-05-12
  • Corresponding author: CHEN Xuhao (chenxh2022@shanghaitech.edu.cn)
  • About author: CHEN Xuhao, born in 2000, postgraduate. His main research interests include computer architecture and instruction set extension.
  • Supported by:
    Strategic Priority Research Program of the Chinese Academy of Sciences (XDA0320300).

Abstract: To meet the high-performance and low-power requirements of edge AI, this paper designs an application-specific instruction set processor for edge AI based on the RISC-V instruction set architecture, addressing practical digital signal processing problems on edge devices. With limited hardware overhead, the design improves the execution efficiency of edge AI and reduces its energy consumption, meeting the demand for efficient large language model (LLM) inference in edge AI applications. Targeting the characteristics of LLMs, custom instructions that extend the RISC-V instruction set are added to perform vector dot product computation, so that LLM operations are accelerated on dedicated vector dot product hardware. Based on the Nanhu version of the open-source high-performance RISC-V processor core XiangShan, a vector dot product application-specific instruction set processor, Nanhu-vdot, is implemented; it adds a vector dot product computation unit and pipeline processing logic to XiangShan Nanhu. In FPGA hardware tests, Nanhu-vdot performs vector dot products more than four times faster than the scalar method, with almost no additional hardware resources or power consumption. Using a hardware-software co-design approach for second-generation Generative Pre-Trained Transformer (GPT-2) model inference, speed improves by about 30% compared with a pure software implementation.

Key words: Instruction set extension, Vector dot product, Software and hardware collaboration, Large language model inference
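
To make the contrast between the scalar method and a dedicated dot-product instruction concrete, the following is a minimal Python sketch, not the paper's implementation. The abstract does not give the instruction encoding, operand width, or data type of Nanhu-vdot, so everything here is an assumption for illustration: `vdot4_i8` is a hypothetical instruction model that treats two 32-bit register values as four packed signed 8-bit lanes and adds the four lane products to an accumulator. The lane count is chosen only for illustration and is not meant to explain the measured speedup reported above.

```python
def dot_scalar(a, b):
    """Baseline scalar method: one multiply-accumulate per loop iteration."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def pack4_i8(vals):
    """Pack four signed 8-bit integers into one 32-bit word (lane 0 in the low byte)."""
    assert len(vals) == 4 and all(-128 <= v <= 127 for v in vals)
    word = 0
    for lane, v in enumerate(vals):
        word |= (v & 0xFF) << (8 * lane)
    return word


def _lane_i8(word, lane):
    """Extract one lane from a packed word and sign-extend it to a Python int."""
    byte = (word >> (8 * lane)) & 0xFF
    return byte - 256 if byte >= 128 else byte


def vdot4_i8(acc, rs1, rs2):
    """Software model of a HYPOTHETICAL custom instruction (assumed, not from the paper):
    acc + sum of the four signed 8-bit lane products of rs1 and rs2."""
    for lane in range(4):
        acc += _lane_i8(rs1, lane) * _lane_i8(rs2, lane)
    return acc


def dot_packed(a, b):
    """Dot product issued as one modeled vdot4_i8 call per four elements."""
    assert len(a) == len(b) and len(a) % 4 == 0
    acc = 0
    for i in range(0, len(a), 4):
        acc = vdot4_i8(acc, pack4_i8(a[i:i + 4]), pack4_i8(b[i:i + 4]))
    return acc


if __name__ == "__main__":
    a = [1, -2, 3, -4, 5, 6, -7, 8]
    b = [8, 7, -6, 5, 4, -3, 2, 1]
    # Both paths must agree; the packed path needs a quarter of the loop iterations.
    assert dot_scalar(a, b) == dot_packed(a, b)
    print(dot_packed(a, b))
```

In this model the packed path executes one modeled instruction per four elements instead of one multiply-accumulate per element, which is the general reason a dedicated dot-product unit can outperform a scalar loop. Real hardware gains also depend on memory bandwidth, pipeline latency, and operand packing cost, none of which this sketch captures.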

CLC Number: TP302