Computer Science ›› 2025, Vol. 52 ›› Issue (5): 83-90. doi: 10.11896/jsjkx.241200074

• High Performance Computing •

Research on LLM Vector Dot Product Acceleration Based on RISC-V Matrix Instruction Set Extension

CHEN Xuhao1,2,4, HU Sipeng1, LIU Hongchao1,3,4, LIU Boran4,5, TANG Dan1,4, ZHAO Di4,5   

  1. Beijing Institute of Open Source Chip, Beijing 100080, China
    2. School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
    3. Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450003, China
    4. State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
    5. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-12-10 Revised: 2025-02-14 Online: 2025-05-15 Published: 2025-05-12
  • About author: CHEN Xuhao, born in 2000, postgraduate. His main research interests include computer architecture and instruction set extension.
  • Supported by: Strategic Priority Research Program of the Chinese Academy of Sciences (XDA0320300).

Abstract: To meet the high-performance, low-power requirements of edge AI, this paper designs a specialized instruction set processor based on the RISC-V instruction set architecture, addressing practical problems in digital signal processing on edge devices. The design improves the execution efficiency of edge AI workloads and reduces their energy consumption at a limited hardware cost, meeting the demand for efficient large language model (LLM) inference in edge AI applications. Targeting the computational characteristics of large language models, custom instructions for vector dot-product computation are added to the RISC-V instruction set, so that LLM computation is accelerated on dedicated vector dot-product hardware. On top of the open-source high-performance RISC-V core XiangShan Nanhu, the vector dot-product specialized instruction set processor Nanhu-vdot is implemented by adding vector dot-product functional units and the corresponding pipeline control logic. In FPGA hardware tests, Nanhu-vdot achieves more than four times the speed of scalar code on vector dot-product computation. Applying a hardware-software co-design approach to inference of the second-generation Generative Pre-trained Transformer (GPT-2) model, end-to-end speed improves by about 30% over a pure software implementation, with almost no additional hardware resources or power consumption.
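The acceleration path described in the abstract can be made concrete with a short sketch. The C code below contrasts the scalar multiply-accumulate loop that dominates LLM inference with a call to a hypothetical custom dot-product instruction emitted through the GNU assembler's .insn directive. The custom-0 opcode (0x0B), the funct fields, the int8 operand type, and the fixed 16-element chunk width are all illustrative assumptions; the abstract does not specify the actual Nanhu-vdot encoding or datapath width.

    /* A minimal sketch, not the paper's actual design: it contrasts the
     * scalar multiply-accumulate loop with a call to a hypothetical custom
     * RISC-V dot-product instruction. Opcode/funct values and the chunk
     * width are placeholder assumptions. */
    #include <stdint.h>
    #include <stddef.h>

    /* Scalar baseline: the loop the custom hardware replaces. */
    static int32_t dot_scalar(const int8_t *a, const int8_t *b, size_t n)
    {
        int32_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc += (int32_t)a[i] * (int32_t)b[i];
        return acc;
    }

    #if defined(__riscv)
    #define VDOT_CHUNK 16 /* assumed fixed width of the dot-product unit */

    /* Hypothetical instruction: rd = dot product of the two
     * VDOT_CHUNK-element int8 vectors addressed by rs1 and rs2. Emitted
     * with the GNU assembler's .insn directive (R-type, custom-0 opcode
     * 0x0B), so no modified compiler is needed, only the custom execution
     * unit in the core. */
    static inline int32_t vdot_chunk(const int8_t *a, const int8_t *b)
    {
        int32_t r;
        __asm__ volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                         : "=r"(r)
                         : "r"(a), "r"(b)
                         : "memory");
        return r;
    }

    /* Full-length dot product: chunked custom-instruction calls plus a
     * scalar tail for lengths not a multiple of the chunk width. */
    static int32_t dot_accel(const int8_t *a, const int8_t *b, size_t n)
    {
        int32_t acc = 0;
        size_t i = 0;
        for (; i + VDOT_CHUNK <= n; i += VDOT_CHUNK)
            acc += vdot_chunk(a + i, b + i);
        for (; i < n; i++)
            acc += (int32_t)a[i] * (int32_t)b[i];
        return acc;
    }
    #endif

In a matrix-vector product W*x, the kernel behind GPT-2's attention and fully connected layers, each output element is one such dot product, which is how a single-instruction primitive translates into an end-to-end gain. The two reported figures are also mutually consistent under Amdahl's law: a 4x kernel speedup yields roughly a 1.3x overall speedup when dot products account for about 31% of baseline runtime, since 1/((1-f)+f/4) = 1.3 at f ≈ 0.31.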

Key words: Instruction set extension, Vector dot product, Hardware-software co-design, Large language model inference

CLC Number: TP302