Computer Science ›› 2025, Vol. 52 ›› Issue (4): 94-100. doi: 10.11896/jsjkx.241000099

• Intelligent Embedded Systems •

Efficient Adaptive Convolutional Neural Network Accelerator for Resource-limited Chips

PANG Mingyi1, WEI Xianglin2, ZHANG Yunxiang2, WANG Bin2, ZHUANG Jianjun1

  1. School of Electronics and Information Engineering, Nanjing University of Information Science & Technology, Nanjing 211800, China
    2. 63rd Research Institute, National University of Defense Technology, Nanjing 210007, China
  • Received: 2024-10-21  Revised: 2025-02-22  Online: 2025-04-15  Published: 2025-04-14
  • Corresponding author: WEI Xianglin (weixianglin@nudt.edu.cn)
  • About author: (202312490285@nuist.edu.cn)

Efficient Adaptive CNN Accelerator for Resource-limited Chips

PANG Mingyi1, WEI Xianglin2, ZHANG Yunxiang2, WANG Bin2, ZHUANG Jianjun1   

  1. School of Electronics and Information Engineering, Nanjing University of Information Science & Technology, Nanjing 211800, China
    2. 63rd Research Institute, National University of Defense Technology, Nanjing 210007, China
  • Received: 2024-10-21  Revised: 2025-02-22  Online: 2025-04-15  Published: 2025-04-14
  • About author: PANG Mingyi, born in 2001, postgraduate. His main research interests include hardware acceleration and edge computing.
    WEI Xianglin, born in 1985, Ph.D, associate researcher. His main research interests include edge computing, deep learning and wireless network security.

Abstract: This paper proposes an adaptive convolutional neural network accelerator (ACNNA) for non-GPU, resource-limited chips, which can adaptively generate a corresponding hardware accelerator according to the resource constraints of the hardware platform and the structure of the convolutional neural network. Owing to its reconfigurability, ACNNA can effectively accelerate various combinations of network layers, including convolutional, pooling, activation, and fully connected layers. First, a resource-folding multi-channel processing engine (PE) array is designed, which folds the idealized convolution structure to save resources and unrolls along the output channels to support parallel computation. Second, multi-level storage and a ping-pong buffering mechanism are adopted to optimize the pipeline, effectively improving data-processing efficiency. Then, a resource reuse strategy under multi-level storage is proposed; combined with a design space exploration algorithm, it schedules hardware resource allocation according to the network parameters, enabling low-resource chips to deploy deeper network models with more parameters. Taking the LeNet5 and VGG16 network models as examples, ACNNA is validated on the Ultra96 V2 development board. The results show that the VGG16 deployment with ACNNA consumes as little as 4% of the resources of the original network. At a 100 MHz clock frequency, the LeNet5 accelerator achieves a computing rate of 0.37 GFLOPS at a power consumption of 2.05 W, and the VGG16 accelerator achieves 1.55 GFLOPS at 2.13 W. Compared with existing work, the proposed method improves FPS by more than 83%.
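
The folding and output-channel unrolling described above can be pictured with a short behavioral model. The Python sketch below is illustrative only: the parallelism factor P, the array layouts, and all names are assumptions made for exposition and do not reproduce ACNNA's actual PE-array design.

import numpy as np

# Behavioral model of a folded PE array (illustrative, not the paper's RTL):
# P output channels are computed in parallel per pass, and the outer pass loop
# "folds" the remaining output channels in time to save hardware resources.
def folded_conv2d(x, w, P=8):
    C_in, H, W = x.shape                       # input feature map (C_in, H, W)
    C_out, _, K, _ = w.shape                   # weights (C_out, C_in, K, K)
    H_out, W_out = H - K + 1, W - K + 1
    y = np.zeros((C_out, H_out, W_out), dtype=x.dtype)
    for base in range(0, C_out, P):            # folded passes over output channels
        for p in range(min(P, C_out - base)):  # P PEs would run concurrently here
            oc = base + p
            for i in range(H_out):
                for j in range(W_out):
                    y[oc, i, j] = np.sum(x[:, i:i+K, j:j+K] * w[oc])
    return y

In hardware terms, the inner p loop corresponds to the spatially unrolled PE columns, while the outer base loop is the temporal folding that trades throughput for resources.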

Key words: Hardware acceleration, Convolutional neural network, Design space exploration strategy, Field programmable gate array

Abstract: This paper proposes an adaptive convolutional neural network accelerator (ACNNA) for non-GPU chips with limited resources, which can adaptively generate hardware accelerators based on the resource constraints of the hardware platform and the structure of the convolutional neural network. Through its reconfigurability, ACNNA can effectively accelerate various layer combinations, including convolutional layers, pooling layers, activation layers, and fully connected layers. Firstly, a resource-folding multi-channel processing engine (PE) array is designed, which folds the idealized convolution structure to save resources and unrolls along the output channels to support parallel computing. Secondly, multi-level storage and a ping-pong caching mechanism are adopted to optimize the pipeline, effectively improving data-processing efficiency. Then, a resource reuse strategy under multi-level storage is proposed, which, combined with a design space exploration algorithm, schedules hardware resource allocation for the network parameters more reasonably, so that low-resource chips can deploy deeper network models with more parameters. Taking the LeNet5 and VGG16 network models as examples, ACNNA is validated on the Ultra96 V2 development board. The results show that the ACNNA deployment of VGG16 consumes as little as 4% of the resources of the original network. At a 100 MHz clock frequency, the LeNet5 accelerator achieves a computing rate of 0.37 GFLOPS with a power consumption of 2.05 W, and the VGG16 accelerator achieves 1.55 GFLOPS at 2.132 W. Compared with existing work, ACNNA increases frames per second (FPS) by over 83%.
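
For intuition on how a design space exploration step can trade parallelism against a chip's resource budget, the following Python sketch uses assumed cost models; the DSP and BRAM formulas, the budgets, and the candidate parallelism list are hypothetical and only show the shape of such a search, not ACNNA's actual exploration algorithm.

# Illustrative design-space exploration under assumed resource models.
# All cost formulas, budgets and candidates below are hypothetical.
def explore(layers, dsp_budget=360, bram_budget=216, mults_per_pe=9):
    best = None
    for p in (1, 2, 4, 8, 16, 32):                    # candidate output-channel parallelism
        dsp = p * mults_per_pe                        # assumed multiplier cost of p PEs
        bram = p + 2 * max(l["buffer_kb"] for l in layers) // 18  # ping-pong buffers (18 Kb blocks)
        if dsp > dsp_budget or bram > bram_budget:
            continue                                  # violates the platform constraint
        cycles = sum(l["macs"] / p for l in layers)   # folded schedule: time ~ MACs / parallelism
        if best is None or cycles < best["cycles"]:
            best = {"parallelism": p, "cycles": cycles, "dsp": dsp, "bram": bram}
    return best

# Hypothetical two-layer workload, used only to show the call:
print(explore([{"macs": 1.2e6, "buffer_kb": 64}, {"macs": 9.4e5, "buffer_kb": 32}]))

A larger parallelism factor shortens the folded schedule but is rejected once it exceeds the DSP or BRAM budget, which is how a deeper model with more parameters can still be made to fit on a low-resource chip.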

Key words: Hardware acceleration, Convolutional neural network, Design space exploration strategy, Field programmable gate array

CLC Number:

  • TP391