Computer Science ›› 2023, Vol. 50 ›› Issue (11): 8-14. doi: 10.11896/jsjkx.221100104

• High Performance Computing •

Acceleration Design and FPGA Implementation of CNN Scene Matching Algorithm

WANG Xiaofeng, LI Chaoran, LU Kunfeng, LUAN Tianjiao, YAO Na, ZHOU Hui, XIE Yujia

  1. Beijing Aerospace Automatic Control Institute, Beijing 100854, China
    National Aerospace Intelligence Control Technology Laboratory, Beijing 100854, China
  • Received: 2022-11-12 Revised: 2023-05-11 Online: 2023-11-15 Published: 2023-11-06
  • Corresponding author: WANG Xiaofeng (wangxf.casc@foxmail.com)
  • About author: WANG Xiaofeng, born in 1995, master, engineer. His main research interest is intelligent computing.
  • Supported by:
    State Key Laboratory Fund (61425010302).

Abstract: Compared with traditional methods, CNN-based scene matching algorithms offer higher matching accuracy, better adaptability, and stronger anti-interference ability. However, the algorithm has massive computing and storage requirements, which makes it very difficult to deploy at the edge. To improve real-time performance, an efficient edge-side acceleration scheme is designed and implemented. Based on an analysis of the computation characteristics and overall architecture of the algorithm, a correlation specific accelerator (CSA) for the feature correlation layer is designed using the Winograd fast convolution method, and an overall acceleration scheme is proposed in which the CSA and a deep-learning processor unit (DPU) compute the feature correlation layer and the feature extraction network in a pipelined manner. Experiments on the Xilinx ZCU102 development board show that the peak performance of the CSA reaches 576 GOPS, its actual performance reaches 422.08 GOPS, and its DSP usage efficiency reaches 4.5 operations/clock. The peak performance of the acceleration system reaches 1600 GOPS, and the throughput latency of the CNN scene matching algorithm is reduced to 157.89 ms. The experimental results show that the acceleration scheme uses the computing resources of the FPGA efficiently and achieves real-time computation of the CNN-based scene matching algorithm.
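The Winograd fast convolution method named in the abstract can be illustrated with a minimal pure-Python sketch. The abstract does not state which tile size the CSA uses, so the F(2x2, 3x3) variant below (with the standard transform matrices from Lavin and Gray) is an assumption chosen for illustration, and the function names are hypothetical. The sketch computes one 2x2 output tile with 16 element-wise multiplications and compares it against the direct sliding-window result, which needs 36.

```python
# Transform matrices for Winograd F(2x2, 3x3) (Lavin and Gray).
# The tile size is an assumption; the abstract does not specify it.
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def winograd_2x2_3x3(tile, kernel):
    """2x2 output tile from a 4x4 input tile and a 3x3 kernel."""
    u = matmul(matmul(G, kernel), transpose(G))    # transformed kernel, 4x4
    v = matmul(matmul(BT, tile), transpose(BT))    # transformed input, 4x4
    # The only data-dependent multiplications: 16 element-wise products.
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(AT, m), transpose(AT))    # inverse transform, 2x2


def direct_2x2_3x3(tile, kernel):
    """Reference sliding-window result (36 multiplications)."""
    return [[sum(tile[r + i][c + j] * kernel[i][j]
                 for i in range(3) for j in range(3))
             for c in range(2)] for r in range(2)]


tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(winograd_2x2_3x3(tile, kernel))
print(direct_2x2_3x3(tile, kernel))
```

In a hardware design the kernel transform can typically be computed once per kernel, so the multiplier (DSP) cost per tile is dominated by the 16 element-wise products; how the CSA actually maps these onto DSP slices is not described in the abstract.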

Key words: Acceleration computing, Scene matching algorithm, Deep learning, FPGA, Winograd algorithm, Specific accelerator
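Two derived quantities follow directly from the figures reported in the abstract. The short calculation below is only a reading aid: the constant names are hypothetical, and the frame-rate line assumes that, in steady-state pipelined operation, one matching result is produced per throughput latency interval, which the abstract implies but does not state explicitly.

```python
# Figures reported in the abstract.
PEAK_CSA_GOPS = 576.0          # peak performance of the CSA
ACTUAL_CSA_GOPS = 422.08       # measured performance of the CSA
THROUGHPUT_LATENCY_MS = 157.89 # throughput latency of the whole algorithm

# Fraction of the CSA's peak performance that is actually achieved.
csa_utilization = ACTUAL_CSA_GOPS / PEAK_CSA_GOPS

# Assumption: one result per throughput latency interval in steady state.
results_per_second = 1000.0 / THROUGHPUT_LATENCY_MS

print(f"CSA utilization: {csa_utilization:.1%}")
print(f"Results per second: {results_per_second:.2f}")
```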

CLC Number:

  • TP391