Computer Science (计算机科学), 2019, Vol. 46, Issue (11): 11-19. doi: 10.11896/jsjkx.191100500C
贾迅, 钱磊, 邬贵明, 吴东, 谢向辉
JIA Xun, QIAN Lei, WU Gui-ming, WU Dong, XIE Xiang-hui
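Abstract: Improving computational energy efficiency and meeting the performance demands of emerging applications are the two major challenges currently facing supercomputing systems. The low power consumption and reconfigurability of the FPGA (Field-Programmable Gate Array) offer a possible way to address both. Existing studies have explored the feasibility of applying FPGAs to high-performance computing by analyzing the measured performance of computational kernels on FPGAs, but their analyses do not cover the convolutional neural network kernel and lack a high-performance processor as a reference. For the main computational kernels in today's high-performance computing (breadth-first search, sparse matrix-vector multiplication, Stencil, Smith-Waterman, and convolutional neural networks), this paper summarizes the implementation and performance optimization of each kernel on FPGAs and compares the results with the SW26010 many-core processor. It also discusses several problems that arise when FPGAs are applied to high-performance computing. The analysis shows that the energy efficiency of current FPGAs is up to 63 times that of the SW26010, and that the performance of emerging applications on FPGAs (such as graph computing and deep learning) is up to 26 times that of the SW26010. In the future, reducing the communication overhead between the FPGA and the host, improving FPGA programmability, and completing FPGA-based scientific computing software libraries can effectively promote the use of FPGAs in high-performance computing.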
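As a concrete illustration of what one of the kernels named in the abstract computes, the following is a minimal software sketch of sparse matrix-vector multiplication. It assumes the common compressed sparse row (CSR) storage layout; the function name `spmv_csr`, the array names, and the sample matrix are hypothetical and are not taken from the paper, which surveys hardware implementations rather than this reference loop.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx and values hold each nonzero's column and value.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Irregular, data-dependent reads of x are what make this
        # kernel memory-bound on most architectures.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Hypothetical 3x3 example:
# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

The inner loop's indirect access `x[col_idx[k]]` is the part that accelerator designs typically target, for example through on-chip buffering of the vector; which techniques the surveyed FPGA designs actually use is described in the paper itself, not in this sketch.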