Research Advances and Future Challenges of FPGA-based High Performance Computing

doi:10.11896/jsjkx.191100500C

Abstract

Improving energy efficiency and meeting the performance demands of emerging applications are two major challenges facing current supercomputing systems. With its low power consumption and reconfigurability, the FPGA (Field-Programmable Gate Array) offers a possible way to address both. Existing studies have explored the feasibility of FPGA-based high performance computing (HPC) by analyzing the measured performance of computational kernels on FPGAs, but these analyses do not cover the convolutional neural network kernel and lack a high-performance processor as a reference. Targeting the dominant kernels in today's HPC landscape (breadth-first search, sparse matrix-vector multiplication, stencil, Smith-Waterman and convolutional neural network), this paper summarizes the implementation and performance optimization of each kernel on FPGAs and compares the results with the SW26010 many-core processor. It also discusses several problems that arise when FPGAs are applied to HPC. The analysis shows that the energy efficiency of current FPGAs is up to 63 times that of the SW26010, and that the performance of emerging applications on FPGAs, such as graph computing and deep learning, is up to 26 times that of the SW26010. Going forward, reducing the communication overhead between the FPGA and the host, improving FPGA programmability, and completing FPGA-based software libraries for scientific computing can effectively promote the use of FPGAs in HPC.

Key words: Acceleration, Emerging applications, Energy efficiency, FPGA, High performance computing

CLC number: TP302

JIA Xun, QIAN Lei, WU Gui-ming, WU Dong, XIE Xiang-hui. Research Advances and Future Challenges of FPGA-based High Performance Computing[J]. Computer Science (计算机科学), 2019, 46(11): 11-19. https://doi.org/10.11896/jsjkx.191100500C
