Computer Science ›› 2019, Vol. 46 ›› Issue (11): 11-19. doi: 10.11896/jsjkx.191100500C

• Surveys •

Research Advances and Future Challenges of FPGA-based High Performance Computing

JIA Xun, QIAN Lei, WU Gui-ming, WU Dong, XIE Xiang-hui   

  1. (State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China)
  • Received: 2018-09-15 Online: 2019-11-15 Published: 2019-11-14

Abstract: Improving energy efficiency and meeting the performance needs of emerging applications are two major challenges facing current supercomputing systems. With its low power consumption and flexible reconfigurability, the FPGA is a promising computing platform for addressing these challenges. To explore its feasibility, extensive research has analyzed the performance of high-performance computing (HPC) kernels on FPGAs. However, these studies do not consider the convolutional neural network kernel, and their analysis lacks a high-performance processor as a reference. Targeting the dominant kernels in today's HPC landscape, including breadth-first search, sparse matrix-vector multiplication, stencil, Smith-Waterman and convolutional neural network, this paper summarizes the implementation and performance optimization of these kernels on FPGAs. It also compares FPGAs with the SW26010 many-core processor in terms of performance and energy efficiency, and discusses the major problems of adopting FPGAs for building HPC systems. For the kernels considered in this paper, FPGAs can outperform the SW26010 processor by 63x in energy efficiency. For the performance of emerging applications such as graph analytics and deep learning, FPGAs can outperform the SW26010 by 26x. Lower communication overhead, better programmability and more complete software libraries for scientific computing will make FPGAs a suitable platform for future supercomputing systems.

Key words: Acceleration, Emerging applications, Energy efficiency, FPGA, High performance computing

CLC Number: TP302