Computer Science ›› 2023, Vol. 50 ›› Issue (4): 397-403.doi: 10.11896/jsjkx.220100232

• Interdiscipline & Frontier • Previous Articles    

Batched Eigenvalue Decomposition Algorithms for Hermitian Matrices on GPU

HUANG Rongfeng1,2, LIU Shifang1,2, ZHAO Yonghua1   

  1. 1 Computer Network Information Center,Chinese Academy of Sciences,Beijing 100080,China
    2 University of Chinese Academy of Sciences,Beijing 100080,China
  • Received:2022-01-25 Revised:2022-08-24 Online:2023-04-15 Published:2023-04-06
  • About author:HUANG Rongfeng,born in 1990,Ph.D candidate.His main research interests include efficient implementation of math library on heterogeneous platforms and so on.
    ZHAO Yonghua,born in 1966,Ph.D,professor,Ph.D supervisor.His main research interests include parallel algorithm and software,parallel programming model,spectral clustering data analysis and so on.
  • Supported by:
    National Key Research and Development Program of China(2017YFB0202202)and Strategic Priority Research Program of Chinese Academy of Sciences(XDC05000000).

Abstract: Batched matrix computing problems are widely existed in scientific computing and engineering applications.With rapid performance improvements,GPU has become an important tool to solve such problems.The eigenvalue decomposition belongs to the two-sided decomposition and must be solved by the iterative algorithm.Iterative numbers for different matrices can be varied.Therefore,designing eigenvalue decomposition algorithms for batched matrices on the GPU is more challenging than designing batched algorithms for the one-sided decomposition,such as LU decomposition.This paper proposes batched algorithms based on the Jacobi algorithms for eigenvalue decomposition of Hermitian matrices.For matrices that cannot reside in shared memory wholly,the block technique is used to improve the arithmetic intensity,thus improving the use of GPU resources.Algorithms presented in this paper run completely on the GPU,avoiding the communication between the CPU and GPU.Kernel fusion is adopted to decrease the overhead of launching kernel and global memory access.Experimental results on V100 GPU show that our algorithms are better than existing works.Performance evaluation results of the Roofline model indicate that our implementations are close to the upper bound,approaching 4.11TFLOPS.

Key words: Hermitian matrix, Eigenvalue decomposition, Batch computing, Roofline model, Performance evaluation

CLC Number: 

  • TP301
[1]ABDELFATTAH A,BABOULIN M,DOBREV V,et al.High-performance Tensor Contractions for GPUs[J].Procedia Computer Science,2016,80(1):108-118.
[2]MOLERO J M,GARZÓN E M,GARCÍA I,et al.Efficient Implementation of Hyperspectral Anomaly Detection Techniques on GPUs and Multicore Processors[J].IEEE Journal of Selec-ted Topics in Applied Earth Observations and Remote Sensing,2014,7(6):2256-2266.
[3]VILLA O,GAWANDE N,TUMEO A.Accelerating Subsurface Transport Simulationon Heterogeneous Clusters[C]//International Conference on Cluster Computing.New York:IEEE Press,2013:1-8.
[4]ZHANG T,LIU X,WANG X,et al.cuTensor-Tubal:Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs[J].IEEE Transactions on Parallel and Distributed Systems,2020,31(3):595-610.
[5]DONG T,HAIDAR A,TOMOV S,et al.A Fast Batched Cho-lesky Factorization on a GPU[C]//International Conference on Parallel Processing.New York:IEEE Press,2014:432-440.
[6]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Factorization and Inversion of a Million Matrices using GPUs:Challenges and Countermeasures[C]//International Conference on Computational Science.Berlin:Springer,2017:606-615.
[7]ABDELFATTAH A,COSTA T,DONGARRA J,et al.A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines[J].ACM Transactions on Mathematical Software,2021,47(7):1-23.
[8]FRANCIS J G F.The QR Transformation:A Unitary Analogue to The LR Transformation—Part 1[J].The Computer Journal,1961,4(3):265-271.
[9]FRANCIS J G F.The QR Transformation—Part 2[J].TheComputer Journal,1964,4(4):332-345.
[10]CUPPEN J.A Divide and Conquer Method for The Symmetric Eigenproblem[J].Numerical Mathematics,1980,36(2):177-195.
[11]GU M,EISENSTAT S C.A Divide-and-Conquer Algorithm for the Symmetric Tridiagonal Eigenproblem[J].SIAM Journal on Matrix Analysis and Applications,1995,16(1):172-191.
[12]BIENTINESI P,DHILLON I S,GEIJN R A V D.A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations[J].SIAM Journal on Scientific Computing,2005,27(1):43-66.
[13]WILLEMS P R,LANG B.A Framework for The MR3 Algo-rithm:Theory and Implementation[J].SIAM Journal on Scientific Computing,2013,35(2):A740-A766.
[14]WANG T,GUO L,LI G,et al.Implementing the Jacobi Algorithm for Solving Eigenvalues of Symmetric Matrices with CUDA[C]//International Conference on Networking,Architecture,and Storage.New York:IEEE Press,2012:69-78.
[15]TORUN M U,YILMA O,AKANSU A N.FPGA,GPU,andCPU Implementations of Jacobi Algorithm for Eigenanalysis[J].Journal of Parallel & Distributed Computing,2016,96(12):172-180.
[16]YU W Z,MOUSSA J,KUS P,et al.GPU-Acceleration of The ELPA2 Distributed Eigensolver for Dense Symmetric and Hermitian Eigenproblems[J].Computer Physics Communications,2021,262(5):107808.
[17]WILLIAMS S,WATERMAN A,PATTERSON D.Roofline:AnInsightful Visual Performance Model for Multicore Architectures[J].Communication of the ACM,2009,52(4):65-76.
[18]BRENT R P,LUK F T.The Solution of Singular Value and Symmetric Eigenvalue Problems on Multiprocessor Arrays[J].SIAM Journal on Scientific and Statistical Computing,1985,6(1):69-84.
[19]RIVERA C,CHEN J,XIONG N,et al.TSM2X:High-Perfor-mance Tall-and-Skinny Matrix-Matrix Multiplication on GPUs[J].Journal of Parallel and Distributed Computing,2021,151(3):70-85.
[20]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Perfor-mance,Design,and Autotuning of Batched GEMM For GPUs[C]//International Conference on High Performance Computing.Berlin:Springer,2016:21-38.
[21]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Perfor-mance tuning and optimization techniques of fixed and variable size batched Cholesky factorization on GPUs[J].Procedia Computer Science,2016,80(C):119-130.
[1] LIU Lin-yun, CHEN Kai-yan, LI Xiong-wei, ZHANG Yang, XIE Fang-fang. Overview of Side Channel Analysis Based on Convolutional Neural Network [J]. Computer Science, 2022, 49(5): 296-302.
[2] WANG Yi-chao, LIAO Qiu-cheng, ZUO Si-cheng, XIE Rui, LIN Xin-hua. Performance Evaluation of ARM-ISA SoC for High Performance Computing [J]. Computer Science, 2019, 46(8): 95-99.
[3] ZHANG Biao, DONG Meng-yu, FAN Bei-bei. Enterprise Performance Evaluation Model Based on Triangular Fuzzy Multi-attribute Decision Making [J]. Computer Science, 2019, 46(6A): 547-549.
[4] LI Meng-xiao, YAO Shi-yuan. Design and Improvement of Face Recognition System Based on PCA [J]. Computer Science, 2019, 46(6A): 577-579.
[5] DING Wei-long, XUE Li-li, CHEN Wan-jun, WU Fu-li. Visualization Method for University Teachers’ Performance Data Based on Mixed Layout Strategy [J]. Computer Science, 2019, 46(2): 24-29.
[6] DONG Si-qi, LI Hai-long, QU Yu-ben, ZHANG Zhao, HU Lei. Survey of Research on Computation Unloading Strategy in Mobile Edge Computing [J]. Computer Science, 2019, 46(11): 32-40.
[7] LUO Shu-yan, ZHU Yi-an, ZENG Cheng. Performance Evaluation and Optimization of Inter-cores Communication for Heterogeneous
Multi-core Processor Unit
[J]. Computer Science, 2018, 45(6A): 262-265.
[8] ZHANG Shao-nan, QIU Ke-ni, ZHANG Wei-gong, WANG Jing, ZHENG Jia-xin, BAI Rui-ying and ZHU Xiao-yan. Queuing Theory-guided Performance Evaluation on Reconfigurable High-speed Device Connected Bus [J]. Computer Science, 2017, 44(Z6): 504-509.
[9] LI Hong-jun, CUI Xi-ning, MU Ming and HAN Wei. Research on Distributed Embedded Computer Performance Evaluation Model [J]. Computer Science, 2017, 44(4): 153-156.
[10] LIN Xin-hua, QIN Qiang, LI Shuo, WEN Min-hua and MATSUOKA Satoshi. Evaluating Intel AVX2 Vgather Instructions with Stencils [J]. Computer Science, 2017, 44(1): 20-24.
[11] WANG Wei, WANG Jia-jun, WANG Ming-ming, ZHANG Wen-jing and CHEN Jin-guang. Defense Technology Based on Dynamic Space-Time Performance for Flooding Attacks in Mobile Ad Hoc Networks [J]. Computer Science, 2017, 44(1): 159-166.
[12] SHEN Hua, HE Yan-xiang and ZHANG Ming-wu. Performance Bottlenecks Location Scheme Based on Structural Features of Service Composition Model [J]. Computer Science, 2015, 42(9): 107-117.
[13] ZHANG Cheng-wei, CHENG Wen-qing and HEI Xiao-jun. Measurement Study of 3G Mobile Networks Using Android Platform [J]. Computer Science, 2015, 42(2): 24-28.
[14] LI Zhi-jia, HU Xiang, JIAO Li and WANG Wei-feng. Performance Evaluation of Job Scheduling and InfiniBand Network Interconnection in High Performance Computing System Based on Stochastic Petri Nets [J]. Computer Science, 2015, 42(1): 33-37.
[15] ZHOU Xiao-peng,ZHANG Xiao-fang and ZHAO Xiao-nan. Research of Performance Evaluation of Cloud Storage [J]. Computer Science, 2014, 41(4): 190-194.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!