批量厄米矩阵特征值分解的GPU算法

doi:10.11896/jsjkx.220100232

计算机科学 ›› 2023, Vol. 50 ›› Issue (4): 397-403.doi: 10.11896/jsjkx.220100232

• 交叉&前沿 • 上一篇

批量厄米矩阵特征值分解的GPU算法

黄荣锋^1,2, 刘世芳^1,2, 赵永华¹

1 中国科学院计算机网络信息中心北京 100080
2 中国科学院大学北京 100080

收稿日期:2022-01-25 修回日期:2022-08-24 出版日期:2023-04-15 发布日期:2023-04-06
通讯作者: 赵永华(yhzhao@sccas.cn)
作者简介:(hrf@sccas.cn)
基金资助:
国家重点研发计划(2017YFB0202202);中国科学院战略性先导科技专项(XDC05000000)

Batched Eigenvalue Decomposition Algorithms for Hermitian Matrices on GPU

HUANG Rongfeng^1,2, LIU Shifang^1,2, ZHAO Yonghua¹

1 Computer Network Information Center,Chinese Academy of Sciences,Beijing 100080,China
2 University of Chinese Academy of Sciences,Beijing 100080,China

Received:2022-01-25 Revised:2022-08-24 Online:2023-04-15 Published:2023-04-06
About author:HUANG Rongfeng,born in 1990,Ph.D candidate.His main research interests include efficient implementation of math library on heterogeneous platforms and so on.
ZHAO Yonghua,born in 1966,Ph.D,professor,Ph.D supervisor.His main research interests include parallel algorithm and software,parallel programming model,spectral clustering data analysis and so on.
Supported by:
National Key Research and Development Program of China(2017YFB0202202)and Strategic Priority Research Program of Chinese Academy of Sciences(XDC05000000).

摘要/Abstract

摘要： 批量矩阵计算问题广泛存在于科学计算与工程应用领域。随着性能的快速提升,GPU已成为解决这类问题的重要工具之一。矩阵特征值分解属于双边分解,需要使用迭代算法进行求解,不同矩阵的迭代次数可能不同,因此,在GPU上设计批量矩阵的特征值分解算法比设计LU分解等单边分解算法更具挑战性。文中针对不同规模的矩阵,基于Jacobi算法设计了相应的批量厄米矩阵特征值分解GPU算法。对于共享内存无法存储的矩阵,采用矩阵“块”操作技术提升计算强度,从而提高GPU的资源利用率。所提算法完全在GPU上运行,避免了CPU与GPU之间的通信。在算法实现上,通过kernel融合减少了kernel启动负载和全局内存访问。在V100 GPU上的实验结果表明,所提算法优于已有工作。Roofline性能分析模型表明,文中给出的实现已接近理论上限,达到了4.11TFLOPS。

关键词: 厄米矩阵, 特征值分解, 批量计算, Roofline模型, 性能分析

Abstract: Batched matrix computing problems are widely existed in scientific computing and engineering applications.With rapid performance improvements,GPU has become an important tool to solve such problems.The eigenvalue decomposition belongs to the two-sided decomposition and must be solved by the iterative algorithm.Iterative numbers for different matrices can be varied.Therefore,designing eigenvalue decomposition algorithms for batched matrices on the GPU is more challenging than designing batched algorithms for the one-sided decomposition,such as LU decomposition.This paper proposes batched algorithms based on the Jacobi algorithms for eigenvalue decomposition of Hermitian matrices.For matrices that cannot reside in shared memory wholly,the block technique is used to improve the arithmetic intensity,thus improving the use of GPU resources.Algorithms presented in this paper run completely on the GPU,avoiding the communication between the CPU and GPU.Kernel fusion is adopted to decrease the overhead of launching kernel and global memory access.Experimental results on V100 GPU show that our algorithms are better than existing works.Performance evaluation results of the Roofline model indicate that our implementations are close to the upper bound,approaching 4.11TFLOPS.

Key words: Hermitian matrix, Eigenvalue decomposition, Batch computing, Roofline model, Performance evaluation

中图分类号:

TP301

黄荣锋, 刘世芳, 赵永华. 批量厄米矩阵特征值分解的GPU算法[J]. 计算机科学, 2023, 50(4): 397-403. https://doi.org/10.11896/jsjkx.220100232

HUANG Rongfeng, LIU Shifang, ZHAO Yonghua. Batched Eigenvalue Decomposition Algorithms for Hermitian Matrices on GPU[J]. Computer Science, 2023, 50(4): 397-403. https://doi.org/10.11896/jsjkx.220100232

参考文献

[1]ABDELFATTAH A,BABOULIN M,DOBREV V,et al.High-performance Tensor Contractions for GPUs[J].Procedia Computer Science,2016,80(1):108-118.
[2]MOLERO J M,GARZÓN E M,GARCÍA I,et al.Efficient Implementation of Hyperspectral Anomaly Detection Techniques on GPUs and Multicore Processors[J].IEEE Journal of Selec-ted Topics in Applied Earth Observations and Remote Sensing,2014,7(6):2256-2266.
[3]VILLA O,GAWANDE N,TUMEO A.Accelerating Subsurface Transport Simulationon Heterogeneous Clusters[C]//International Conference on Cluster Computing.New York:IEEE Press,2013:1-8.
[4]ZHANG T,LIU X,WANG X,et al.cuTensor-Tubal:Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs[J].IEEE Transactions on Parallel and Distributed Systems,2020,31(3):595-610.
[5]DONG T,HAIDAR A,TOMOV S,et al.A Fast Batched Cho-lesky Factorization on a GPU[C]//International Conference on Parallel Processing.New York:IEEE Press,2014:432-440.
[6]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Factorization and Inversion of a Million Matrices using GPUs:Challenges and Countermeasures[C]//International Conference on Computational Science.Berlin:Springer,2017:606-615.
[7]ABDELFATTAH A,COSTA T,DONGARRA J,et al.A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines[J].ACM Transactions on Mathematical Software,2021,47(7):1-23.
[8]FRANCIS J G F.The QR Transformation:A Unitary Analogue to The LR Transformation—Part 1[J].The Computer Journal,1961,4(3):265-271.
[9]FRANCIS J G F.The QR Transformation—Part 2[J].TheComputer Journal,1964,4(4):332-345.
[10]CUPPEN J.A Divide and Conquer Method for The Symmetric Eigenproblem[J].Numerical Mathematics,1980,36(2):177-195.
[11]GU M,EISENSTAT S C.A Divide-and-Conquer Algorithm for the Symmetric Tridiagonal Eigenproblem[J].SIAM Journal on Matrix Analysis and Applications,1995,16(1):172-191.
[12]BIENTINESI P,DHILLON I S,GEIJN R A V D.A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations[J].SIAM Journal on Scientific Computing,2005,27(1):43-66.
[13]WILLEMS P R,LANG B.A Framework for The MR3 Algo-rithm:Theory and Implementation[J].SIAM Journal on Scientific Computing,2013,35(2):A740-A766.
[14]WANG T,GUO L,LI G,et al.Implementing the Jacobi Algorithm for Solving Eigenvalues of Symmetric Matrices with CUDA[C]//International Conference on Networking,Architecture,and Storage.New York:IEEE Press,2012:69-78.
[15]TORUN M U,YILMA O,AKANSU A N.FPGA,GPU,andCPU Implementations of Jacobi Algorithm for Eigenanalysis[J].Journal of Parallel & Distributed Computing,2016,96(12):172-180.
[16]YU W Z,MOUSSA J,KUS P,et al.GPU-Acceleration of The ELPA2 Distributed Eigensolver for Dense Symmetric and Hermitian Eigenproblems[J].Computer Physics Communications,2021,262(5):107808.
[17]WILLIAMS S,WATERMAN A,PATTERSON D.Roofline:AnInsightful Visual Performance Model for Multicore Architectures[J].Communication of the ACM,2009,52(4):65-76.
[18]BRENT R P,LUK F T.The Solution of Singular Value and Symmetric Eigenvalue Problems on Multiprocessor Arrays[J].SIAM Journal on Scientific and Statistical Computing,1985,6(1):69-84.
[19]RIVERA C,CHEN J,XIONG N,et al.TSM2X:High-Perfor-mance Tall-and-Skinny Matrix-Matrix Multiplication on GPUs[J].Journal of Parallel and Distributed Computing,2021,151(3):70-85.
[20]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Perfor-mance,Design,and Autotuning of Batched GEMM For GPUs[C]//International Conference on High Performance Computing.Berlin:Springer,2016:21-38.
[21]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Perfor-mance tuning and optimization techniques of fixed and variable size batched Cholesky factorization on GPUs[J].Procedia Computer Science,2016,80(C):119-130.

相关文章 15

[1]	朱雨, 庞建民, 徐金龙, 陶小涵, 王军. 面向SW26010处理器的三维Stencil自适应分块参数算法 Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor 计算机科学, 2021, 48(6): 10-18. https://doi.org/10.11896/jsjkx.200700059
[2]	吴培培, 吴兆贤, 唐文兵. 基于吸收态马尔可夫链的智能无人车系统实时性能分析 Real-time Performance Analysis of Intelligent Unmanned Vehicle System Based on Absorbing Markov Chain 计算机科学, 2021, 48(11A): 147-153. https://doi.org/10.11896/jsjkx.210300050
[3]	李梦潇, 姚仕元. 基于PCA的人脸识别系统的设计与改进 Design and Improvement of Face Recognition System Based on PCA 计算机科学, 2019, 46(6A): 577-579.
[4]	李璐璐,裘雪红,周端,张剑贤. 片上网络容错技术研究 Research on Fault Tolerant Technology for Networks-on-Chip 计算机科学, 2018, 45(3): 305-310. https://doi.org/10.11896/j.issn.1002-137X.2018.03.050
[5]	唐滔,彭林,黄春,杨灿群. 面向存储层次设计优化的GPU程序性能分析 Performance Analysis of GPU Programs Towards Better Memory Hierarchy Design 计算机科学, 2017, 44(12): 1-10. https://doi.org/10.11896/j.issn.1002-137X.2017.12.001
[6]	杜欣,汪春燕,倪友聪,叶鹏,肖如良. 基于规则的软件体系结构层性能优化模型 Rule-based Performance Optimization Model at Software Architecture Level 计算机科学, 2015, 42(10): 189-192.
[7]	杨雷,代钰,张斌,王昊. 考虑虚拟机间性能互扰基于排队网的多层Web应用性能分析模型 Queueing Network Based Performance Model for Multi-tiered Web Application with Consideration of Performance Interference among Virtual Machines 计算机科学, 2015, 42(1): 47-49. https://doi.org/10.11896/j.issn.1002-137X.2015.01.010
[8]	巩庆奎,张常有,张先轶,张云泉. 基于Julia语言的并行计算方法初探 Primary Investigation into Parallel Computing in Julia Language 计算机科学, 2015, 42(1): 44-46. https://doi.org/10.11896/j.issn.1002-137X.2015.01.009
[9]	王望,姚淑珍,谭火彬. 基于DSPN的TTE总线建模与性能分析 Modeling and Performance Analysis of TTE Bus Based on DSPN 计算机科学, 2014, 41(7): 45-48. https://doi.org/10.11896/j.issn.1002-137X.2014.07.008
[10]	刘高峰,李明,王亚军,张鹏. 基于非负特征值分解的极化SAR子空间分解滤波 Subspace Decomposition Filtering Based on NNED for Polarimetric SAR 计算机科学, 2013, 40(5): 266-270.
[11]	陈刚,陈旭,徐元,边昳,鲁华祥. 基于CORDIC算法的高精度浮点对称矩阵特征值分解的FPGA实现 Floating-point CORDIC-based Real-valued Symmetric Matrix Eigenvalue Decomposition on FPGA 计算机科学, 2013, 40(5): 35-37.
[12]	林玉娥，李敬兆，梁兴柱，林玉荣. 边界近邻零空间鉴别分析 Marginal Neighborhood Nullspace Discriminant Analysis 计算机科学, 2013, 40(3): 291-294.
[13]	张斌,王曦. 面向Web服务的SAML路径验证协议及其性能分析 SAMI. Path Verification Protocol for Web Service and its Performance Analysis 计算机科学, 2013, 40(3): 192-196.
[14]	黄骁飞,白晓颖,苑丽杰. 异构云平台性能监控与分析研究 Performance Monitoring and Analysis of Heterogeneous Cloud Platforms 计算机科学, 2013, 40(11): 147-151.
[15]	官铮，赵东风，余介夫. 一种TDMA无线传感器网络MAC协议能量有效性分析 Analysis Model for the Efficient of the TDMA MAC for Wireless Sensor Network 计算机科学, 2012, 39(Z6): 151-153.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

批量厄米矩阵特征值分解的GPU算法

Batched Eigenvalue Decomposition Algorithms for Hermitian Matrices on GPU

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 15

Metrics

本文评价

推荐阅读 0