计算机科学 ›› 2023, Vol. 50 ›› Issue (4): 397-403.doi: 10.11896/jsjkx.220100232
• 交叉&前沿 • 上一篇
黄荣锋1,2, 刘世芳1,2, 赵永华1
HUANG Rongfeng1,2, LIU Shifang1,2, ZHAO Yonghua1
摘要: 批量矩阵计算问题广泛存在于科学计算与工程应用领域。随着性能的快速提升,GPU已成为解决这类问题的重要工具之一。矩阵特征值分解属于双边分解,需要使用迭代算法进行求解,不同矩阵的迭代次数可能不同,因此,在GPU上设计批量矩阵的特征值分解算法比设计LU分解等单边分解算法更具挑战性。文中针对不同规模的矩阵,基于Jacobi算法设计了相应的批量厄米矩阵特征值分解GPU算法。对于共享内存无法存储的矩阵,采用矩阵“块”操作技术提升计算强度,从而提高GPU的资源利用率。所提算法完全在GPU上运行,避免了CPU与GPU之间的通信。在算法实现上,通过kernel融合减少了kernel启动负载和全局内存访问。在V100 GPU上的实验结果表明,所提算法优于已有工作。Roofline性能分析模型表明,文中给出的实现已接近理论上限,达到了4.11TFLOPS。
中图分类号:
[1]ABDELFATTAH A,BABOULIN M,DOBREV V,et al.High-performance Tensor Contractions for GPUs[J].Procedia Computer Science,2016,80(1):108-118. [2]MOLERO J M,GARZÓN E M,GARCÍA I,et al.Efficient Implementation of Hyperspectral Anomaly Detection Techniques on GPUs and Multicore Processors[J].IEEE Journal of Selec-ted Topics in Applied Earth Observations and Remote Sensing,2014,7(6):2256-2266. [3]VILLA O,GAWANDE N,TUMEO A.Accelerating Subsurface Transport Simulationon Heterogeneous Clusters[C]//International Conference on Cluster Computing.New York:IEEE Press,2013:1-8. [4]ZHANG T,LIU X,WANG X,et al.cuTensor-Tubal:Efficient Primitives for Tubal-Rank Tensor Learning Operations on GPUs[J].IEEE Transactions on Parallel and Distributed Systems,2020,31(3):595-610. [5]DONG T,HAIDAR A,TOMOV S,et al.A Fast Batched Cho-lesky Factorization on a GPU[C]//International Conference on Parallel Processing.New York:IEEE Press,2014:432-440. [6]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Factorization and Inversion of a Million Matrices using GPUs:Challenges and Countermeasures[C]//International Conference on Computational Science.Berlin:Springer,2017:606-615. [7]ABDELFATTAH A,COSTA T,DONGARRA J,et al.A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines[J].ACM Transactions on Mathematical Software,2021,47(7):1-23. [8]FRANCIS J G F.The QR Transformation:A Unitary Analogue to The LR Transformation—Part 1[J].The Computer Journal,1961,4(3):265-271. [9]FRANCIS J G F.The QR Transformation—Part 2[J].TheComputer Journal,1964,4(4):332-345. [10]CUPPEN J.A Divide and Conquer Method for The Symmetric Eigenproblem[J].Numerical Mathematics,1980,36(2):177-195. [11]GU M,EISENSTAT S C.A Divide-and-Conquer Algorithm for the Symmetric Tridiagonal Eigenproblem[J].SIAM Journal on Matrix Analysis and Applications,1995,16(1):172-191. [12]BIENTINESI P,DHILLON I S,GEIJN R A V D.A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations[J].SIAM Journal on Scientific Computing,2005,27(1):43-66. [13]WILLEMS P R,LANG B.A Framework for The MR3 Algo-rithm:Theory and Implementation[J].SIAM Journal on Scientific Computing,2013,35(2):A740-A766. [14]WANG T,GUO L,LI G,et al.Implementing the Jacobi Algorithm for Solving Eigenvalues of Symmetric Matrices with CUDA[C]//International Conference on Networking,Architecture,and Storage.New York:IEEE Press,2012:69-78. [15]TORUN M U,YILMA O,AKANSU A N.FPGA,GPU,andCPU Implementations of Jacobi Algorithm for Eigenanalysis[J].Journal of Parallel & Distributed Computing,2016,96(12):172-180. [16]YU W Z,MOUSSA J,KUS P,et al.GPU-Acceleration of The ELPA2 Distributed Eigensolver for Dense Symmetric and Hermitian Eigenproblems[J].Computer Physics Communications,2021,262(5):107808. [17]WILLIAMS S,WATERMAN A,PATTERSON D.Roofline:AnInsightful Visual Performance Model for Multicore Architectures[J].Communication of the ACM,2009,52(4):65-76. [18]BRENT R P,LUK F T.The Solution of Singular Value and Symmetric Eigenvalue Problems on Multiprocessor Arrays[J].SIAM Journal on Scientific and Statistical Computing,1985,6(1):69-84. [19]RIVERA C,CHEN J,XIONG N,et al.TSM2X:High-Perfor-mance Tall-and-Skinny Matrix-Matrix Multiplication on GPUs[J].Journal of Parallel and Distributed Computing,2021,151(3):70-85. [20]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Perfor-mance,Design,and Autotuning of Batched GEMM For GPUs[C]//International Conference on High Performance Computing.Berlin:Springer,2016:21-38. [21]ABDELFATTAH A,HAIDAR A,TOMOV S,et al.Perfor-mance tuning and optimization techniques of fixed and variable size batched Cholesky factorization on GPUs[J].Procedia Computer Science,2016,80(C):119-130. |
[1] | 朱雨, 庞建民, 徐金龙, 陶小涵, 王军. 面向SW26010处理器的三维Stencil自适应分块参数算法 Adaptive Tiling Size Algorithm for 3D Stencil Computation on SW26010 Many-core Processor 计算机科学, 2021, 48(6): 10-18. https://doi.org/10.11896/jsjkx.200700059 |
[2] | 吴培培, 吴兆贤, 唐文兵. 基于吸收态马尔可夫链的智能无人车系统实时性能分析 Real-time Performance Analysis of Intelligent Unmanned Vehicle System Based on Absorbing Markov Chain 计算机科学, 2021, 48(11A): 147-153. https://doi.org/10.11896/jsjkx.210300050 |
[3] | 李梦潇, 姚仕元. 基于PCA的人脸识别系统的设计与改进 Design and Improvement of Face Recognition System Based on PCA 计算机科学, 2019, 46(6A): 577-579. |
[4] | 李璐璐,裘雪红,周端,张剑贤. 片上网络容错技术研究 Research on Fault Tolerant Technology for Networks-on-Chip 计算机科学, 2018, 45(3): 305-310. https://doi.org/10.11896/j.issn.1002-137X.2018.03.050 |
[5] | 唐滔,彭林,黄春,杨灿群. 面向存储层次设计优化的GPU程序性能分析 Performance Analysis of GPU Programs Towards Better Memory Hierarchy Design 计算机科学, 2017, 44(12): 1-10. https://doi.org/10.11896/j.issn.1002-137X.2017.12.001 |
[6] | 杜欣,汪春燕,倪友聪,叶 鹏,肖如良. 基于规则的软件体系结构层性能优化模型 Rule-based Performance Optimization Model at Software Architecture Level 计算机科学, 2015, 42(10): 189-192. |
[7] | 杨雷,代钰,张斌,王昊. 考虑虚拟机间性能互扰基于排队网的多层Web应用性能分析模型 Queueing Network Based Performance Model for Multi-tiered Web Application with Consideration of Performance Interference among Virtual Machines 计算机科学, 2015, 42(1): 47-49. https://doi.org/10.11896/j.issn.1002-137X.2015.01.010 |
[8] | 巩庆奎,张常有,张先轶,张云泉. 基于Julia语言的并行计算方法初探 Primary Investigation into Parallel Computing in Julia Language 计算机科学, 2015, 42(1): 44-46. https://doi.org/10.11896/j.issn.1002-137X.2015.01.009 |
[9] | 王望,姚淑珍,谭火彬. 基于DSPN的TTE总线建模与性能分析 Modeling and Performance Analysis of TTE Bus Based on DSPN 计算机科学, 2014, 41(7): 45-48. https://doi.org/10.11896/j.issn.1002-137X.2014.07.008 |
[10] | 刘高峰,李明,王亚军,张鹏. 基于非负特征值分解的极化SAR子空间分解滤波 Subspace Decomposition Filtering Based on NNED for Polarimetric SAR 计算机科学, 2013, 40(5): 266-270. |
[11] | 陈刚,陈旭,徐元,边昳,鲁华祥. 基于CORDIC算法的高精度浮点对称矩阵特征值分解的FPGA实现 Floating-point CORDIC-based Real-valued Symmetric Matrix Eigenvalue Decomposition on FPGA 计算机科学, 2013, 40(5): 35-37. |
[12] | 林玉娥,李敬兆,梁兴柱,林玉荣. 边界近邻零空间鉴别分析 Marginal Neighborhood Nullspace Discriminant Analysis 计算机科学, 2013, 40(3): 291-294. |
[13] | 张斌,王曦. 面向Web服务的SAML路径验证协议及其性能分析 SAMI. Path Verification Protocol for Web Service and its Performance Analysis 计算机科学, 2013, 40(3): 192-196. |
[14] | 黄骁飞,白晓颖,苑丽杰. 异构云平台性能监控与分析研究 Performance Monitoring and Analysis of Heterogeneous Cloud Platforms 计算机科学, 2013, 40(11): 147-151. |
[15] | 官铮,赵东风,余介夫. 一种TDMA无线传感器网络MAC协议能量有效性分析 Analysis Model for the Efficient of the TDMA MAC for Wireless Sensor Network 计算机科学, 2012, 39(Z6): 151-153. |
|