计算机科学 ›› 2020, Vol. 47 ›› Issue (4): 6-12.doi: 10.11896/jsjkx.191000009

• 计算机体系结构 • 上一篇    下一篇

广义稠密对称特征问题标准化算法在GPU集群上的有效实现

刘世芳1,2, 赵永华1, 于天禹1,2, 黄荣锋1,2   

  1. 1 中国科学院计算机网络信息中心 北京100080;
    2 中国科学院大学 北京100080
  • 收稿日期:2019-10-08 出版日期:2020-04-15 发布日期:2020-04-15
  • 通讯作者: 赵永华(yhzhao@sccas.cn)
  • 基金资助:
    国家重点研发计划项目(2017YFB02022);中国科学院战略性先导科技专项(C类)(XDC01040000)

Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster

LIU Shi-fang1,2, ZHAO Yong-hua1, YU Tian-yu1,2, HUANG Rong-feng1,2   

  1. 1 Computer Network Information Center,Chinese Academy of Sciences,Beijing 100080,China;
    2 University of Chinese Academy of Sciences,Beijing 100080,China
  • Received:2019-10-08 Online:2020-04-15 Published:2020-04-15
  • Contact: ZHAO Yong-hua,born in 1966,Ph.D,professor,Ph.D supervisor.His main research interests include parallel algorithm and software,parallel programming model,spectral clustering data analysis and so on.
  • About author:LIU Shi-fang,born in 1994,Ph.D.Her main research interests include numerical parallel algorithm library implementation and optimization of parallel algorithm library under heterogeneous parallel machine.
  • Supported by:
    This work was supported by the National Key Research and Development Program of China(2017YFB02022) and Strategic Priority Research Program of Chinese Academy of Sciences (XDC01040000).

摘要: 广义稠密对称特征问题的求解是许多应用科学和工程的主要任务,并且是计算电磁学、电子结构、有限元模型和量子化学等计算中的重要部分。将广义对称特征问题转化为标准对称特征问题是求解广义稠密对称特征问题的关键计算步骤。针对GPU集群,文中给出了广义稠密对称特征问题标准化块算法在GPU集群上基于MPI+CUDA的实现。为了适应GPU集群的架构,广义对称特征问题标准化算法将正定矩阵的Cholesky分解与传统的广义特征问题标准化块算法相结合,降低了标准化算法中不必要的通信开销,并且增强了算法的并行性。在基于MPI+CUDA的标准化算法中,GPU与CPU之间的数据传输操作被用来掩盖GPU内的数据拷贝操作,这消除了拷贝所花费的时间,进而提高了程序的性能。同时,文中还给出了矩阵在二维通信网格中行通信域和列通信域之间完全并行的点对点的转置算法和基于MPI+CUDA的具有多个右端项的三角矩阵方程BX=A求解的并行块算法。在中科院计算机网络信息中心的超级计算机系统“元”上,每个计算节点配置 2 块 Nvidia Tesla K20 GPGPU卡及2 颗 Intel E5-2680 V2处理器,使用多达32个GPU对不同规模矩阵的基于MPI+CUDA的广义对称特征问题标准化算法进行测试,取得了较好的加速效果与性能,并且具有良好的可扩展性。当使用32个GPU对50000×50000阶的矩阵进行测试时,峰值性能达到了约9.21Tflops。

关键词: Cholesky分解, GPU集群, 广义对称特征问题标准化算法, 三角矩阵方程, 转置算法

Abstract: The solution of the generalized dense symmetric eigenproblem is the main task of many applied sciences and enginee-ring,and is an important part in the calculation of electromagnetics,electronic structures,finite element models and quantum che-mistry.Transforming generalized symmetric eigenproblem into a standard symmetric eigenproblem is an important computational step for solving the generalized dense symmetric eigenproblem.For the GPU cluster,the generalized blocked algorithm for gene-ralized dense symmetric eigenproblem was presented based on MPI+CUDA on GPU cluster.In order to adapt to the architecture of the GPU cluster,the generalized symmetric eigenproblem standardization algorithm presented in this paper adopts the method of combining the Cholesky decomposition of the positive definite matrix with the traditional standardized blocked algorithm,which reduces the unnecessary communication overhead in the standardized algorithm and increases the parallelism of the algorithm.Moreover,In the MPI+CUDA based generalized symmetric eigenproblem standardization algorithm,the data transferoperation between the GPU and the CPU is utilized to mask the data copy operation in the GPU,which eliminates the time spent on copying,thereby improving the performance of the program.At the same time,a fully parallel point-to-point transposition algo-rithm between the row communication domain and the column communication domain in the two-dimensional communication grid was presented.In addition,a parallel blocked algorithm based on MPI+CUDA for the solution of the triangular matrix equation BX=A with multiple right-end terms was also given.On the supercomputer system “Era” of the Computer Network Information Center of Chinese Academy of Sciences,each compute node is configured with 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors.This paper tested different scale matrices using up to 32GPUs.The implementation performance of the ge-neralized symmetric eigenproblem standardization algorithm based on MPI+CUDA has achieved better acceleration and perfor-mance,and have good scalability.When tested with 50000×50000-order matrix using 32GPUs,the peak performance reach approximately 9.21 Tflops.

Key words: Cholesky decomposition, Generalized symmetric eigenproblem standardization blocked algorithm, GPU cluster, Transpose algorithm, Triangular matrix equations

中图分类号: 

  • TP301
[1]KENT P C.Computational challenges of large-scale,long-time,first-principles molecular dynamics[J].Journal of Physics:Conference Series,2008,125:012058.
[2]AUCKENTHALER T,BLUM V,BUNGARTZ H J,et al.Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations[J].Parallel Computing,2011,37(12):783-794.
[3]SINGH D J.Introduction to the LAPW method[M]//Planewaves,Pseudopotentials and the LAPW Method.Boston,MA:Springer US,1994:35-43.
[4]CHEN X S,VONG S W,LI W,et al.Noda iterations for generalized eigenproblems following Perron-Frobenius theory[J].Numerical Algorithms,2019,80(3):937-955.
[5]CUCURINGU M,DAVIES P,GLIELMO A,et al.SPONGE:A generalized eigenproblem for clustering signed networks[J].arXiv:1904.08575,2019.
[6]BUTKOVICˇ P,JONES D.On special cases of the generalized max-plus eigenproblem[J].SIAM Journal on Matrix Analysis and Applications,2016,37(3):1002-1021.
[7]BLACKFORD L S,CHOI J,CLEARY A,et al.ScaLAPACK users’ guide[M].Philadelphia:SIAM,1997.
[8]ALPATOV P,BAKER G,EDWARDS C,et al.PLAPACK Parallel Linear Algebra Package Design Overview[C]//Proceedings of the 1997 ACM/IEEE Conference on Supercomputing.IEEE,1997:29-29.
[9]BAKER G,GUNNELS J,MORROW G,et al.PLAPACK:highperformance through high-level abstraction[C]// 1998 International Conference on Parallel Processing (Cat.No.98EX205).IEEE,1998.
[10]IMAMURA T,YAMADA S,MACHIDA M.Development of a high-performance eigensolver on a peta-scale next-generation supercomputer system[J].Progress in Nuclear Science and Technology,2011,2:643-650.
[11]POULSON J,MARKER B,GEIJN R A V D,et al.Elemental:A new framework for distributed memory dense matrix computations[J].ACM Transactions on Mathematical Software (TOMS),2013,39(2):13.
[12]PETSCHOW M,PEISE E,BIENTINESI P.High-performance solvers for dense Hermitian eigenproblems[J].SIAM Journal on Scientific Computing,2013,35(1):C1-C22.
[13]HENDRICKSON B,JESSUP E,SMITH C.Toward an efficient parallel eigensolver for dense symmetric matrices[J].SIAM Journal on Scientific Computing,1998,20(3):1132-1154.
[14]TOMOV S,NATH R,DU P,et al.MAGMA Users’ Guide[EB/OL].ICL,UTK (November 2009).https://icl.cs.utk.edu/projectsfiles/magma/doxygen/.
[15]HAIDAR A,SOLCÁR,GATES M,et al.Leading edge hybrid multi-GPU algorithms for generalized eigenproblems in electronic structure calculations[M]//Lecture Notes in Computer Science.Berlin:Springer,2013:67-80.
[16]ZHAO Y H,CHI X B,CHENG Q.Efficient parallel blocked algorithms for generalized Hermitian eigenproblem[J].Journal of Computer Research and Development,2007,44(10):1724-1732.
[17]LICHTENSTEIN W,JOHNSSON S L.Block-cyclic dense linear algebra[J].SIAM Journal on Scientific Computing,1993,14(6):1259-1288.
[18]BOLEN J,DAVIS A,DAZEY B,et al.Massively parallel distributed computing:World's first 281 gigaflop supercomputer[EB/OL].Proceedings of the Intel Supercomputer Users Group.https://pdfs.semanticscholar.org/52ed/b836555ead0769708968adbb58a64120d34b.pdf.
[19]BISSELING R H,VAN DE VORST J G G.Parallel LU decomposition on a transputer network[C]//Conference Organized by Koninklijke/Shell-Laboratory,Amsterdam.Berlin:Springer,1988:61-77.
[1] 许新鹏, 胡斌星.
基于ICCG法的飞行器部件强度校核快速计算方法
Fast Calculation Method of Aircraft Component Strength Check Based on ICCG
计算机科学, 2020, 47(11A): 624-627. https://doi.org/10.11896/jsjkx.191100154
[2] 王梅,王莎莎,孙莺萁,宋考平,田枫,廖士中.
SVRRPMCC:一种支持向量回归机的正则化路径近似算法
SVRRPMCC:A Regularization Path Approximation Algorithm of Support Vector Regression
计算机科学, 2017, 44(12): 42-47. https://doi.org/10.11896/j.issn.1002-137X.2017.12.008
[3] 陆慧娟,魏莎莎,关伟,缪燕子.
基于鱼群优化算法和Cholesky分解的RELM的基因表达数据分类
Improved RELM Based on Fish Swarm Optimization Algorithm and Cholesky Decomposition for Gene Expression Data Classification
计算机科学, 2014, 41(12): 226-230. https://doi.org/10.11896/j.issn.1002-137X.2014.12.049
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!