Computer Science ›› 2020, Vol. 47 ›› Issue (4): 6-12.doi: 10.11896/jsjkx.191000009

• Computer Architecture •

Efficient Implementation of Generalized Dense Symmetric Eigenproblem Standardization Algorithm on GPU Cluster

LIU Shi-fang1,2, ZHAO Yong-hua1, YU Tian-yu1,2, HUANG Rong-feng1,2   

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100080, China;
  2. University of Chinese Academy of Sciences, Beijing 100080, China
  • Received: 2019-10-08  Online: 2020-04-15  Published: 2020-04-15
  • Contact: ZHAO Yong-hua, born in 1966, Ph.D, professor, Ph.D supervisor. His main research interests include parallel algorithms and software, parallel programming models, and spectral clustering data analysis.
  • About author: LIU Shi-fang, born in 1994, Ph.D. Her main research interests include the implementation and optimization of numerical parallel algorithm libraries on heterogeneous parallel machines.
  • Supported by:
    This work was supported by the National Key Research and Development Program of China (2017YFB02022) and the Strategic Priority Research Program of Chinese Academy of Sciences (XDC01040000).

Abstract: Solving the generalized dense symmetric eigenproblem is a central task in many fields of applied science and engineering, and is an important step in computations arising in electromagnetics, electronic structure, finite element models and quantum chemistry. Transforming the generalized symmetric eigenproblem into a standard symmetric eigenproblem is a key computational step in its solution. For GPU clusters, a blocked standardization algorithm for the generalized dense symmetric eigenproblem was presented based on MPI+CUDA. To adapt to the architecture of the GPU cluster, the standardization algorithm presented in this paper combines the Cholesky decomposition of the positive definite matrix with the traditional blocked standardization algorithm, which reduces unnecessary communication overhead in the standardization algorithm and increases its parallelism. Moreover, in the MPI+CUDA based standardization algorithm, data transfers between the GPU and the CPU are overlapped with data copy operations inside the GPU, hiding the time spent on copying and thereby improving the performance of the program. At the same time, a fully parallel point-to-point transposition algorithm between the row communication domain and the column communication domain of the two-dimensional communication grid was presented. In addition, a parallel blocked algorithm based on MPI+CUDA for solving the triangular matrix equation BX=A with multiple right-hand sides was also given. Experiments were run on the "Era" supercomputer system of the Computer Network Information Center, Chinese Academy of Sciences, where each compute node is configured with 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors; matrices of different sizes were tested using up to 32 GPUs. The MPI+CUDA implementation of the generalized symmetric eigenproblem standardization algorithm achieves good acceleration and performance, and shows good scalability. When tested with a 50000×50000 matrix on 32 GPUs, the peak performance reaches approximately 9.21 Tflops.
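The standardization step described in the abstract reduces the generalized problem Ax = λBx (A symmetric, B symmetric positive definite) to a standard problem Cy = λy via the Cholesky factorization B = LL^T, with C = L^{-1}AL^{-T} and y = L^T x. The following is a minimal serial NumPy/SciPy sketch of this reduction — an illustration of the mathematics only, not the paper's distributed MPI+CUDA implementation; the random test matrices are invented for the example:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, eigvalsh

rng = np.random.default_rng(0)
n = 6
# Illustrative data: symmetric A and symmetric positive-definite B
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
M = rng.standard_normal((n, n)); B = M @ M.T + n * np.eye(n)

# Cholesky factorization B = L L^T (lower factor)
L = cholesky(B, lower=True)

# C = L^{-1} A L^{-T}, formed with two triangular solves
# rather than any explicit inverse
C = solve_triangular(L, A, lower=True)        # L^{-1} A
C = solve_triangular(L, C.T, lower=True).T    # (L^{-1} A) L^{-T}

# Eigenvalues of the standard problem match the generalized ones
lam_std = np.linalg.eigvalsh(C)
lam_gen = eigvalsh(A, B)   # reference: SciPy's generalized solver
```

In the paper's setting, the triangular solves and the symmetric update are the blocked kernels that are distributed over the two-dimensional process grid and offloaded to the GPUs.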
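The triangular matrix equation BX=A with multiple right-hand sides, mentioned in the abstract, is commonly solved with a blocked forward substitution: a small triangular solve on each diagonal block followed by a matrix-multiply update of the remaining rows. The sketch below shows that serial blocked pattern in NumPy/SciPy under the assumption that B is lower triangular; the function name, block size, and test data are illustrative, and the paper's actual algorithm distributes these kernels across GPUs with MPI+CUDA:

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lower_solve(L, A, nb=64):
    """Solve L X = A for X, with L lower triangular, by row blocks.

    Each step does a small triangular solve on the diagonal block,
    then a GEMM-style update of the trailing rows -- the pattern
    that maps onto trsm/gemm kernels on a GPU.
    """
    n = L.shape[0]
    X = A.astype(float).copy()
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Triangular solve on the diagonal block
        X[k:e] = solve_triangular(L[k:e, k:e], X[k:e], lower=True)
        # Update the remaining right-hand-side rows
        if e < n:
            X[e:] -= L[e:, k:e] @ X[k:e]
    return X

# Illustrative test: well-conditioned lower-triangular system
rng = np.random.default_rng(1)
n = 200
L = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
A = rng.standard_normal((n, 5))
X = blocked_lower_solve(L, A, nb=32)
```

The block size nb trades off triangular-solve latency against the efficiency of the large matrix-multiply updates, which is why a blocked formulation suits GPU hardware.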

Key words: Cholesky decomposition, Generalized symmetric eigenproblem standardization blocked algorithm, GPU cluster, Transpose algorithm, Triangular matrix equations

CLC Number: TP301
[1]KENT P C.Computational challenges of large-scale,long-time,first-principles molecular dynamics[J].Journal of Physics:Conference Series,2008,125:012058.
[2]AUCKENTHALER T,BLUM V,BUNGARTZ H J,et al.Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations[J].Parallel Computing,2011,37(12):783-794.
[3]SINGH D J.Introduction to the LAPW method[M]//Planewaves,Pseudopotentials and the LAPW Method.Boston,MA:Springer US,1994:35-43.
[4]CHEN X S,VONG S W,LI W,et al.Noda iterations for generalized eigenproblems following Perron-Frobenius theory[J].Numerical Algorithms,2019,80(3):937-955.
[5]CUCURINGU M,DAVIES P,GLIELMO A,et al.SPONGE:A generalized eigenproblem for clustering signed networks[J].arXiv:1904.08575,2019.
[6]BUTKOVIČ P,JONES D.On special cases of the generalized max-plus eigenproblem[J].SIAM Journal on Matrix Analysis and Applications,2016,37(3):1002-1021.
[7]BLACKFORD L S,CHOI J,CLEARY A,et al.ScaLAPACK users’ guide[M].Philadelphia:SIAM,1997.
[8]ALPATOV P,BAKER G,EDWARDS C,et al.PLAPACK Parallel Linear Algebra Package Design Overview[C]//Proceedings of the 1997 ACM/IEEE Conference on Supercomputing.IEEE,1997:29-29.
[9]BAKER G,GUNNELS J,MORROW G,et al.PLAPACK:highperformance through high-level abstraction[C]// 1998 International Conference on Parallel Processing (Cat.No.98EX205).IEEE,1998.
[10]IMAMURA T,YAMADA S,MACHIDA M.Development of a high-performance eigensolver on a peta-scale next-generation supercomputer system[J].Progress in Nuclear Science and Technology,2011,2:643-650.
[11]POULSON J,MARKER B,GEIJN R A V D,et al.Elemental:A new framework for distributed memory dense matrix computations[J].ACM Transactions on Mathematical Software (TOMS),2013,39(2):13.
[12]PETSCHOW M,PEISE E,BIENTINESI P.High-performance solvers for dense Hermitian eigenproblems[J].SIAM Journal on Scientific Computing,2013,35(1):C1-C22.
[13]HENDRICKSON B,JESSUP E,SMITH C.Toward an efficient parallel eigensolver for dense symmetric matrices[J].SIAM Journal on Scientific Computing,1998,20(3):1132-1154.
[14]TOMOV S,NATH R,DU P,et al.MAGMA Users’ Guide[EB/OL].ICL,UTK (November 2009).https://icl.cs.utk.edu/projectsfiles/magma/doxygen/.
[15]HAIDAR A,SOLCÀ R,GATES M,et al.Leading edge hybrid multi-GPU algorithms for generalized eigenproblems in electronic structure calculations[M]//Lecture Notes in Computer Science.Berlin:Springer,2013:67-80.
[16]ZHAO Y H,CHI X B,CHENG Q.Efficient parallel blocked algorithms for generalized Hermitian eigenproblem[J].Journal of Computer Research and Development,2007,44(10):1724-1732.
[17]LICHTENSTEIN W,JOHNSSON S L.Block-cyclic dense linear algebra[J].SIAM Journal on Scientific Computing,1993,14(6):1259-1288.
[18]BOLEN J,DAVIS A,DAZEY B,et al.Massively parallel distributed computing:World's first 281 gigaflop supercomputer[EB/OL].Proceedings of the Intel Supercomputer Users Group.https://pdfs.semanticscholar.org/52ed/b836555ead0769708968adbb58a64120d34b.pdf.
[19]BISSELING R H,VAN DE VORST J G G.Parallel LU decomposition on a transputer network[C]//Conference Organized by Koninklijke/Shell-Laboratory,Amsterdam.Berlin:Springer,1988:61-77.