Computer Science (计算机科学) ›› 2024, Vol. 51 ›› Issue (9): 15-22. doi: 10.11896/jsjkx.231000204
郭帅哲, 高建花, 计卫星
GUO Shuaizhe, GAO Jianhua, JI Weixing
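Abstract: The generalized minimum residual method (GMRES) is an iterative method for solving sparse linear systems and is widely used in scientific and engineering computing. The explosive growth of data has rapidly increased the size of the problems that GMRES must solve. To support large-scale problems, researchers have proposed distributed GMRES algorithms for clusters. In most existing clusters, however, inter-node network performance still lags far behind the high-speed GPU interconnect within a node, which limits the performance of distributed GMRES. For distributed GMRES on GPU clusters, this paper proposes a mixed-precision acceleration method that uses low-precision floating-point representations to significantly reduce the time spent on communication. It also proposes a precision-control algorithm for data transfer that dynamically and adaptively adjusts the precision of the transmitted data, so that the iterative algorithm retains the best possible solution quality. Experimental results show that the proposed mixed-precision optimization achieves an average speedup of 2.4x, and an average speedup of 7.6x when combined with other optimizations.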
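The abstract does not spell out the precision-control rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a hypothetical policy: before a vector is sent, try the narrowest floating-point format first and fall back to a wider one whenever the relative rounding error exceeds a tolerance. The names `FORMATS`, `round_trip`, `relative_error`, `choose_precision`, and the `tolerance` parameter are all illustrative, and Python's `struct` module stands in for the actual GPU-to-GPU transfer.

```python
import math
import struct

# Hypothetical format table: (name, struct format code), narrowest first.
FORMATS = [("fp16", "e"), ("fp32", "f"), ("fp64", "d")]


def round_trip(values, code):
    """Pack values at the given precision and unpack them again.

    This emulates what a receiver would see after a low-precision transfer.
    Returns the received values and the payload size in bytes.
    """
    fmt = f"<{len(values)}{code}"
    payload = struct.pack(fmt, *values)
    return list(struct.unpack(fmt, payload)), len(payload)


def relative_error(exact, approx):
    """Relative 2-norm error between the exact and the received vector."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(exact, approx)))
    den = math.sqrt(sum(a * a for a in exact))
    return num / den if den > 0 else 0.0


def choose_precision(values, tolerance):
    """Pick the narrowest format whose rounding error stays within tolerance.

    Assumed policy, for illustration only: formats that overflow are skipped,
    and fp64 is used when nothing narrower is accurate enough.
    """
    for name, code in FORMATS:
        try:
            received, nbytes = round_trip(values, code)
        except (OverflowError, struct.error):
            continue  # value out of range for this format
        if relative_error(values, received) <= tolerance:
            return name, nbytes
    return "fp64", 8 * len(values)
```

In an iterative solver, a rule of this kind could loosen the tolerance while the residual is still large and tighten it as the iteration converges; that scheduling is likewise an assumption here, since the abstract only states that the precision is adjusted dynamically.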