基于混合精度的分布式GMRES算法优化

doi:10.11896/jsjkx.231000204

计算机科学 ›› 2024, Vol. 51 ›› Issue (9): 15-22.doi: 10.11896/jsjkx.231000204

基于混合精度的分布式GMRES算法优化

郭帅哲, 高建花, 计卫星

北京理工大学计算机学院北京 100081

收稿日期:2023-10-30 修回日期:2024-03-29 出版日期:2024-09-15 发布日期:2024-09-10
通讯作者: 高建花(gjh@bit.edu.cn)
作者简介:(chakra_guo@bit.edu.cn)

Optimizing Distributed GMRES Algorithm with Mixed Precision

GUO Shuaizhe, GAO Jianhua, JI Weixing

School of Computer Science and Technology,Beijing Institute of Technology,Beijing 100081,China

Received:2023-10-30 Revised:2024-03-29 Online:2024-09-15 Published:2024-09-10
About author:GUO Shuaizhe,born in 2000,postgra-duate,is a student member of CCF(No.P4392G).His main research interests include heterogeneous computing and performance optimization.
GAO Jianhua,born in 1995,Ph.D,postdoctoral researcher,is a professional member of CCF(No.79759M).Her main research interests include parallel and high performance computing.

摘要/Abstract

摘要： 广义最小残差法(Generalized Minimum Residual,GMRES)是一种求解稀疏线性系统的迭代方法,被广泛应用于科学与工程计算等领域。数据量的爆炸式增长,使得GMRES算法求解的问题规模快速膨胀。为了支持大规模问题的求解,研究人员提出了面向集群的分布式GMRES算法。然而在现有的大多数集群中,节点间的网络性能仍与节点内的GPU高速互联网络存在较大差距,限制了分布式GMRES算法的性能。针对GPU集群上的分布式GMRES算法,提出了一种基于混合精度的加速求解方法,使用低精度浮点表示,显著降低了通信过程的时间开销。此外,提出了一种数据传输的精度调控算法,动态自适应调整传输数据的精度,以保证迭代算法最佳的求解效果。实验结果表明,所提基于混合精度的优化方法可实现平均2.4倍的加速比,结合其他优化方法后可实现平均7.6倍的加速比。

关键词: 广义最小残差法, 混合精度, GPU集群, 分布式系统

Abstract: The generalized minimum residual(GMRES) method is an iterative method for solving sparse linear systems.It is broadly used in many areas like scientific and engineering computing.The exponential data growth makes the scale of problems solved by the GMRES algorithm expand rapidly.To support the solving of large-scale problems,researchers have implemented distributed GMRES algorithm on clusters.However,the current inter-node network still significantly lags behind intra-node fa-brics in terms of both bandwidth and latency,which greatly limits the performance of the distributed GMRES algorithm.This paper proposes a mixed-precision approach for optimizing the GMRES algorithm on GPU clusters,where the data transferred is represented in a low-precision format,the network traffic during inter-GPU communication is significantly reduced.In addition,this paper proposes a balancing algorithm that dynamically adjusts the precision of the data transferred to achieve the satisfied resi-dual.Experimental results show that the proposed method achieves an average speedup of 2.4×,and a further average speedup of 7.6× when combined with other optimizations.

Key words: Generalized minimum residual, Mixed precision, GPU cluster, Distributed system

中图分类号:

TP311

郭帅哲, 高建花, 计卫星. 基于混合精度的分布式GMRES算法优化[J]. 计算机科学, 2024, 51(9): 15-22. https://doi.org/10.11896/jsjkx.231000204

GUO Shuaizhe, GAO Jianhua, JI Weixing. Optimizing Distributed GMRES Algorithm with Mixed Precision[J]. Computer Science, 2024, 51(9): 15-22. https://doi.org/10.11896/jsjkx.231000204

参考文献

[1]SAAD Y,SCHULTZ M H.GMRES:A generalized minimal residual algorithm for solving nonsymmetric linear systems[J].Society for Industrial and Applied Mathematics,1986,7(3):856-869.
[2]DAVIS T A,HU Y F.The University of Florida Sparse Matrix Collection[J].ACM Transactions on Mathematical Software,2011,38(1):1-25.
[3]Top500.June 2023 List[EB/OL].[2023-10-01].https://top500.org/lists/top500/2023/06/.
[4]NVIDIA.NVLink&NVSwitch[EB/OL].[2023-10-01].https://www.nvidia.com/en-us/data-center/nvlink/.
[5]KHODJA L Z,COUTURIER R,GIERSCH A,et al.Parallelsparse linear solver with GMRES method using minimization techniques of communications for GPU clusters[J].The Journal of Supercomputing,2014,69:200-224.
[6]ROCm Software Platform.rocALUTION[EB/OL].[2023-10-01].https://github.com/ROCmSoftwarePlatform/rocALUTION.
[7]NVIDIA Blog.TensorFloat-32 in the A100 GPU Accelerates AI Training,HPC up to 20x[EB/OL].(2020-05-14)[2023-10-01].https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/.
[8]Intel.BFLOAT16- hardware numerics definition[EB/OL].(2018-11)[2023-10-01].https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf.
[9]MICIKEVICIUS P,STOSIC D,BURGESS Net al.FP8 Formats for Deep Learning[J].arXiv:2209.05433v1,2022.
[10]IOANNIDIS E I,CHEIMARIOS N,SPYROPOULOS A N,et al.On the performance of various parallel GMRES implementations on CPU and GPU clusters[J].arXiv:1906.04051,2019.
[11]YAMAZAKI I,RAJAMANICKAM S,BOMAN E G,et al.Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster[C]//Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis.New Orleans(SC 14).LA,USA,2014:933-944.
[12]BAHI J M,COUTURIER R,KHODJA L Z,et al.ParallelGMRES implementation for solving sparse linear systems on GPU clusters[C]//Proceedings of the 19th High Performance Computing Symposia(HPC '11).2011.
[13]MATSUMOTO K,IDOMURA Y,INAT,et al.Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster[J].The Journal of Supercomputing,2019,75:8115-8146.
[14]HE K,TAN S X,ZHAO H Y,et al.Parallel GMRES solver forfast analysis of large linear dynamic systems on GPU platforms[J].Integration,2016,52:10-22.
[15]LACOSTE X.Scheduling and memory optimizations for sparse direct solver on multi-core/multi-gpu duster systems[C]//Distributed,Parallel,and Cluster Computing.2015.
[16]LINDQUIST N,LUSZCZEK P,DONGARRA J,et al.Accelerating Restarted GMRES With Mixed Precision Arithmetic[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(4):1027-1037.
[17]BOUCHARD A,PARENTEAU M,LAURENDEAU É.Towarda Multi-GPU Implementation of a GMRES Solver in CHAMPS[C]//The 8th Annual Chapel Implementers and Users Workshop.2021.
[18]MA W P,HU Y W,YUANW,et al.Developing a Multi-GPU-Enabled Preconditioned GMRES with Inexact Triangular Solves for Block Sparse Matrices[J].Mathematical Problems in Engineering:Theory,Methods and Applications,2021,2021(Pt.9):6804723.1-6804723.17.
[19]DEVRIES B,IANNELLI J,TREFFTZ C,et al.Parallel Implementations of FGMRES for Solving Large,Sparse Non-symme-tric Linear Systems[J].Procedia Computer Science,2013,18:491-500.
[20]ZHANG J,DENG L,LI RT,et al.Achieving high performance and portable parallel GMRES algorithm for compressible flow simulations on unstructured grids[J].The Journal of Supercomputing,2023,79:20116-20140.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

基于混合精度的分布式GMRES算法优化

Optimizing Distributed GMRES Algorithm with Mixed Precision

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 0

Metrics

本文评价

推荐阅读 0