基于混合精度的分布式GMRES算法优化

doi:10.11896/jsjkx.231000204

Abstract

Abstract: The generalized minimum residual(GMRES) method is an iterative method for solving sparse linear systems.It is broadly used in many areas like scientific and engineering computing.The exponential data growth makes the scale of problems solved by the GMRES algorithm expand rapidly.To support the solving of large-scale problems,researchers have implemented distributed GMRES algorithm on clusters.However,the current inter-node network still significantly lags behind intra-node fa-brics in terms of both bandwidth and latency,which greatly limits the performance of the distributed GMRES algorithm.This paper proposes a mixed-precision approach for optimizing the GMRES algorithm on GPU clusters,where the data transferred is represented in a low-precision format,the network traffic during inter-GPU communication is significantly reduced.In addition,this paper proposes a balancing algorithm that dynamically adjusts the precision of the data transferred to achieve the satisfied resi-dual.Experimental results show that the proposed method achieves an average speedup of 2.4×,and a further average speedup of 7.6× when combined with other optimizations.

Key words: Generalized minimum residual, Mixed precision, GPU cluster, Distributed system

CLC Number:

TP311

GUO Shuaizhe, GAO Jianhua, JI Weixing. Optimizing Distributed GMRES Algorithm with Mixed Precision[J].Computer Science, 2024, 51(9): 15-22.

References

[1]SAAD Y,SCHULTZ M H.GMRES:A generalized minimal residual algorithm for solving nonsymmetric linear systems[J].Society for Industrial and Applied Mathematics,1986,7(3):856-869.
[2]DAVIS T A,HU Y F.The University of Florida Sparse Matrix Collection[J].ACM Transactions on Mathematical Software,2011,38(1):1-25.
[3]Top500.June 2023 List[EB/OL].[2023-10-01].https://top500.org/lists/top500/2023/06/.
[4]NVIDIA.NVLink&NVSwitch[EB/OL].[2023-10-01].https://www.nvidia.com/en-us/data-center/nvlink/.
[5]KHODJA L Z,COUTURIER R,GIERSCH A,et al.Parallelsparse linear solver with GMRES method using minimization techniques of communications for GPU clusters[J].The Journal of Supercomputing,2014,69:200-224.
[6]ROCm Software Platform.rocALUTION[EB/OL].[2023-10-01].https://github.com/ROCmSoftwarePlatform/rocALUTION.
[7]NVIDIA Blog.TensorFloat-32 in the A100 GPU Accelerates AI Training,HPC up to 20x[EB/OL].(2020-05-14)[2023-10-01].https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/.
[8]Intel.BFLOAT16- hardware numerics definition[EB/OL].(2018-11)[2023-10-01].https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf.
[9]MICIKEVICIUS P,STOSIC D,BURGESS Net al.FP8 Formats for Deep Learning[J].arXiv:2209.05433v1,2022.
[10]IOANNIDIS E I,CHEIMARIOS N,SPYROPOULOS A N,et al.On the performance of various parallel GMRES implementations on CPU and GPU clusters[J].arXiv:1906.04051,2019.
[11]YAMAZAKI I,RAJAMANICKAM S,BOMAN E G,et al.Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster[C]//Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis.New Orleans(SC 14).LA,USA,2014:933-944.
[12]BAHI J M,COUTURIER R,KHODJA L Z,et al.ParallelGMRES implementation for solving sparse linear systems on GPU clusters[C]//Proceedings of the 19th High Performance Computing Symposia(HPC '11).2011.
[13]MATSUMOTO K,IDOMURA Y,INAT,et al.Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster[J].The Journal of Supercomputing,2019,75:8115-8146.
[14]HE K,TAN S X,ZHAO H Y,et al.Parallel GMRES solver forfast analysis of large linear dynamic systems on GPU platforms[J].Integration,2016,52:10-22.
[15]LACOSTE X.Scheduling and memory optimizations for sparse direct solver on multi-core/multi-gpu duster systems[C]//Distributed,Parallel,and Cluster Computing.2015.
[16]LINDQUIST N,LUSZCZEK P,DONGARRA J,et al.Accelerating Restarted GMRES With Mixed Precision Arithmetic[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(4):1027-1037.
[17]BOUCHARD A,PARENTEAU M,LAURENDEAU É.Towarda Multi-GPU Implementation of a GMRES Solver in CHAMPS[C]//The 8th Annual Chapel Implementers and Users Workshop.2021.
[18]MA W P,HU Y W,YUANW,et al.Developing a Multi-GPU-Enabled Preconditioned GMRES with Inexact Triangular Solves for Block Sparse Matrices[J].Mathematical Problems in Engineering:Theory,Methods and Applications,2021,2021(Pt.9):6804723.1-6804723.17.
[19]DEVRIES B,IANNELLI J,TREFFTZ C,et al.Parallel Implementations of FGMRES for Solving Large,Sparse Non-symme-tric Linear Systems[J].Procedia Computer Science,2013,18:491-500.
[20]ZHANG J,DENG L,LI RT,et al.Achieving high performance and portable parallel GMRES algorithm for compressible flow simulations on unstructured grids[J].The Journal of Supercomputing,2023,79:20116-20140.

Related Articles 15

[1]	ZHENG Xu, FAN Hongjie, LIU Junfei. Pre-allocated Capacity Quota Limiting System Based on Microservice [J]. Computer Science, 2024, 51(6): 346-353.
[2]	FENG Liao-liao, DING Yan, LIU Kun-lin, MA Ke-lin, CHANG Jun-sheng. Research Advance on BFT Consensus Algorithms [J]. Computer Science, 2022, 49(4): 329-339.
[3]	TAN Shuang-jie, LIN Bao-jun, LIU Ying-chun, ZHAO Shuai. Load Scheduling Algorithm for Distributed On-board RTs System Based on Machine Learning [J]. Computer Science, 2022, 49(2): 336-341.
[4]	LU Yong-chao, WANG Bin-yi, HU Jiang-feng, MU Yang, REN Jun-long. Research on Integrated Electronic Time Synchronization Technology [J]. Computer Science, 2021, 48(6A): 629-632.
[5]	LIU Shi-fang, ZHAO Yong-hua, YU Tian-yu, HUANG Rong-feng. Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster [J]. Computer Science, 2020, 47(4): 6-12.
[6]	FAN Jian-feng, LI Yi, WU Wen-yuan, FENG Yong. Double Blockchain Based Station Dynamic Loop Information Monitoring System [J]. Computer Science, 2019, 46(12): 155-164.
[7]	LI Yan, MA Jun-ming, AN Bo, CAO Dong-gang. Web Based Lightweight Tool for Big Data Processing and Visualization [J]. Computer Science, 2018, 45(9): 60-64.
[8]	MA Liang-li and LIU Qing. Researches of Redundancy Coding Technologies on Reducing Reconstruction Data Amount [J]. Computer Science, 2017, 44(Z6): 463-469.
[9]	HUANG Yong WU Jin-zhao. Security Model for Distributed System Based on Seal Calculus [J]. Computer Science, 2015, 42(7): 178-181.
[10]	PAN Zhong-qiang and CHANG Xin-feng. Efficient Key Management Scheme for Data-centric Storage Wireless Sensor Networks [J]. Computer Science, 2014, 41(Z11): 277-281.
[11]	WANG Yu,ZHAO Yue-long and HOU Fang. Minimum Redundancy Storage Regeneration Code Research MSRRC Based on Matrix Operation [J]. Computer Science, 2014, 41(Z11): 191-194.
[12]	REN Guo-chao,WANG Jiang and MA Xiao-xing. ConUp:SCA Middleware with Dynamic Component Updating Support [J]. Computer Science, 2014, 41(9): 60-62.
[13]	HE Pan, YUAN Yue and WU Kai-gui. Optimal Multi-objective Monitoring Resources Allocation in Distributed Systems [J]. Computer Science, 2014, 41(5): 64-67.
[14]	ZHANG Ya-jun,LI Zhou-jun,LIAO Xiang-ke,JIANG Rui-cheng and LI Hai-feng. Survey of Automated Whitebox Fuzz Testing [J]. Computer Science, 2014, 41(2): 7-10.
[15]	HUANG Xiang and CHEN Zhi-gang. Performance Modeling Approach for Resource Sensitive Distributed Systems [J]. Computer Science, 2013, 40(9): 174-181.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

Optimizing Distributed GMRES Algorithm with Mixed Precision

PDF (PC)

Abstract

Cite this article

share this article

References

Related Articles 15

Metrics

Comments

Recommended 0