Computer Science ›› 2024, Vol. 51 ›› Issue (9): 15-22.doi: 10.11896/jsjkx.231000204

• High Performance Computing • Previous Articles     Next Articles

Optimizing Distributed GMRES Algorithm with Mixed Precision

GUO Shuaizhe, GAO Jianhua, JI Weixing   

  1. School of Computer Science and Technology,Beijing Institute of Technology,Beijing 100081,China
  • Received:2023-10-30 Revised:2024-03-29 Online:2024-09-15 Published:2024-09-10
  • About author:GUO Shuaizhe,born in 2000,postgra-duate,is a student member of CCF(No.P4392G).His main research interests include heterogeneous computing and performance optimization.
    GAO Jianhua,born in 1995,Ph.D,postdoctoral researcher,is a professional member of CCF(No.79759M).Her main research interests include parallel and high performance computing.

Abstract: The generalized minimum residual(GMRES) method is an iterative method for solving sparse linear systems.It is broadly used in many areas like scientific and engineering computing.The exponential data growth makes the scale of problems solved by the GMRES algorithm expand rapidly.To support the solving of large-scale problems,researchers have implemented distributed GMRES algorithm on clusters.However,the current inter-node network still significantly lags behind intra-node fa-brics in terms of both bandwidth and latency,which greatly limits the performance of the distributed GMRES algorithm.This paper proposes a mixed-precision approach for optimizing the GMRES algorithm on GPU clusters,where the data transferred is represented in a low-precision format,the network traffic during inter-GPU communication is significantly reduced.In addition,this paper proposes a balancing algorithm that dynamically adjusts the precision of the data transferred to achieve the satisfied resi-dual.Experimental results show that the proposed method achieves an average speedup of 2.4×,and a further average speedup of 7.6× when combined with other optimizations.

Key words: Generalized minimum residual, Mixed precision, GPU cluster, Distributed system

CLC Number: 

  • TP311
[1]SAAD Y,SCHULTZ M H.GMRES:A generalized minimal residual algorithm for solving nonsymmetric linear systems[J].Society for Industrial and Applied Mathematics,1986,7(3):856-869.
[2]DAVIS T A,HU Y F.The University of Florida Sparse Matrix Collection[J].ACM Transactions on Mathematical Software,2011,38(1):1-25.
[3]Top500.June 2023 List[EB/OL].[2023-10-01].https://top500.org/lists/top500/2023/06/.
[4]NVIDIA.NVLink&NVSwitch[EB/OL].[2023-10-01].https://www.nvidia.com/en-us/data-center/nvlink/.
[5]KHODJA L Z,COUTURIER R,GIERSCH A,et al.Parallelsparse linear solver with GMRES method using minimization techniques of communications for GPU clusters[J].The Journal of Supercomputing,2014,69:200-224.
[6]ROCm Software Platform.rocALUTION[EB/OL].[2023-10-01].https://github.com/ROCmSoftwarePlatform/rocALUTION.
[7]NVIDIA Blog.TensorFloat-32 in the A100 GPU Accelerates AI Training,HPC up to 20x[EB/OL].(2020-05-14)[2023-10-01].https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/.
[8]Intel.BFLOAT16- hardware numerics definition[EB/OL].(2018-11)[2023-10-01].https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf.
[9]MICIKEVICIUS P,STOSIC D,BURGESS Net al.FP8 Formats for Deep Learning[J].arXiv:2209.05433v1,2022.
[10]IOANNIDIS E I,CHEIMARIOS N,SPYROPOULOS A N,et al.On the performance of various parallel GMRES implementations on CPU and GPU clusters[J].arXiv:1906.04051,2019.
[11]YAMAZAKI I,RAJAMANICKAM S,BOMAN E G,et al.Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster[C]//Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis.New Orleans(SC 14).LA,USA,2014:933-944.
[12]BAHI J M,COUTURIER R,KHODJA L Z,et al.ParallelGMRES implementation for solving sparse linear systems on GPU clusters[C]//Proceedings of the 19th High Performance Computing Symposia(HPC '11).2011.
[13]MATSUMOTO K,IDOMURA Y,INAT,et al.Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster[J].The Journal of Supercomputing,2019,75:8115-8146.
[14]HE K,TAN S X,ZHAO H Y,et al.Parallel GMRES solver forfast analysis of large linear dynamic systems on GPU platforms[J].Integration,2016,52:10-22.
[15]LACOSTE X.Scheduling and memory optimizations for sparse direct solver on multi-core/multi-gpu duster systems[C]//Distributed,Parallel,and Cluster Computing.2015.
[16]LINDQUIST N,LUSZCZEK P,DONGARRA J,et al.Accelerating Restarted GMRES With Mixed Precision Arithmetic[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(4):1027-1037.
[17]BOUCHARD A,PARENTEAU M,LAURENDEAU É.Towarda Multi-GPU Implementation of a GMRES Solver in CHAMPS[C]//The 8th Annual Chapel Implementers and Users Workshop.2021.
[18]MA W P,HU Y W,YUANW,et al.Developing a Multi-GPU-Enabled Preconditioned GMRES with Inexact Triangular Solves for Block Sparse Matrices[J].Mathematical Problems in Engineering:Theory,Methods and Applications,2021,2021(Pt.9):6804723.1-6804723.17.
[19]DEVRIES B,IANNELLI J,TREFFTZ C,et al.Parallel Implementations of FGMRES for Solving Large,Sparse Non-symme-tric Linear Systems[J].Procedia Computer Science,2013,18:491-500.
[20]ZHANG J,DENG L,LI RT,et al.Achieving high performance and portable parallel GMRES algorithm for compressible flow simulations on unstructured grids[J].The Journal of Supercomputing,2023,79:20116-20140.
[1] ZHENG Xu, FAN Hongjie, LIU Junfei. Pre-allocated Capacity Quota Limiting System Based on Microservice [J]. Computer Science, 2024, 51(6): 346-353.
[2] FENG Liao-liao, DING Yan, LIU Kun-lin, MA Ke-lin, CHANG Jun-sheng. Research Advance on BFT Consensus Algorithms [J]. Computer Science, 2022, 49(4): 329-339.
[3] TAN Shuang-jie, LIN Bao-jun, LIU Ying-chun, ZHAO Shuai. Load Scheduling Algorithm for Distributed On-board RTs System Based on Machine Learning [J]. Computer Science, 2022, 49(2): 336-341.
[4] LU Yong-chao, WANG Bin-yi, HU Jiang-feng, MU Yang, REN Jun-long. Research on Integrated Electronic Time Synchronization Technology [J]. Computer Science, 2021, 48(6A): 629-632.
[5] LIU Shi-fang, ZHAO Yong-hua, YU Tian-yu, HUANG Rong-feng. Efficient Implementation of Generalized Dense Symmetric Eigenproblem StandardizationAlgorithm on GPU Cluster [J]. Computer Science, 2020, 47(4): 6-12.
[6] FAN Jian-feng, LI Yi, WU Wen-yuan, FENG Yong. Double Blockchain Based Station Dynamic Loop Information Monitoring System [J]. Computer Science, 2019, 46(12): 155-164.
[7] LI Yan, MA Jun-ming, AN Bo, CAO Dong-gang. Web Based Lightweight Tool for Big Data Processing and Visualization [J]. Computer Science, 2018, 45(9): 60-64.
[8] MA Liang-li and LIU Qing. Researches of Redundancy Coding Technologies on Reducing Reconstruction Data Amount [J]. Computer Science, 2017, 44(Z6): 463-469.
[9] HUANG Yong WU Jin-zhao. Security Model for Distributed System Based on Seal Calculus [J]. Computer Science, 2015, 42(7): 178-181.
[10] PAN Zhong-qiang and CHANG Xin-feng. Efficient Key Management Scheme for Data-centric Storage Wireless Sensor Networks [J]. Computer Science, 2014, 41(Z11): 277-281.
[11] WANG Yu,ZHAO Yue-long and HOU Fang. Minimum Redundancy Storage Regeneration Code Research MSRRC Based on Matrix Operation [J]. Computer Science, 2014, 41(Z11): 191-194.
[12] REN Guo-chao,WANG Jiang and MA Xiao-xing. ConUp:SCA Middleware with Dynamic Component Updating Support [J]. Computer Science, 2014, 41(9): 60-62.
[13] HE Pan, YUAN Yue and WU Kai-gui. Optimal Multi-objective Monitoring Resources Allocation in Distributed Systems [J]. Computer Science, 2014, 41(5): 64-67.
[14] ZHANG Ya-jun,LI Zhou-jun,LIAO Xiang-ke,JIANG Rui-cheng and LI Hai-feng. Survey of Automated Whitebox Fuzz Testing [J]. Computer Science, 2014, 41(2): 7-10.
[15] HUANG Xiang and CHEN Zhi-gang. Performance Modeling Approach for Resource Sensitive Distributed Systems [J]. Computer Science, 2013, 40(9): 174-181.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!