Computer Science ›› 2026, Vol. 53 ›› Issue (5): 354-366.doi: 10.11896/jsjkx.250500033

• Computer Architecture • Previous Articles     Next Articles

Optimizing Distributed GMRES Algorithm with Floating-point Vector Compression

CHEN Shutong1, GAO Jianhua2, JI Weixing2, LI Chunfeng1   

  1. 1 School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
    2 School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
  • Received:2025-05-12 Revised:2025-07-22 Published:2026-05-08
  • About author:CHEN Shutong,born in 2000,postgra-duate,is a student member of CCF(No.P4660G).His main research interest is high performance computing and applications.
    LI Chunfeng,born in 1995,Ph.D,is a member of CCF(No.P4285G).His main research interests include compu-ter architecture and high performance computing.
  • Supported by:
    National Science and Technology Major Project(2022ZD0116311).

Abstract: The generalized minimal residuals(GMRES) method is one of the mainstream iterative algorithms for solving sparse linear systems,widely applied in circuit simulation,computational fluid dynamics,autonomous driving,and other fields.To efficiently solve large-scale sparse linear systems,researchers have developed distributed GMRES algorithms that use inter-node communication and distributed computing to achieve parallel solutions for large-scale problems.However,the sparse matrix-vector multiplication(SpMV) computation in distributed GMRES introduces significant communication demands for floating-point vectors,making the communication overhead of floating-point vectors a major performance bottleneck.Based on the restarted GMRES,this paper proposes a distributed restart GMRES(DR-GMRES) algorithm tailored for CPU-GPU heterogeneous systems,incorporating a collaborative computing model designed to align with the algorithm’s characteristics and heterogeneous architecture.Furthermore,for the communication requirements of floating-point vectors in SpMV and the numerical properties of GMRES’s inner-outer iterations,a two-layer hybrid compression strategy is proposed.This strategy combines lossy and lossless hybrid compression with inner-outer iteration hybrid compression to minimize the communication overhead of DR-GMRES while preserving acceptable solving accuracy.Additionally,a multi-buffer compressed communication architecture is designed to implement asynchronous data compression efficiently.Experimental results on 16 large-scale sparse linear systems demonstrate that,compared to the original distributed GMRES,the DR-GMRES algorithm based on hybrid data compression achieves an average speedup of ×3.37(with a maximum of ×8.02) with negligible impact on computational accuracy.Actual simulations experimentally verify that the DR-GMRES algorithm based on hybrid data compression achieves a ×3.09 acceleration in OSQP solving time over the baseline.

Key words: Generalized minimal residuals, Distributed system, Data compression, Communication optimization

CLC Number: 

  • TP311
[1]NARDEAN S,FERRONATO M,ABUSHAIKHA A.LinearSolvers for Reservoir Simulation Problems:An Overview and Recent Developments[J].Archives of Computational Methods in Engineering,2022,29(6):4341-4378.
[2]BARTELS R H,GOLUB G H.The Simplex Method of Linear Programming Using LU Decomposition[J].Communications of the ACM,1969,12(5):266-268.
[3]LIESEN J,STRAKOS Z.Krylov Subspace Methods:Principles and Analysis[M].Oxford:Oxford University Press,2012:268-274.
[4]SAAD Y,SCHULTZ M H.GMRES:A Generalized MinimalResidual Algorithm for Solving Nonsymmetric Linear Systems[J].SIAM Journal on Scientific and Statistical Computing,1986,7(3):856-869.
[5]YAN Z,WANG X,CHEN R.Exploring the Impact of Adenoid Obstruction on the Human Nasal Airflow Through Parallel Computational Fluid Dynamics[J].Engineering Reports,2025,7(1):e12972.
[6]CAO X,CHEN M,QI Q,et al.An Improved GMRES Method for Solving Electromagnetic Scattering Problems by MoM[J].IEEE Transactions on Antennas and Propagation,2022,70(11):10751-10757.
[7]CHEN B,CHENG J,YU W.A Practical Randomized GMRESAlgorithm for Solving Linear Equation System in Circuit Simulation[C]//Proceedings of the 30th Asia and South Pacific Design Automation Conference.New York:ACM,2025:183-189.
[8]GAO J,JI W,WANG Y.Optimization of Large-Scale SparseMatrix-Vector Multiplication on Multi-GPU Systems[J].ACM Transactions on Architecture and Code Optimization,2024,21(4):1-24.
[9]ZHANG Y,GUO S,GAO J,et al.Load Balancing Optimizations for Distributed GMRES Algorithm[C]//International Confe-rence on Algorithms and Architectures for Parallel Processing.Singapore:Springer,2024:47-56.
[10]XU Z,ALONSO J J,DARVE E.A Numerically Stable Communication-Avoiding-Step GMRES Algorithm[J].SIAM Journal on Matrix Analysis and Applications,2024,45(4):2039-2074.
[11]LINDQUIST N,LUSZCZEK P,DONGARRA J.AcceleratingRestarted GMRES with Mixed Precision Arithmetic[J].IEEE Transactions on Parallel and Distributed Systems,2021,33(4):1027-1037.
[12]LIU J,WANG Y,GAO J,et al.pSpMV:Precision-based Sparse Matrix Partition and SpMV Optimization[J].CCF Transactions on High Performance Computing,2024,6(6):549-565.
[13]GUO S Z,GAO J H,JI W X.Optimizing Distributed GMRES Algorithm with Mixed Precision[J].Computer Science,2024,51(9):15-22.
[14]ALIAGA J I,ANZT H,GRUTZMACHER T,et al.Compressed Basis GMRES on High-Performance Graphics Processing units[J].The International Journal of High Performance Computing Applications,2023,37(2):82-100.
[15]ZHANG J,DENG L,LI R,et al.Achieving High Performance and Portable Parallel GMRES Algorithm for Compressible Flow Simulations on Unstructured grids[J].The Journal of Supercomputing,2023,79(17):20116-20140.
[16]ZIANE KHODJA L,COUTURIRE R,GIERSCH A,et al.Pa-rallel Sparse Linear Solver With GMRES Method Using Minimization Techniques of Communications for GPU Clusters[J].The Journal of Supercomputing,2014,69:200-224.
[17]YANG B,LIU H,CHEN Z.Preconditioned GMRES Solver on Multiple-GPU Architecture[J].Computers & Mathematics with Applications,2016,72(4):1076-1095.
[18]MATSUMOTO K,IDOMURA Y,INA T,et al.Implementation and Performance Evaluation of a Communication-Avoiding GMRES Method for Stencil-Based Code on GPU Cluster[J].The Journal of Supercomputing,2019,75:8115-8146.
[19]LIN S,XIE Z.A Jacobi_PCG Solver for Sparse Linear Systems on Multi-GPU Cluster[J].The Journal of Supercomputing,2017,73:433-454.
[20]GHYSELS P,ASHBY T J,MEERBERGEN K,et al.HidingGlobal Communication Latency in the GMRES Algorithm on Massively Parallel Machines[J].SIAM Journal on Scientific Computing,2013,35(1):C48-C71.
[21]YAMAZAKI I,HOEMMEN M,LUSZCZEK P,et al.Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives[C]//2017 IEEE International Parallel and Distributed Processing Symposium Workshops(IPDPSW).New York:IEEE,2017:1118-1127.
[22]LI C,TANG M,TONG R,et al.P-cloth:Interactive ComplexCloth Simulation on Multi-GPU Systems Using Dynamic Matrix Assembly And Pipelined Implicit Integrators[J].ACM Transactions on Graphics,2020,39(6):1-15.
[23]MA X.An Overview of Recent Developments and Applications of the GMRES Method[J].Pure Math,2013,3:181-187.
[24]HUANG Y,DI S,YU X,et al.Cuszp:An ultra-fast Gpu Error-Bounded Lossy Compression Framework with Optimized end-to-end Performance[C]//Proceedings of the International Confe-rence for High Performance Computing,Networking,Storage and Analysis.New York:IEEE,2023:1-13.
[25]DAVIS T A,HU Y.The University of Florida Sparse Matrix Collection[J].ACM Transactions on Mathematical Software,2011,38(1):1-25.
[26]FENG M,ZHANG H.Application of Baidu Apollo Open Platform in a Course of Control Simulation Experiments[J].Computer Applications in Engineering Education,2022,30(3):892-906.
[27]STELLATO B,BANJAC G,GOULART P,et al.OSQP:AnOperator Splitting Solver for Quadratic Programs[J].Mathematical Programming Computation,2020,12(4):637-672.
[1] LI Hao, DING Lizhong, FU Jiarun, LINGHU Zhaohuan. Data Compression of Instruction Fine-tuning for Large Models:Refinement Based on Inference Contribution [J]. Computer Science, 2026, 53(3): 136-142.
[2] SONG Rirong, CHEN Qinwen, CHEN Xing. Distributed Automated Testing for Android Applications Based on Reinforcement Learning [J]. Computer Science, 2025, 52(12): 40-47.
[3] WANG Xin, CHEN Kun, SUN Lingyun. Research on Foundation Model Methods for Addressing Non-IID Issues in Federated Learning [J]. Computer Science, 2025, 52(12): 302-313.
[4] LIAO Rui, TANG Jie, LIANG Tongjia, ZHENG Xinlei, WANG Binyi, QI Zhiqiang. Label Data Compression Algorithms Based on BWT,MTF and ANS [J]. Computer Science, 2025, 52(11A): 241000081-6.
[5] GUO Shuaizhe, GAO Jianhua, JI Weixing. Optimizing Distributed GMRES Algorithm with Mixed Precision [J]. Computer Science, 2024, 51(9): 15-22.
[6] ZHENG Xu, FAN Hongjie, LIU Junfei. Pre-allocated Capacity Quota Limiting System Based on Microservice [J]. Computer Science, 2024, 51(6): 346-353.
[7] XU Jinlong, LI Pengfei, LI Jianan, CHEN Biaoyuan, GAO Wei, HAN Lin. Study on Distributed Training Optimization Based on Hybrid Parallel [J]. Computer Science, 2024, 51(12): 120-128.
[8] CHEN Xingtian, XIONG Xiaofu, BAI Yong, HU Haiyang. High Speed Data Compression Method of Merge Unit Based on SCD File [J]. Computer Science, 2023, 50(12): 123-129.
[9] FENG Liao-liao, DING Yan, LIU Kun-lin, MA Ke-lin, CHANG Jun-sheng. Research Advance on BFT Consensus Algorithms [J]. Computer Science, 2022, 49(4): 329-339.
[10] TAN Shuang-jie, LIN Bao-jun, LIU Ying-chun, ZHAO Shuai. Load Scheduling Algorithm for Distributed On-board RTs System Based on Machine Learning [J]. Computer Science, 2022, 49(2): 336-341.
[11] LU Yong-chao, WANG Bin-yi, HU Jiang-feng, MU Yang, REN Jun-long. Research on Integrated Electronic Time Synchronization Technology [J]. Computer Science, 2021, 48(6A): 629-632.
[12] FAN Jian-feng, LI Yi, WU Wen-yuan, FENG Yong. Double Blockchain Based Station Dynamic Loop Information Monitoring System [J]. Computer Science, 2019, 46(12): 155-164.
[13] LI Yan, MA Jun-ming, AN Bo, CAO Dong-gang. Web Based Lightweight Tool for Big Data Processing and Visualization [J]. Computer Science, 2018, 45(9): 60-64.
[14] MA Liang-li and LIU Qing. Researches of Redundancy Coding Technologies on Reducing Reconstruction Data Amount [J]. Computer Science, 2017, 44(Z6): 463-469.
[15] WANG Wei-ping and YANG Miao. USDZQ Optimization Based on Ant Colony Algorithm and Application in ECG Compression [J]. Computer Science, 2015, 42(Z11): 550-553.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!