Computer Science ›› 2026, Vol. 53 ›› Issue (6): 117-127.doi: 10.11896/jsjkx.251000118

• High Performance Computing • Previous Articles     Next Articles

GPU-based Implementation and Optimization of Banded Matrix LU Factorization

SUN Xiaoxue1,2, JIA Haipeng2, ZHANG Yunquan2, YU Yue1,2, QIN Pinle1   

  1. 1 School of Computer Science and Technology,North University of China,Taiyuan 030051,China
    2 Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
  • Received:2025-10-27 Revised:2025-12-11 Online:2026-06-15 Published:2026-06-09
  • About author:SUN Xiaoxue,born in 2000,postgra-duate,is a member of CCF(No.Q9485G).Her main research interests include parallel computing and high performance computing.
    JIA Haipeng,born in 1983,Ph.D,senior engineer,is a member of CCF(No.31889M).His main research interests include parallel computing and heterogeneous computing.
  • Supported by:
    National Key Research and Development Program of China(2023YFB3001701) and National Natural Science Foundation of China(61972376,62372432,62072431).

Abstract: Band matrices are a class of sparse matrices characterized by non-zero elements concentrated around the main diagonal,and they are widely used in scientific computing and engineering simulations.LU factorization of such matrices plays a vital role in solving large-scale linear systems,particularly in simulation workloads,physical modeling,and engineering optimization problems.However,traditional LU factorization algorithms often suffer from limited parallelism and inefficient memory access patterns,which restrict their performance on modern hardware platforms.To address these challenges,this paper proposes a high-efficiency implementation and optimization strategy for band matrix LU decomposition on domestic domain-specific computing units(DCUs).The proposed method introduces two key techniques:one is a right-looking structure combined with a block partitioning strategy to improve computational locality and parallel execution efficiency;and the other is a high-performance panel factorization algorithm that reduces kernel launch frequency and global memory access overhead.Experimental results across different matrix sizes and bandwidths demonstrate that the proposed GPU-based implementation consistently outperforms CPU-based LAPACK and Intel MKL solutions in terms of runtime and resource utilization,while preserving numerical stability.Empirical results show that this approach exhibits strong scalability and practical engineering value,providing theoretical foundations and practical gui-dance for the efficient implementation of banded-matrix methods on heterogeneous parallel computing platforms.

Key words: Banded matrices, LU factorization, Parallel computing, Fusion kernel, Performance optimization

CLC Number: 

  • TP301.6
[1]OWENS J D,LUEBKE D,GOVINDARAJU N,et al.A Survey of General-Purpose Computation on Graphics Hardware[J].Computer Graphics Forum,2007,26(1):80-113.
[2]MANAVSKI S A.CUDA Compatible GPU as an EfficientHardware Accelerator for AES Cryptography[C]//InternationalConference on Signal Processing and Communications.IEEE,2007:65-68.
[3]BLANCHARD P,HIGHAM N J.Mixed precision block fused multiply-add:error analysis and application to GPU tensor cores[J].SIAM Journal on Scientific Computing,2020,42(3):C124-C141.
[4]HUSBANDS P,YELICK K.Multi-Threading and One-SidedCommunication in Parallel LU Factorization[C]//Proceedings of the 2007 ACM/IEEE Conference on Supercomputing.ACM,2007:1-10.
[5]TOMOV S,NATH R,LTAIEF H,et al.Dense Linear Algebra Solvers for Multicore with GPU Accelerators[C]//Proceedings of the 2010 IEEE International Symposium on Parallel & Dis-tributed Processing,Workshops and PhD Forum(I-PDPSW).IEEE,2010:1-8.
[6]AGULLO E,AUGONNET C,DONGARRA J,et al.A Hybridi-zation Methodology for High-Performance Linear Algebra Software for GPUs[M]// GPU Computing Gems Jade Edition.Waltham:Morgan Kaufmann,2012:473-484.
[7]DONGARRA J,HAIDAR A,KURZAK J,et al.Model-DrivenOne-Sided Factorizations on Multicore Accelerated Systems[J].Supercomputing Frontiers and Innovations,2014,1(1):85-115.
[8]BADER M,BODE A,GERNDT M,et al.Locality Optimization on a NUMA Architecture for Hybrid LU Factorization[C]//Advances in Parallel Computing,25.IOS Press,2014:153-162.
[9]HAIDAR A,DONG T X,LUSZCZEK P,et al.Batched MatrixComputations on Hardware Accelerators Based on GPUs[J].The International Journal of High Performance Computing Applications,2015,29(2):193-208.
[10]DONG T X,HAIDAR A,LUSZCZEK P,et al.LU Factorization of Small Matrices:Accelerating Batched DGETRF on the GPU[C]//HPCC& CSS& ICESS.IEEE,2015:157-160.
[11]KWASNIEWSKI G,KABIC M,BEN-NUN T,et al.On the Pa-rallel I/O Optimality of Linear Algebra Kernels:Near-Optimal Matrix Factorizations[C]//Proceedings of the International Conference for High Performance Computing,Networking,Stora-ge and Analysis(SC'21).IEEE Computer Society,2021:1-14.
[12]AZZAM H,AHMAD A,MAWUSSI Z,et al.A Guide for Achieving High Performance with Very Small Matrices on GPU:A Case Study of Batched LU and Cholesky Factorizations[J].IEEE Transactions on Parallel and Distributed Systems,2018,29(5):973-984.
[13]CHOW E,PATEL A.Fine-Grained Parallel Incomplete LU Factorization[J].SIAM Journal on Scientific Computing,2015,37(2):C169-C193.
[14]SERRE F,PUSCHEL M.Generalizing Block LU Factorization:A Lower-Upper Block Triangular Decomposition with Minimal Off-Diagonal Ranks[J].Linear Algebra and its Applications,2016,509:114-142.
[15]LAWSON C,HANSON R J.Basic Linear Algebra Subprograms for Fortran Usage[J].ACM Transactions on Mathematical Software,1979,5(3):308-323.
[16]DONGARRA J J,DU CROZ J.An Extended Set of FortranBasic Linear Algebra Subprogram-s[J].ACM Transactions on Mathematical Software,1988,14(1):1-17.
[17]ASHCRAFT C,GRIMES R G,LEWIS J G.Accurate Symmetric Indefinite Linear Equation Solvers[J].SIAM Journal on Matrix Analysis and Applications,1998,20(2):513-561.
[18]GRIMES R G,LEWIS J G,SIMON H D.A Shifted Block Lanczos Algorithm for Solving Sparse Symmetric Generalized Eigenproblems[J].SIAM Journal on Matrix Analysis and Applications,1994,15(1):228-272.
[19]DU CROZ J,MAYES P,RADICATI G.Factorizations of Band Matrices Using Level 3 BLAS[C]//CONPAR 90/VAPP IV:Proceedings of the Joint International Conference on Vector and Parallel Processing.Berlin:Springer-Verlag,1990:223-231.
[20]KWASNIEWSKI G,KABIC M,BEN G N,et al.On the Parallel I/O Optimality of Linear Algebra Kernels[C]//Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis(SC'21).IEEE,2021:1-14.
[21]ABDELFATTAH A,TOMOV S,LUSZCZEK P,et al.GPU-based LU Factorization and Solve on Batches of Matrices with BandStructure [C]//Workshops of the International Conference for High Performance Computing,Networking,Storage and Analysis(SCW 2023).IEEE Computer Society,2023:1672-1679.
[1] LI Fei, LIU Song, GUO Songjian, LIU Jiazheng, ZHANG Ying, HONG Longwei, ZHANG Boxuan. High-performance Image Preprocessing Operators for Cambricon MLU Accelerator Card [J]. Computer Science, 2026, 53(6): 193-202.
[2] JI Liguang, ZHOU Bei, YANG Hongru, ZHOU Yuchang, CUI Mengqi, XU Jinchen. Parallel Detection Method of Maximum Floating-point Error Based on Gridding Particle SwarmOptimization Algorithm [J]. Computer Science, 2026, 53(2): 124-132.
[3] XIE Zhenjie, LIU Yiming, CAI Ruijie, LUO Youqiang. Performance Optimization Method for Domestic Cryptographic Algorithm SM9 [J]. Computer Science, 2025, 52(6): 390-396.
[4] LIAO Qiucheng, ZHOU Yang, LIN Xinhua. Metrics and Tools for Evaluating the Deviation in Parallel Timing [J]. Computer Science, 2025, 52(5): 41-49.
[5] HUANG Chenxi, LI Jiahui, YAN Hui, ZHONG Ying, LU Yutong. Investigation on Load Balancing Strategies for Lattice Boltzmann Method with Local Grid Refinement [J]. Computer Science, 2025, 52(5): 101-108.
[6] LI Qing, JIA Haipeng, ZHANG Yunquan, ZHANG Sijia. Input-aware Generalized Matrix-Vector Product Algorithm for Adaptative PerformanceOptimization of Hygon DCU [J]. Computer Science, 2025, 52(4): 291-300.
[7] ZHANG Manjing, HE Yulin, LI Xu, HUANG Zhexue. Distributed Two-stage Clustering Method Based on Node Sampling [J]. Computer Science, 2025, 52(2): 134-144.
[8] XU He, ZHOU Tao, LI Peng, QIN Fangfang, JI Yimu. LU Parallel Decomposition Optimization Algorithm Based on Kunpeng Processor [J]. Computer Science, 2024, 51(9): 51-58.
[9] ZHONG Zhenyu, LIN Yongliang, WANG Haotian, LI Dongwen, SUN Yufei, ZHANG Yuzhi. Automatic Pipeline Parallel Training Framework for General-purpose Computing Devices [J]. Computer Science, 2024, 51(12): 129-136.
[10] HE Weilong, SU Lingli, GUO Bingxuan, LI Maosen, HAO Yan. Research and Implementation of Dynamic Scene 3D Perception Technology Based on BinocularEstimation [J]. Computer Science, 2024, 51(11A): 240300045-8.
[11] PENG Weidong, GUO Wei, WEI Lin. Reconfigurable Computing System for Parallel Implementation of SVM Training Based on FPGA [J]. Computer Science, 2024, 51(11A): 231100120-7.
[12] WANG Xiaozhong, ZHANG Zuyu. Multi Level Parallel Computing for SW26010 Discontinuous Galerkin Finite Element Algorithm [J]. Computer Science, 2024, 51(11A): 240700055-5.
[13] LI Siyao, LI Shanglin, LUO Jingzhi. Parallel Computing of Reentry Vehicle Trajectory by Multiple Shooting Method Based onOPENMP [J]. Computer Science, 2024, 51(11A): 231000019-6.
[14] ZHAI Xulun, ZHANG Yongguang, JIN Anzhao, QIANG Wei, LI Mengbing. Parallel DVB-RCS2 Turbo Decoding on Multi-core CPU [J]. Computer Science, 2023, 50(6): 22-28.
[15] FENG Chen, GU Jingjing. Efficient Distributed Training Framework for Federated Learning [J]. Computer Science, 2023, 50(11): 317-326.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!