Computer Science ›› 2026, Vol. 53 ›› Issue (6): 117-127.doi: 10.11896/jsjkx.251000118
• High Performance Computing • Previous Articles Next Articles
SUN Xiaoxue1,2, JIA Haipeng2, ZHANG Yunquan2, YU Yue1,2, QIN Pinle1
CLC Number:
| [1]OWENS J D,LUEBKE D,GOVINDARAJU N,et al.A Survey of General-Purpose Computation on Graphics Hardware[J].Computer Graphics Forum,2007,26(1):80-113. [2]MANAVSKI S A.CUDA Compatible GPU as an EfficientHardware Accelerator for AES Cryptography[C]//InternationalConference on Signal Processing and Communications.IEEE,2007:65-68. [3]BLANCHARD P,HIGHAM N J.Mixed precision block fused multiply-add:error analysis and application to GPU tensor cores[J].SIAM Journal on Scientific Computing,2020,42(3):C124-C141. [4]HUSBANDS P,YELICK K.Multi-Threading and One-SidedCommunication in Parallel LU Factorization[C]//Proceedings of the 2007 ACM/IEEE Conference on Supercomputing.ACM,2007:1-10. [5]TOMOV S,NATH R,LTAIEF H,et al.Dense Linear Algebra Solvers for Multicore with GPU Accelerators[C]//Proceedings of the 2010 IEEE International Symposium on Parallel & Dis-tributed Processing,Workshops and PhD Forum(I-PDPSW).IEEE,2010:1-8. [6]AGULLO E,AUGONNET C,DONGARRA J,et al.A Hybridi-zation Methodology for High-Performance Linear Algebra Software for GPUs[M]// GPU Computing Gems Jade Edition.Waltham:Morgan Kaufmann,2012:473-484. [7]DONGARRA J,HAIDAR A,KURZAK J,et al.Model-DrivenOne-Sided Factorizations on Multicore Accelerated Systems[J].Supercomputing Frontiers and Innovations,2014,1(1):85-115. [8]BADER M,BODE A,GERNDT M,et al.Locality Optimization on a NUMA Architecture for Hybrid LU Factorization[C]//Advances in Parallel Computing,25.IOS Press,2014:153-162. [9]HAIDAR A,DONG T X,LUSZCZEK P,et al.Batched MatrixComputations on Hardware Accelerators Based on GPUs[J].The International Journal of High Performance Computing Applications,2015,29(2):193-208. [10]DONG T X,HAIDAR A,LUSZCZEK P,et al.LU Factorization of Small Matrices:Accelerating Batched DGETRF on the GPU[C]//HPCC& CSS& ICESS.IEEE,2015:157-160. [11]KWASNIEWSKI G,KABIC M,BEN-NUN T,et al.On the Pa-rallel I/O Optimality of Linear Algebra Kernels:Near-Optimal Matrix Factorizations[C]//Proceedings of the International Conference for High Performance Computing,Networking,Stora-ge and Analysis(SC'21).IEEE Computer Society,2021:1-14. [12]AZZAM H,AHMAD A,MAWUSSI Z,et al.A Guide for Achieving High Performance with Very Small Matrices on GPU:A Case Study of Batched LU and Cholesky Factorizations[J].IEEE Transactions on Parallel and Distributed Systems,2018,29(5):973-984. [13]CHOW E,PATEL A.Fine-Grained Parallel Incomplete LU Factorization[J].SIAM Journal on Scientific Computing,2015,37(2):C169-C193. [14]SERRE F,PUSCHEL M.Generalizing Block LU Factorization:A Lower-Upper Block Triangular Decomposition with Minimal Off-Diagonal Ranks[J].Linear Algebra and its Applications,2016,509:114-142. [15]LAWSON C,HANSON R J.Basic Linear Algebra Subprograms for Fortran Usage[J].ACM Transactions on Mathematical Software,1979,5(3):308-323. [16]DONGARRA J J,DU CROZ J.An Extended Set of FortranBasic Linear Algebra Subprogram-s[J].ACM Transactions on Mathematical Software,1988,14(1):1-17. [17]ASHCRAFT C,GRIMES R G,LEWIS J G.Accurate Symmetric Indefinite Linear Equation Solvers[J].SIAM Journal on Matrix Analysis and Applications,1998,20(2):513-561. [18]GRIMES R G,LEWIS J G,SIMON H D.A Shifted Block Lanczos Algorithm for Solving Sparse Symmetric Generalized Eigenproblems[J].SIAM Journal on Matrix Analysis and Applications,1994,15(1):228-272. [19]DU CROZ J,MAYES P,RADICATI G.Factorizations of Band Matrices Using Level 3 BLAS[C]//CONPAR 90/VAPP IV:Proceedings of the Joint International Conference on Vector and Parallel Processing.Berlin:Springer-Verlag,1990:223-231. [20]KWASNIEWSKI G,KABIC M,BEN G N,et al.On the Parallel I/O Optimality of Linear Algebra Kernels[C]//Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis(SC'21).IEEE,2021:1-14. [21]ABDELFATTAH A,TOMOV S,LUSZCZEK P,et al.GPU-based LU Factorization and Solve on Batches of Matrices with BandStructure [C]//Workshops of the International Conference for High Performance Computing,Networking,Storage and Analysis(SCW 2023).IEEE Computer Society,2023:1672-1679. |
| [1] | LI Fei, LIU Song, GUO Songjian, LIU Jiazheng, ZHANG Ying, HONG Longwei, ZHANG Boxuan. High-performance Image Preprocessing Operators for Cambricon MLU Accelerator Card [J]. Computer Science, 2026, 53(6): 193-202. |
| [2] | JI Liguang, ZHOU Bei, YANG Hongru, ZHOU Yuchang, CUI Mengqi, XU Jinchen. Parallel Detection Method of Maximum Floating-point Error Based on Gridding Particle SwarmOptimization Algorithm [J]. Computer Science, 2026, 53(2): 124-132. |
| [3] | XIE Zhenjie, LIU Yiming, CAI Ruijie, LUO Youqiang. Performance Optimization Method for Domestic Cryptographic Algorithm SM9 [J]. Computer Science, 2025, 52(6): 390-396. |
| [4] | LIAO Qiucheng, ZHOU Yang, LIN Xinhua. Metrics and Tools for Evaluating the Deviation in Parallel Timing [J]. Computer Science, 2025, 52(5): 41-49. |
| [5] | HUANG Chenxi, LI Jiahui, YAN Hui, ZHONG Ying, LU Yutong. Investigation on Load Balancing Strategies for Lattice Boltzmann Method with Local Grid Refinement [J]. Computer Science, 2025, 52(5): 101-108. |
| [6] | LI Qing, JIA Haipeng, ZHANG Yunquan, ZHANG Sijia. Input-aware Generalized Matrix-Vector Product Algorithm for Adaptative PerformanceOptimization of Hygon DCU [J]. Computer Science, 2025, 52(4): 291-300. |
| [7] | ZHANG Manjing, HE Yulin, LI Xu, HUANG Zhexue. Distributed Two-stage Clustering Method Based on Node Sampling [J]. Computer Science, 2025, 52(2): 134-144. |
| [8] | XU He, ZHOU Tao, LI Peng, QIN Fangfang, JI Yimu. LU Parallel Decomposition Optimization Algorithm Based on Kunpeng Processor [J]. Computer Science, 2024, 51(9): 51-58. |
| [9] | ZHONG Zhenyu, LIN Yongliang, WANG Haotian, LI Dongwen, SUN Yufei, ZHANG Yuzhi. Automatic Pipeline Parallel Training Framework for General-purpose Computing Devices [J]. Computer Science, 2024, 51(12): 129-136. |
| [10] | HE Weilong, SU Lingli, GUO Bingxuan, LI Maosen, HAO Yan. Research and Implementation of Dynamic Scene 3D Perception Technology Based on BinocularEstimation [J]. Computer Science, 2024, 51(11A): 240300045-8. |
| [11] | PENG Weidong, GUO Wei, WEI Lin. Reconfigurable Computing System for Parallel Implementation of SVM Training Based on FPGA [J]. Computer Science, 2024, 51(11A): 231100120-7. |
| [12] | WANG Xiaozhong, ZHANG Zuyu. Multi Level Parallel Computing for SW26010 Discontinuous Galerkin Finite Element Algorithm [J]. Computer Science, 2024, 51(11A): 240700055-5. |
| [13] | LI Siyao, LI Shanglin, LUO Jingzhi. Parallel Computing of Reentry Vehicle Trajectory by Multiple Shooting Method Based onOPENMP [J]. Computer Science, 2024, 51(11A): 231000019-6. |
| [14] | ZHAI Xulun, ZHANG Yongguang, JIN Anzhao, QIANG Wei, LI Mengbing. Parallel DVB-RCS2 Turbo Decoding on Multi-core CPU [J]. Computer Science, 2023, 50(6): 22-28. |
| [15] | FENG Chen, GU Jingjing. Efficient Distributed Training Framework for Federated Learning [J]. Computer Science, 2023, 50(11): 317-326. |
|
||