Computer Science ›› 2026, Vol. 53 ›› Issue (6): 137-144. doi: 10.11896/jsjkx.251000114

• High Performance Computing •

Kokkos-based Direct Solver and Its Implementation on Heterogeneous Platform

LI Zhenjia, WANG Wu   

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
    University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2025-10-27 Revised: 2025-12-22 Online: 2026-06-15 Published: 2026-06-09
  • About author: LI Zhenjia, born in 2000, postgraduate. Her main research interests include parallel computing and the solution of dense linear systems.
    WANG Wu, born in 1982, Ph.D., associate researcher, is a member of CCF (No. 96330M). His main research interests include parallel algorithms and high performance computing.
  • Supported by:
    National Key R&D Program - High Performance Computing Project (2025YFB3003305) and Youth Fund of Computer Network Information Center of Chinese Academy of Sciences (25YF07).

Abstract: A high-performance parallel LU direct solver based on the Kokkos framework is developed for solving large complex dense linear systems arising from the method of moments (MoM) in electromagnetic simulations on accelerator-based heterogeneous systems. The impedance matrix and excitation vector are efficiently filled in parallel on the GPU using Kokkos::parallel_for. Based on the computed solution, the radar cross section (RCS) is obtained on the host by traversing observation angles and synthesizing the scattered field. The overall workflow is efficient and demonstrates good scalability. Performance evaluation is conducted on the "ORISE" supercomputing platform equipped with deep computing unit (DCU) accelerators. The impact of two-dimensional processor grid strategies on performance and communication overhead is analyzed. Under 16 processes, increasing the number of processors per row from 1 to 4 results in a 40.7% performance improvement and a reduction in communication overhead from 64.71% to 55.07%. Proper processor grid configuration effectively balances computation and communication, significantly reducing communication overhead and enhancing overall parallel efficiency. With 64 DCUs, the solver achieves a peak performance of 16,655.73 GFLOP/s, and when scaled to 2,048 DCUs, the performance increases to 58,338.90 GFLOP/s, showing good scalability. In the weak scalability test, the solver attains a parallel efficiency of 24.45% when scaling from 4 to 2,048 DCUs. These results indicate that the Kokkos-based direct solver delivers strong performance and is well-suited for large-scale electromagnetic simulation on heterogeneous high-performance computing platforms.
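
The processor grid comparison in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes a 2D block-cyclic matrix layout with row-major rank numbering, which is a common choice for distributed dense LU but is not stated in the abstract, and it reads "processors per row" as the number of grid columns. The function names `grid_shapes`, `block_owner`, and `blocks_per_rank` are hypothetical.

```python
from typing import List, Tuple


def grid_shapes(nprocs: int) -> List[Tuple[int, int]]:
    """Return every (rows, cols) factorization of nprocs.

    Each pair is a candidate 2D process grid; cols is taken here to mean
    "processors per row" (an assumption about the abstract's wording).
    """
    return [(r, nprocs // r) for r in range(1, nprocs + 1) if nprocs % r == 0]


def block_owner(bi: int, bj: int, rows: int, cols: int) -> int:
    """Rank that owns matrix block (bi, bj).

    Assumes a 2D block-cyclic layout with row-major rank numbering;
    the paper's actual distribution may differ.
    """
    return (bi % rows) * cols + (bj % cols)


def blocks_per_rank(nblocks: int, rows: int, cols: int) -> List[int]:
    """Count how many blocks of an nblocks x nblocks block matrix each rank owns."""
    counts = [0] * (rows * cols)
    for bi in range(nblocks):
        for bj in range(nblocks):
            counts[block_owner(bi, bj, rows, cols)] += 1
    return counts


if __name__ == "__main__":
    # Candidate grids for the 16-process case mentioned in the abstract.
    for rows, cols in grid_shapes(16):
        print(f"{rows:2d} x {cols:2d} grid, blocks per rank:",
              blocks_per_rank(16, rows, cols))
```

Under these assumptions, a 16 x 1 grid (one processor per row) and a 4 x 4 grid (four per row) give each rank the same number of blocks. What changes is how many ranks share each block row and block column, and therefore how panel broadcasts and row swaps are spread across the grid.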

Key words: Kokkos, Method of moments, Parallel LU decomposition, Deep computing unit, Heterogeneous computing

CLC Number: 

  • TP391