Computer Science ›› 2024, Vol. 51 ›› Issue (9): 51-58. doi: 10.11896/jsjkx.230900079

• High Performance Computing •


LU Parallel Decomposition Optimization Algorithm Based on Kunpeng Processor

XU He1,2, ZHOU Tao1,2, LI Peng1,2, QIN Fangfang3, JI Yimu1,2   

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
    2. Jiangsu HPC and Intelligent Processing Engineering Research Center, Nanjing 210023, China
    3. College of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received: 2023-09-14  Revised: 2023-12-22  Online: 2024-09-15  Published: 2024-09-10
  • Corresponding author: LI Peng (lipeng@njupt.edu.cn)
  • About author: XU He (xuhe@njupt.edu.cn), born in 1985, Ph.D, professor, master supervisor, is a senior member of CCF (No.19957S). His main research interests include big data and Internet of Things technology.
    LI Peng, born in 1979, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.48573M). His main research interests include computer communication networks, cloud computing and information security.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (62102194, 62102196), the Six Talent Peaks Project of Jiangsu Province (RJFW-111), the Postgraduate Research and Practice Innovation Program of Jiangsu Province (SJCX22_0267, SJCX22_0275) and the Huawei Kunpeng Zhongzhi Plan (2022w241, 2022w243).


Abstract: Scalable Linear Algebra PACKage (ScaLAPACK) is a parallel computing package for distributed-memory MIMD (multiple instruction, multiple data) parallel computers, widely used to develop parallel applications based on linear algebra operations. During LU decomposition, however, the routines in the ScaLAPACK library are not communication-optimal and do not take full advantage of current parallel architectures. To address this problem, a parallel LU factorization optimization algorithm (PLF) based on the Kunpeng processor is proposed, which achieves load balancing and is adapted to the domestic Kunpeng environment. PLF treats the data of different partitions of different processes differently: part of each process's data is assigned to the root process for computation, and once the computation completes, the root process scatters the results back to the sub-processes, which helps to fully utilize CPU resources and achieve load balancing. Tests are performed on a single node with an Intel 9320R processor and on one with a Kunpeng 920 processor; Intel MKL (Math Kernel Library) is used on the Intel platform and the PLF algorithm on the Kunpeng platform. Comparing the performance of solving systems of equations of different scales on the two platforms shows that the Kunpeng platform has a significant advantage. With as many processes as NUMA nodes and a single thread per process, the optimized computing efficiency averages 4.35% at small scale, 215% higher than Intel's 1.38%; 4.24% at medium scale, 118% higher than Intel's 1.86%; and 4.24% at large scale, 113% higher than Intel's 1.99%.

Key words: ScaLAPACK, LU factorization, Parallel computing, MKL
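
The gather-compute-scatter step the abstract attributes to PLF can be illustrated with a short MPI program. The sketch below is a hypothetical reconstruction, not the authors' PLF code: the panel size N, the contiguous block-row distribution, the synthetic diagonally dominant data, and the unpivoted lu_inplace kernel are all illustrative assumptions; the real PLF presumably relies on blocked, pivoted ScaLAPACK-style kernels.

/*
 * Hypothetical sketch of the gather-compute-scatter pattern described in
 * the abstract.  NOT the authors' PLF implementation: N, the block-row
 * layout, the test data, and the LU kernel are illustrative stand-ins.
 * Build: mpicc plf_sketch.c -o plf_sketch && mpirun -np 2 ./plf_sketch
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8   /* global panel is N x N; N is assumed divisible by nprocs */

/* Unpivoted in-place LU (Doolittle) on the root; stands in for the
 * blocked, pivoted factorization kernel a real implementation would use. */
static void lu_inplace(double *a, int n)
{
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            a[i * n + k] /= a[k * n + k];            /* multiplier of L */
            for (int j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;                 /* contiguous block-rows per process */
    double *slice = malloc((size_t)rows * N * sizeof *slice);

    /* Synthetic, diagonally dominant data so the unpivoted LU is stable. */
    for (int i = 0; i < rows * N; ++i)
        slice[i] = (i % N == rank * rows + i / N) ? N + 1.0 : 1.0;

    double *panel = NULL;
    if (rank == 0)
        panel = malloc((size_t)N * N * sizeof *panel);

    /* 1. Gather every process's slice onto the root ...           */
    MPI_Gather(slice, rows * N, MPI_DOUBLE,
               panel, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* 2. ... factor the assembled panel on the root ...           */
    if (rank == 0)
        lu_inplace(panel, N);

    /* 3. ... and scatter the factored rows back to their owners.  */
    MPI_Scatter(panel, rows * N, MPI_DOUBLE,
                slice, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("panel factored on root and redistributed\n");

    free(slice);
    free(panel);
    MPI_Finalize();
    return 0;
}

Under these assumptions, concentrating the panel on the root trades some communication for a single well-balanced factorization step, which is the load-balancing idea the abstract describes; whether that trade pays off depends on panel size and interconnect, as the paper's Kunpeng-versus-Intel measurements suggest.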

CLC Number: TP311