Computer Science ›› 2024, Vol. 51 ›› Issue (9): 51-58. doi: 10.11896/jsjkx.230900079
徐鹤1,2, 周涛1,2, 李鹏1,2, 秦芳芳3, 季一木1,2
XU He1,2, ZHOU Tao1,2, LI Peng1,2, QIN Fangfang3, JI Yimu1,2
Abstract: ScaLAPACK (Scalable Linear Algebra PACKage) is a parallel computing library for distributed-memory MIMD (Multiple Instruction, Multiple Data) parallel computers and is widely used to develop parallel applications built on linear-algebra operations. However, its LU factorization routines are not communication-optimal and do not fully exploit current parallel architectures. To address this problem, an optimized parallel LU factorization algorithm (Parallel LU Factorization, PLF) is proposed for the Kunpeng processor; it achieves load balancing and is adapted to the domestic Kunpeng environment. PLF handles the data of different partitions of different processes differently: the partial data owned by each process is gathered to the root process for computation and then scattered by the root back to the child processes, which helps make full use of CPU resources and achieve load balancing. Tests are conducted on a single node with an Intel 9320R processor and with a Kunpeng 920 processor, using Intel MKL (Math Kernel Library) on the Intel platform and the PLF algorithm on the Kunpeng platform. A comparison of the two platforms on solving systems of linear equations of different sizes shows that the Kunpeng platform has a clear performance advantage. With the number of processes equal to the number of NUMA nodes and one thread per process, the optimized computational efficiency averages 4.35% on small-scale problems, a 215% improvement over Intel's 1.38%; 4.24% on medium-scale problems, a 118% improvement over the Intel platform's 1.86%; and 4.24% on large-scale problems, a 113% improvement over Intel's 1.99%.
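The gather-compute-scatter pattern described in the abstract can be illustrated with a minimal MPI sketch. This is only an illustration of the communication pattern under stated assumptions, not the authors' PLF implementation: the panel size PANEL_ELEMS, the placeholder data, and the root-side computation are all hypothetical stand-ins.

```c
/* Minimal sketch of the gather-compute-scatter pattern described in the
 * abstract. All names and sizes are hypothetical illustrations, not the
 * authors' PLF code. Assumes an MPI environment. */
#include <mpi.h>
#include <stdlib.h>

#define PANEL_ELEMS 1024  /* hypothetical number of elements each rank owns */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process owns a partition of the panel data (placeholder values). */
    double *local = malloc(PANEL_ELEMS * sizeof(double));
    for (int i = 0; i < PANEL_ELEMS; ++i)
        local[i] = (double)(rank + 1);

    /* Root collects the partial data owned by every process. */
    double *gathered = NULL;
    if (rank == 0)
        gathered = malloc((size_t)nprocs * PANEL_ELEMS * sizeof(double));
    MPI_Gather(local, PANEL_ELEMS, MPI_DOUBLE,
               gathered, PANEL_ELEMS, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Placeholder for the root-side computation (in the paper's setting
         * this would be the panel step of the LU factorization). */
        for (int i = 0; i < nprocs * PANEL_ELEMS; ++i)
            gathered[i] *= 2.0;
    }

    /* Root scatters the updated partitions back to the child processes. */
    MPI_Scatter(gathered, PANEL_ELEMS, MPI_DOUBLE,
                local, PANEL_ELEMS, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(local);
    if (rank == 0) free(gathered);
    MPI_Finalize();
    return 0;
}
```

In a real LU panel step the per-process partition sizes generally differ, so variable-count collectives such as MPI_Gatherv/MPI_Scatterv would be used instead of the fixed-size MPI_Gather/MPI_Scatter shown here; the fixed counts only keep the sketch self-contained.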