Computer Science, 2024, Vol. 51, Issue 9: 51-58. doi: 10.11896/jsjkx.230900079

• High Performance Computing •

LU Parallel Decomposition Optimization Algorithm Based on Kunpeng Processor

XU He1,2, ZHOU Tao1,2, LI Peng1,2, QIN Fangfang3, JI Yimu1,2   

  1 School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
    2 Jiangsu HPC and Intelligent Processing Engineering Research Center, Nanjing 210023, China
    3 College of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Received: 2023-09-14  Revised: 2023-12-22  Online: 2024-09-15  Published: 2024-09-10
  • About author: XU He, born in 1985, Ph.D, professor, master supervisor, is a senior member of CCF (No.19957S). His main research interests include big data and Internet of Things technology.
    LI Peng, born in 1979, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.48573M). His main research interests include computer communication networks, cloud computing and information security.
  • Supported by:
    The work was supported by the National Natural Science Foundation of China (62102194, 62102196), the Six Talent Peaks Project of Jiangsu Province (RJFW-111), the Postgraduate Research and Practice Innovation Program of Jiangsu Province (SJCX22_0267, SJCX22_0275) and the Huawei Kunpeng Zhongzhi Plan (2022w241, 2022w243).

Abstract: The Scalable Linear Algebra PACKage (ScaLAPACK) is a parallel computing library for distributed-memory MIMD (multiple instruction, multiple data) parallel computers, and it is widely used to develop parallel applications built on linear algebra operations. However, its LU decomposition routines are not communication-optimal and do not take full advantage of current parallel architectures. To address these problems, a parallel LU factorization optimization algorithm (PLF) based on the Kunpeng processor is proposed to achieve load balancing and to adapt to the domestic Kunpeng environment. PLF handles the data in each process's partition differently: a portion of every process's data is gathered to the root process for computation, and once that computation finishes, the root process scatters the results back to the sub-processes, which makes full use of CPU resources and achieves load balancing. Tests are performed on a single node with Intel 9320R processors and with Kunpeng 920 processors, using Intel MKL (Math Kernel Library) on the Intel platform and the PLF algorithm on the Kunpeng platform. Comparing the two platforms on linear systems of different sizes shows that the Kunpeng platform solves them with a clear performance advantage. In the NUMA-process, single-thread configuration, the optimized computing efficiency averages 4.35% at small scale, 215% higher than Intel's 1.38%; 4.24% at medium scale, 128% higher than Intel's 1.86%; and 4.24% at large scale, 113% higher than Intel's 1.99%.
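To make the gather-compute-scatter pattern in the abstract concrete, the following minimal MPI sketch in C gathers each process's rows at the root, factors the assembled matrix there, and scatters the factored rows back. It illustrates the communication round trip only and is not the authors' PLF implementation: PLF operates on ScaLAPACK's 2D block-cyclic layout with partial pivoting and BLAS-3 updates, whereas this toy assumes a hypothetical PANEL_ROWS row count per process and uses an unblocked, unpivoted LU on a diagonally dominant matrix so pivoting can be safely omitted.

/* Illustrative sketch only: a toy "gather at root, factor, scatter back"
 * step in the spirit of PLF as described in the abstract. PANEL_ROWS and
 * the unpivoted LU are assumptions made for the sake of a small example. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define PANEL_ROWS 4   /* rows contributed by each process (hypothetical size) */

/* Unblocked in-place LU without pivoting: A = L*U, unit lower triangle in L. */
static void lu_nopiv(double *a, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            a[i * n + k] /= a[k * n + k];          /* multiplier l_ik */
            for (int j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int n = nprocs * PANEL_ROWS;                   /* global matrix order */
    double *local = malloc(sizeof(double) * PANEL_ROWS * n);
    for (int i = 0; i < PANEL_ROWS; ++i)           /* diagonally dominant rows */
        for (int j = 0; j < n; ++j)
            local[i * n + j] = (rank * PANEL_ROWS + i == j) ? (double)n : 1.0;

    double *global = NULL;
    if (rank == 0) global = malloc(sizeof(double) * n * n);

    /* Each process sends its rows to the root ... */
    MPI_Gather(local, PANEL_ROWS * n, MPI_DOUBLE,
               global, PANEL_ROWS * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) lu_nopiv(global, n);            /* root factors the matrix */

    /* ... and the root scatters the factored rows back. */
    MPI_Scatter(global, PANEL_ROWS * n, MPI_DOUBLE,
                local, PANEL_ROWS * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) { printf("LU done, n=%d on %d ranks\n", n, nprocs); free(global); }
    free(local);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run under, e.g., mpirun -np 4, this factors a 16x16 matrix and exercises the same gather-factor-scatter round trip the abstract describes for keeping the root process's CPU resources busy.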

Key words: ScaLAPACK, LU factorization, parallel computing, MKL

CLC Number: TP311