Research on Multi-level Optimization of SP Applications for Domestic Phytium Multi-core NUMA Architecture Servers

doi:10.11896/jsjkx.251200067

Computer Science, 2026, Vol. 53, Issue (6): 185-192. doi: 10.11896/jsjkx.251200067

High Performance Computing

Research on Multi-level Optimization of SP Applications for Domestic Phytium Multi-core NUMA Architecture Servers

REN Rongyao^1,2,5, MA Baiwei^3,4,5, DENG Guanghua^2, DU Qi^6, WANG Yueli^4, LI Shiyan^3,4,5

1 College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
2 Tianjin Advanced Technology Research Institute, Tianjin 300450, China
3 Tianjin Qisuo Precision Electromechanical Technology Co., Ltd., Tianjin 300131, China
4 Tianjin Navigation Instruments Research Institute, Tianjin 300131, China
5 Tianjin Key Laboratory of Special Severe Environment Computer, Tianjin 300450, China
6 Phytium Technology Co., Ltd., Tianjin 300450, China

Received: 2025-12-10 Revised: 2026-03-20 Online: 2026-06-15 Published: 2026-06-09
About author: REN Rongyao, born in 1983, Ph.D. candidate, researcher. His main research interests include computer architecture, sonar signal processing and machine learning.
DENG Guanghua, born in 1994, master, assistant engineer. His main research interests include high performance computing and software development.
Supported by: Tianjin Key Laboratory of Special Severe Environment Computer Open Foundation (202503).

Abstract

This paper addresses the application bottlenecks of domestic Phytium multi-core NUMA architecture server platforms in high-performance computing scenarios and conducts multi-level optimization research on the SP benchmark application. It proposes and implements optimization strategies at four levels: compilation, memory allocation, NUMA topology, and vectorized reduction. The experiments analyze execution time, parallel efficiency, and speedup for different dataset sizes, running with 1 MPI process and with 8 MPI processes across varying numbers of parallel cores. The analysis shows that the optimizations significantly reduce computation time. With 8 MPI processes and 128 parallel cores, performance on medium to large datasets improves by a factor of 3 to 5, and the performance degradation seen under high concurrency is alleviated. After optimization, parallel efficiency scales more nearly linearly: it improves several-fold for small datasets with 8 MPI processes, while medium to large datasets maintain approximately 90%, 85%, and 64% to 71% efficiency at 32, 64, and 128 parallel cores respectively, which delays the onset of performance saturation. In terms of speedup, 8 MPI processes yield larger gains than 1 MPI process under high concurrency. The experimental results demonstrate that the proposed multi-level optimization strategies effectively enhance the computational performance of SP applications on the target server architecture. In particular, as the number of cores increases and NUMA effects become more pronounced, the optimization scheme shows strong scalability advantages, providing an optimization path for numerical simulation and scientific computing on domestic high-performance computing platforms.
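As a reading aid for the metrics named above, speedup and parallel efficiency are conventionally defined as S(p) = T(1)/T(p) and E(p) = S(p)/p. The sketch below assumes these standard definitions with a single-core baseline; the abstract does not state which baseline the authors use, and the function names and timing values are hypothetical placeholders, not measurements from the paper.

```python
def speedup(t_base: float, t_parallel: float) -> float:
    """Speedup relative to a baseline run: S = T_base / T_parallel."""
    if t_base <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_base / t_parallel


def parallel_efficiency(t_base: float, t_parallel: float, cores: int) -> float:
    """Parallel efficiency E = S / cores, assuming a single-core baseline."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup(t_base, t_parallel) / cores


# Hypothetical timings in seconds (NOT from the paper), keyed by core count.
timings = {1: 640.0, 32: 22.0, 64: 12.0, 128: 7.5}

for cores, t in timings.items():
    s = speedup(timings[1], t)
    e = parallel_efficiency(timings[1], t, cores)
    print(f"{cores:>4} cores: speedup {s:6.2f}, efficiency {e:6.1%}")
```

Under these definitions, an efficiency that stays close to 100% as the core count grows corresponds to near-linear scaling, and a falling efficiency indicates the kind of saturation the abstract describes.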

Key words: SP applications, Non-uniform memory access, MPI, OpenMP, NEON

CLC Number: TP311

REN Rongyao, MA Baiwei, DENG Guanghua, DU Qi, WANG Yueli, LI Shiyan. Research on Multi-level Optimization of SP Applications for Domestic Phytium Multi-core NUMA Architecture Servers[J]. Computer Science, 2026, 53(6): 185-192.

