Computer Science ›› 2025, Vol. 52 ›› Issue (5): 11-24. doi: 10.11896/jsjkx.240500103

• High Performance Computing •

Performance Evaluation and Optimization of Operating System for Domestic Supercomputer

GAO Yiqin, LUO Zhiyu, WANG Yichao, LIN Xinhua

  1. Network & Information Center, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2024-05-23 Revised: 2024-09-05 Online: 2025-05-15 Published: 2025-05-12
  • Corresponding author: LIN Xinhua (james@sjtu.edu.cn)
  • About author: (gaoyiqin95@sjtu.edu.cn) GAO Yiqin, born in 1995, Ph.D, engineer, is a member of CCF (No. L2004M). Her main research interests include high performance computing and task scheduling algorithm design.
    LIN Xinhua, born in 1979, Ph.D, senior engineer, is a member of CCF (No. 23737D). His main research interests include high performance computing and computer architecture.
  • Supported by:
    National Key Research and Development Program of China (2023YFB3002001).

Abstract: Supercomputers play a crucial role in supporting scientific computing applications. During the 14th Five-Year Plan period, China is building post-exascale domestic supercomputers. As one of the core system software components of a supercomputer, the operating system introduces overhead that affects the performance of the whole machine, so OS evaluation is an important research topic for the technical roadmap of the new generation of domestic supercomputers. Among existing domestic OSs, openEuler offers good performance and compatibility on systems equipped with Kunpeng processors, but it has not yet been applied at large scale on supercomputers. It is therefore necessary to evaluate its performance comprehensively and to optimize the bottlenecks that are found. Our work is divided into two parts. 1) We evaluate the compatibility of openEuler and its performance when running HPC applications, using CentOS as the reference for comparison. The results show that for applications that are not collective-communication-intensive, the performance of openEuler is comparable to that of CentOS. However, when OpenMPI is used for collective communication operations such as Allreduce, performance on openEuler decreases by up to 76.83%, and at thousand-core scale the parallel efficiency of communication-intensive applications on openEuler decreases by up to 23.01%. 2) Based on the MPI collective communication performance issues identified during the evaluation, we propose a performance modeling and optimization method. The method builds on the Hockney model of point-to-point communication to model each algorithm that implements a collective operation. It predicts communication time for different numbers of processes and message sizes, and uses the predictions to select a suitable algorithm. Through the MCA interface of OpenMPI, the algorithm selection can be adjusted dynamically at runtime. After optimization, the performance of HPC applications on openEuler improves significantly, with running time reduced by up to 26%.

Key words: High-performance computing, Domestic supercomputer, Operating system, Performance evaluation, Collective communication performance
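
The model-based selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the Hockney point-to-point cost T(m) = alpha + beta * m and plugs it into widely used textbook cost formulas for three Allreduce algorithms, ignoring the cost of the reduction arithmetic. The function names, the algorithm labels, and the values chosen for alpha (latency in seconds) and beta (seconds per byte) are hypothetical and would have to be measured on the target system.

```python
import math


def hockney(alpha, beta, m):
    """Hockney point-to-point time for an m-byte message: latency + m * inverse bandwidth."""
    return alpha + beta * m


def allreduce_predictions(p, m, alpha, beta):
    """Predicted Allreduce time per algorithm for p processes and m bytes.

    Textbook cost formulas (assumed here, not taken from the paper);
    reduction compute cost is ignored and p is treated as a power of two.
    """
    steps = math.ceil(math.log2(p))
    return {
        # log2(p) exchange steps, each moving the full message
        "recursive_doubling": steps * hockney(alpha, beta, m),
        # 2(p-1) steps, each moving one chunk of m/p bytes
        "ring": 2 * (p - 1) * hockney(alpha, beta, m / p),
        # reduce-scatter followed by allgather (Rabenseifner-style)
        "rabenseifner": 2 * steps * alpha + 2 * ((p - 1) / p) * beta * m,
    }


def select_algorithm(p, m, alpha, beta):
    """Return the name of the algorithm with the smallest predicted time."""
    predictions = allreduce_predictions(p, m, alpha, beta)
    return min(predictions, key=predictions.get)


if __name__ == "__main__":
    alpha, beta = 1e-6, 1e-9  # hypothetical: 1 microsecond latency, about 1 GB/s
    for size in (8, 1 << 20):
        print(size, select_algorithm(1024, size, alpha, beta))
```

Under these assumed parameters the latency-bound small message favors recursive doubling, while the bandwidth-bound large message favors the reduce-scatter plus allgather scheme. In Open MPI, a choice like this is typically applied through MCA parameters of the tuned collective component (for example `coll_tuned_use_dynamic_rules` together with `coll_tuned_allreduce_algorithm`); the exact parameter names and algorithm IDs vary by version and should be checked with `ompi_info`.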

CLC Number:

  • TP316