Computer Science, 2024, Vol. 51, Issue (9): 40-50. doi: 10.11896/jsjkx.231000221

• High Performance Computing •

Study on High Performance Computing Container Checkpoint Technology Based on CRIU

CHEN Yiyang1,2, WANG Xiaoning1, YAN Xiaoting1,2, LI Guanlong1,2, ZHAO Yining1, LU Shasha1, XIAO Haili1

  1 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  2 School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2023-10-31 Revised: 2024-03-28 Online: 2024-09-15 Published: 2024-09-10
  • Corresponding author: WANG Xiaoning (wxn@sccas.cn)
  • About author: CHEN Yiyang (chenyiyang@cnic.cn), born in 1997, postgraduate. His main research interests include high performance computing and cloud service.
    WANG Xiaoning, born in 1981, Ph.D., associate professor, Ph.D. supervisor, master's supervisor. Her main research interests include high performance computing, grid computing and cloud service.
  • Supported by:
    Young Scientist Project of National Key R&D Program of China (2021YFB0300800).

Abstract: Fault tolerance has long been both a hot topic and a hard problem in high performance computing (HPC). Checkpointing is a common technique for fault tolerance: it dumps the state of running processes to files and later restores the processes from them. Containers provide strong resource isolation, which makes them a more suitable runtime environment and carrier for checkpointing and avoids the failures that arise when a migrated task lands on a different node whose environment and resources have changed. Combining containers with checkpointing can therefore better support the research and implementation of task migration. This paper focuses on the design and optimization of a Singularity container checkpointing scheme based on CRIU (Checkpoint/Restore In Userspace). Based on the characteristics of checkpointing in HPC container applications, it gives effective solutions for the safe use of CRIU, migration performance optimization, and the preservation of network state. On the basis of these solutions, it extends Singularity with a container checkpointing function and implements a prototype tool, Migrator, to verify container migration performance. This work is expected to provide effective support for the subsequent implementation of HPC task migration.

Key words: Container, Checkpoint, High performance computing, Live migration, Fault tolerance

CLC Number: TP311.1