Computer Science, 2024, Vol. 51, Issue (9): 40-50. doi: 10.11896/jsjkx.231000221

• High Performance Computing •

Study on High Performance Computing Container Checkpoint Technology Based on CRIU

CHEN Yiyang1,2, WANG Xiaoning1, YAN Xiaoting1,2, LI Guanlong1,2, ZHAO Yining1, LU Shasha1, XIAO Haili1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2023-10-31 Revised: 2024-03-28 Online: 2024-09-15 Published: 2024-09-10
  • About author: CHEN Yiyang, born in 1997, postgraduate. His main research interests include high performance computing and cloud service.
    WANG Xiaoning, born in 1981, Ph.D., associate professor, Ph.D. supervisor, master's supervisor. Her main research interests include high performance computing, grid computing and cloud service.
  • Supported by:
    Young Scientist Project of National Key R&D Program of China (2021YFB0300800).

Abstract: Fault tolerance is a long-standing and difficult problem in high performance computing (HPC). Checkpointing is a common way to address it: the state of a running process is dumped to files, from which the process can later be restored. Containers provide strong resource isolation, which makes them a well-suited runtime environment and carrier for checkpointing and avoids the anomalies caused by changes in environment and resources when a task lands on a different node after migration. Combining containers with checkpointing can therefore better support the research and implementation of task migration. This paper focuses on the design and optimization of a checkpointing scheme for Singularity based on CRIU (Checkpoint/Restore In Userspace). Based on the characteristics of checkpointing in HPC container applications, it gives effective solutions for the safe use of CRIU, migration performance optimization, and preservation of network state. The paper extends Singularity with checkpointing functionality and implements a prototype tool, Migrator, to verify container migration performance. The work can support the subsequent implementation of HPC task migration.
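
The dump-and-restore idea described in the abstract can be illustrated with a toy sketch. This is not the paper's method and does not use CRIU or Singularity: CRIU checkpoints a process transparently from outside, whereas the sketch below is a simplified application-level checkpoint in Python. The names `run`, `demo`, `fail_at` and the file `task.ckpt` are hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile


def run(steps, ckpt_path, fail_at=None):
    """Sum the integers 0..steps-1, checkpointing after every step.

    If a checkpoint file exists, resume from it instead of starting over.
    `fail_at` is a hypothetical knob that simulates a node failure.
    """
    state = {"i": 0, "total": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # restore: reload the dumped state

    while state["i"] < steps:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["i"]
        state["i"] += 1
        # dump: write to a temporary file, then rename it atomically so a
        # crash in the middle of a write never leaves a corrupt checkpoint
        tmp = ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, ckpt_path)
    return state["total"]


def demo():
    """Fail partway through, then resume from the checkpoint file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "task.ckpt")
        try:
            run(10, path, fail_at=6)  # first attempt dies at step 6
        except RuntimeError:
            pass
        with open(path) as f:
            saved = json.load(f)      # state captured before the failure
        return saved, run(10, path)   # second attempt resumes from it
```

Calling `demo()` returns the state saved before the simulated failure together with the final result of the resumed run. In a migration setting the checkpoint file would be copied to another node before the second call; the sketch keeps both attempts on one machine for simplicity.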

Key words: Container, Checkpoint, High performance computing, Live migration, Fault tolerance

CLC Number: TP311.1