Computer Science (计算机科学), 2024, Vol. 51, Issue (9): 40-50. doi: 10.11896/jsjkx.231000221
陈轶阳1,2, 王小宁1, 闫晓婷1,2, 李冠龙1,2, 赵一宁1, 卢莎莎1, 肖海力1
CHEN Yiyang1,2, WANG Xiaoning1, YAN Xiaoting1,2, LI Guanlong1,2, ZHAO Yining1, LU Shasha1, XIAO Haili1
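Abstract: Fault tolerance has long been both a prominent and a difficult problem in high-performance computing (HPC). Checkpointing is a common fault-tolerance technique: it dumps the state of a running process to files and later restores the process from them. Containers provide strong resource isolation and therefore offer a more suitable runtime environment and carrier for checkpointing, preventing a migrated task from failing because its environment and resources changed when it moved to a different node. Combining containers with checkpointing thus better supports the study and implementation of task migration. This paper addresses the design and optimization of a checkpoint scheme for Singularity containers based on CRIU (Checkpoint/Restore In Userspace). Drawing on the characteristics of checkpointing in HPC container applications, it presents effective solutions for using CRIU securely, optimizing migration performance, and preserving network state. Building on these solutions, it extends Singularity with checkpoint functionality and implements a prototype tool, Migrator, to evaluate container migration performance. This work is expected to provide effective support for future implementations of HPC task migration.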