Review on the Development and Application of Checkpointing Technology in High-performance Computing

doi:10.11896/jsjkx.231000220

Abstract

As high-performance computers grow in size and complexity, application fault tolerance has become one of the key challenges facing exascale computing. Checkpointing is one of the main means of achieving application fault tolerance: it enables fault recovery by periodically saving the execution state of an application. This paper reviews the development and application of checkpointing techniques for high-performance computing. First, the development of checkpointing technology in the field of high-performance computing is summarized. Then, system-level and application-level checkpointing work is described according to the level at which it operates, including the mainstream software tools, the available checkpointing techniques, and their application scenarios. Next, the application of checkpointing technology is discussed in four areas: fault tolerance and resilience in parallel computing, scheduling and migration in HPC, FPGA debugging, and fault tolerance and faithful replay in deep learning. Finally, further research directions for checkpointing technology in high-performance computing are proposed.

Key words: Checkpointing, High performance computing, Fault tolerance, Scheduling, Job migration

CLC Number: TP311.1

YAN Xiaoting, WANG Xiaoning, DONG Sheng, ZHAO Yining, XIAO Haili. Review on the Development and Application of Checkpointing Technology in High-performance Computing [J]. Computer Science, 2024, 51(9): 1-14.

