Computer Science ›› 2024, Vol. 51 ›› Issue (9): 1-14.doi: 10.11896/jsjkx.231000220
• High Performance Computing • Previous Articles Next Articles
YAN Xiaoting1,2, WANG Xiaoning1, DONG Sheng1,2, ZHAO Yining1, XIAO Haili1
CLC Number:
[1]KUMAR M,MOLLA A R.On the Message Complexity ofFault-Tolerant Computation:Leader Election and Agreement[C]//Proceedings of the 24th International Conference on Distributed Computing and Networking.Kharagpur India:ACM,2023:294-295. [2]LIN L,HUANG Y,LIN Y,et al.FFNLFD:Fault Diagnosis of Multiprocessor Systems at Local Node with Fault-Free Neighbors under PMC Model and MM* Model[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(7):1739-1751. [3]YOUNESS H,OMAR A,MONESS M.An Optimized Weighted Average Makespan in Fault-Tolerant Heterogeneous MPSoCs[J].IEEE Transactions on Parallel and Distributed Systems,2021,32(8):1933-1946. [4]PSISTAKIS A,CHRYSOS N,CHAI X F,et al.Optimized Page Fault Handling During RDMA[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(12):3990-4005. [5]LEE Y L,LIANG D,WANG W J.Optimal Online LivenessFault Detection for Multilayer Cloud Computing Systems[J].IEEE Transactions on Dependable and Secure Computing,2022,19(5):3464-3477. [6]PALAZZI L,LI G,FANG B,et al.Improving the Accuracy of IR-Level Fault Injection[J].IEEE Transactions on Dependable and Secure Computing,2022,19(1):243-258. [7]ANSARI M,SAFARI S,KHDR H,et al.Power-Aware Checkpointing for Multicore Embedded Systems[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(12):4410-4424. [8]KHORGUANI A,ROPARS T,DE PALMA N.ResPCT:fastcheckpointing in non-volatile memory for multi-threaded applications[C]//Proceedings of the Seventeenth European Confe-rence on Computer Systems.Rennes France:ACM,2022:525-540. [9]BEIGI M V,CAO Y,GURUMURTHI S,et al.A SystematicStudy of DDR4 DRAM Faults in the Field[C]//2023 IEEE International Symposium on High-Performance Computer Architecture(HPCA).Montreal,QC,Canada:IEEE,2023:991-1002. [10]ROJAS E,PEREZ D,CALHOUN J C,et al.Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration[C]//2021 IEEE International Conference on Cluster Computing(CLUSTER).Portland,OR,USA:IEEE,2021:492-503. [11]BORGHESI A,MOLAN M,MILANO M,et al.Anomaly Detection and Anticipation in High Performance Computing Systems[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(4):739-750. [12]ZHAO K,DI S,LI S,et al.FT-CNN:Algorithm-Based Fault Tolerance for Convolutional Neural Networks[J].IEEE Tran-sactions on Parallel and Distributed Systems 2021,32(7):1677-1689. [13]ZOU A,LI J,GILL C D,et al.RTGPU:Real-Time GPU Sche-duling of Hard Deadline Parallel Tasks With Fine-Grain Utilization[J].IEEE Transactions on Parallel and Distributed Systems,2023,34(5):1450-1465. [14]MAURYA A,NICOLAE B,RAFIQUE M M,et al.TowardsEfficient I/O Scheduling for Collaborative Multi-Level Checkpointing[C]//2021 29th International Symposium on Modeling,Analysis,and Simulation of Computer and Telecommunication Systems(MASCOTS).Houston,TX,USA:IEEE,2021:1-8. [15]LITZKOW M,TANNENBAUM T,BASNEY J,et al.Checkpoint and migration of UNIX processes in the Condor distributed processing system[R].University of Wisconsin-Madison Department of Computer Sciences,1997. [16]HARGROVE P H,DUELL J C.Berkeley lab checkpoint/restart(BLCR) for Linux clusters[J].Journal of Physics:Conference Series,2006,46:494-499. [17]ANSEL J,ARYA K,COOPERMAN G.DMTCP:Transparentcheckpointing for cluster computations and the desktop[C]//2009 IEEE International Symposium on Parallel & Distributed Processing.Rome:IEEE,2009:1-12. [18]CRIU.CRIU[EB/OL].[2023-07-16].https://criu.org/Main_Page. [19]SCR.Scalable Checkpoint/Restart(SCR)User Guide[EB/OL].[2023-7-16].https://scr.readthedocs.io/en/latest/#. [20]NICOLAE B,MOODY A,GONSIOROWSKI E,et al.VeloC:Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale[C]//2019 IEEE International Parallel and Distributed Processing Symposium(IPDPS).Rio de Janeiro,Brazil:IEEE,2019:911-920. [21]The Open MPI Community.User-Level Fault Mitigation(ULFM) [EB/OL].[2023-07-16].https://docs.open-mpi.org/en/v5.0.x/features/ulfm.html. [22]WEEKS N,LUECKE G,MARIS P,et al.Challenges in Developing MPI Fault-Tolerant Fortran Applications[C]//2018 IEEE International Conference on Cluster Computing(CLUSTER).Belfast:IEEE,2018:524-531. [23]GAMELL M,KATZ D S,KOLLA H,et al.Exploring Automa-tic,Online Failure Recovery for Scientific Applications at Extreme Scales[C]//SC14:International Conference for High Performance Computing,Networking,Storage and Analysis.New Orleans,LA,USA:IEEE,2014:895-906. [24]GAMELL M,TERANISHI K,HEROUX M A,et al.Local recovery and failure masking for stencil-based applications at extreme scales[C]//Proceedings of the International Conference for High Performance Computing,Networking,Storage and Analysis.Austin Texas:ACM,2015:1-12. [25]LEÓN B,FRANCO D,REXACHS D,et al.Analysis of parallel application checkpoint storage for system configuration[J].The Journal of Supercomputing,2021,77(5):4582-4617. [26]WANG X,LEIDEL J D,WILLIAMS B,et al.xBGAS:A GlobalAddress Space Extension on RISC-V for High Performance Computing[C]//2021 IEEE International Parallel and Distributed Processing Symposium(IPDPS).Portland,OR,USA:IEEE,2021:454-463. [27]DU Y,MARCHAL L,PALLEZ G,et al.Optimal Checkpointing Strategies for Iterative Applications[J].IEEE Transactions on Parallel and Distributed Systems,2022,33(3):507-522. [28]SIGDEL P,YUAN X,TZENG N F.Realizing Best Checkpoin-ting Control in Computing Systems[J].IEEE Transactions on Parallel and Distributed Systems,2021,32(2):315-329. [29]RAYBON G,ADAMIECKI A,CHO J,et al.Single-carrier all-ETDM 1.08-Terabit/s line rate PDM-64-QAM transmitter using a high-speed 3-bit multiplexing DAC[C]//2015 IEEE Photonics Conference(IPC).Reston,VA:IEEE,2015:1-2. [30]DEY T,SATO K,NICOLAE B,et al.Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning[C]//2020 IEEE International Parallel and Distributed Processing Symposium Workshops(IPDPSW).New Orleans,LA,USA:IEEE,2020:1036-1043. [31]MO Y,XING L,LIN Y K,et al.Efficient Analysis of Repairable Computing Systems Subject to Scheduled Checkpointing[J].IEEE Transactions on Dependable and Secure Computing,2021,18(1):1-14. [32]LEÓN B,MÉNDEZ S,FRANCO D,et al.A model of checkpoint behavior for applications that have I/O[J].The Journal of Supercomputing,2022,78(13):15404-15436. [33]ZHOU T,GAO L,GUAN X.A Fault-Tolerant DistributedFramework for Asynchronous Iterative Computations[J].IEEE Transactions on Parallel and Distributed Systems,2021,32(8):2062-2073. [34]MAURYA A,NICOLAE B,RAFIQUE M M,et al.TowardsEfficient Cache Allocation for High-Frequency Checkpointing[C]//2022 IEEE 29th International Conference on High Performance Computing,Data,and Analytics(HiPC).Bengaluru,India:IEEE,2022:262-271. [35]ANSEL J,ARYA K,COOPERMAN G.DMTCP:Transparentcheckpointing for cluster computations and the desktop[C]//2009 IEEE International Symposium on Parallel & Distributed Processing.Rome:IEEE,2009:1-12. [36]PLANK J S,BECK M,KINGSLEY G,et al.Libckpt:Transpa-rent checkpointing under unix[M].Computer Science Department,1994. [37]HONG G,AHN S J,HAN S C,et al.Kckpt:checkpoint and re-covery facility on unixware kernel[C]//Proceedings of the 15th International Conference on Computers and Their Applications(ISCA).2000. [38]TAKIZAWA H,SATO K,KOMATSU K,et al.CheCUDA:A Checkpoint/Restart Tool for CUDA Applications[C]//2009 International Conference on Parallel and Distributed Computing,Applications and Technologies.Higashi Hiroshima,Japan:IEEE,2009:408-413. [39]NUKADA A,TAKIZAWA H,MATSUOKA S.NVCR:ATransparent Checkpoint-Restart Library for NVIDIA CUDA[C]//2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.Anchorage,AK,USA:IEEE,2011:104-113. [40]SHI L,CHEN H,LI T.Hybrid CPU/GPU Checkpoint forGPU-Based Heterogeneous Systems[M]//Parallel Computational Fluid Dynamics:Vol.405.Berlin,Heidelberg:Springer,2014:470-481. [41]LOSADA N,CORES I,MARTÍN M J,et al.Resilient MPI applications using an application-level checkpointing framework and ULFM[J].The Journal of Supercomputing,2017,73(1):100-113. [42]LOSADA N,MARTÍN M J,GONZÁLEZ P.Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications[J].The Journal of Supercomputing,2017,73(1):316-329. [43]PARASYRIS K,KELLER K,BAUTISTA-GOMEZ L,et al.Checkpoint Restart Support for Heterogeneous HPC Applications[C]//2020 20th IEEE/ACM International Symposium on Cluster,Cloud and Internet Computing(CCGRID).Melbourne,Australia:IEEE,2020:242-251. [44]ZHANG Y,GUO X,JIANG H,et al.A Checkpoint/RestartScheme for CUDA Applications with Complex Memory Hierarchy[C]//2013 14th ACIS International Conference on Software Engineering,Artificial Intelligence,Networking and Parallel/Distributed Computing.Honolulu,HI,USA:IEEE,2013:247-252. [45]POURGHASSEMI B,CHANDRAMOWLISHWARAN A.cu-daCR:An In-Kernel Application-Level Checkpoint/Restart Scheme for CUDA-Enabled GPUs[C]//2017 IEEE Interna-tional Conference on Cluster Computing(CLUSTER).Honolulu,HI,USA:IEEE,2017:725-732. [46]GARG R,MOHAN A,SULLIVAN M,et al.CRUM:Check-point-Restart Support for CUDA’s Unified Memory[C]//2018 IEEE International Conference on Cluster Computing(CLUSTER).Belfast:IEEE,2018:302-313. [47]CHIU M T,YOU Y P.CLPKM:A checkpoint-based preemptive multitasking framework for OpenCL kernels[J].Journal of Systems Architecture,2019,98:53-62. [48]CHEN G,ZHANG J,ZHU Z,et al.CRState:checkpoint/restart of OpenCL program for in-kernel applications[J].The Journal of Supercomputing,2021,77(6):5426-5467. [49]MOHROR K,MOODY A,BRONEVETSKY G,et al.Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System[J].IEEE Transactions on Parallel and Distributed Systems,2014,25(9):2255-2263. [50]MOODY A,BRONEVETSKY G,MOHROR K,et al.Design,Modeling,and Evaluation of a Scalable Multi-level Checkpoin-ting System[C]//2010 ACM/IEEE International Conference for High Performance Computing,Networking,Storage and Analysis.New Orleans,LA,USA:IEEE,2010:1-11. [51]DUARTE A,REXACHS D,LUQUE E.Increasing the cluster availability using RADIC[C]//2006 IEEE International Confe-rence on Cluster Computing.Barcelona:IEEE,2006:1-8. [52]CASTRO-LEÓN M,MEYER H,REXACHS D,et al.Fault to-lerance at system level based on RADIC architecture[J].Journal of Parallel and Distributed Computing,2015,86:98-111. [53]WONG A,HEYMANN E,REXACHS D,et al.Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints[J].IEEE Transactions on Parallel and Distributed Systems,2021,32(2):254-268. [54]WHITLOCK M,MORALES N,BOSILCA G,et al.Integrating process,control-flow,and data resiliency layers using a hybrid Fenix/Kokkos approach[C]//2022 IEEE International Confe-rence on Cluster Computing(CLUSTER).Heidelberg,Germany:IEEE,2022:418-428. [55]PARASYRIS K,GEORGAKOUDIS G,BAUTISTA-GOMEZL,et al.Co-Designing Multi-Level Checkpoint Restart for MPI Applications[C]//2021 IEEE/ACM 21st International Symposium on Cluster,Cloud and Internet Computing(CCGrid).Melbourne,Australia:IEEE,2021:103-112. [56]RODRÍGUEZ-PASCUAL M,CAO J,MORÍÑIGO J A,et al.Job migration in HPC clusters by means of checkpoint/restart[J].The Journal of Supercomputing,2019,75(10):6517-6541. [57]XU H,DE VECIANA G,LAU W C,et al.Online Job Scheduling with Redundancy and Opportunistic Checkpointing:A Speedup-Function-Based Analysis[J].IEEE Transactions on Parallel and Distributed Systems,2019,30(4):897-909. [58]PRADES J,SILLA F.GPU-Job Migration:The rCUDA Case[J].IEEE Transactions on Parallel and Distributed Systems,2019,30(12):2718-2729. [59]CHAUDHARY S,RAMJEE R,SIVATHANU M,et al.Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning[C]//Proceedings of the Fifteenth European Conference on Computer Systems.Heraklion Greece:ACM,2020:1-16. [60]FOSTER H.Wilson Research Group functional verificationstudy:IC/ASIC functional verification trend report[R].White Paper.Wilson Research Group and Mentor,A Siemens Business,2020. [61]ASAAD S,BELLOFATTO R,BREZZO B,et al.A cycle-accurate,cycle-reproducible multi-FPGA system for accelerating multi-core processor simulation[C]//Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays.Monterey California USA:ACM,2012:153-162. [62]ATTIA S,BETZ V.StateMover:Combining simulation andhardware execution for efficient FPGA debugging[C]//Procee-dings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays.2020:175-185. [63]ATTIA S,BETZ V.Toward Software-like Debugging for FP-GAs via Checkpointing and Transaction-based Co-Simulation[J].ACM Transactions on Reconfigurable Technology and Systems,2023,16(2):1-24. [64]ANTHONY Q,DAI D.Evaluating Multi-Level Checkpointingfor Distributed Deep Neural Network Training[C]//2021 SC Workshops Supplementary Proceedings(SCWS).St.Louis,MO,USA:IEEE,2021:60-67. [65]XU X,LIU H,TAO G,et al.Checkpointing and deterministictraining for deep learning[C]//Proceedings of the 1st International Conference on AI Engineering:Software Engineering for AI.Pittsburgh Pennsylvania:ACM,2022:65-76. |
[1] | CHEN Yiyang, WANG Xiaoning, YAN Xiaoting, LI Guanlong ZHAO Yining, LU Shasha, XIAO Haili. Study on High Performance Computing Container Checkpoint Technology Based on CRIU [J]. Computer Science, 2024, 51(9): 40-50. |
[2] | CHEN Yali, PAN Youlin, LIU Genggeng. Assembly Job Shop Scheduling Algorithm Based on Discrete Variable Neighborhood Mayfly Optimization [J]. Computer Science, 2024, 51(9): 283-289. |
[3] | ZHOU Wenhui, PENG Qinghua, XIE Lei. Study on Adaptive Cloud-Edge Collaborative Scheduling Methods for Multi-object State Perception [J]. Computer Science, 2024, 51(9): 319-330. |
[4] | REN Meixuan, DENG Peng, ZHAO Yue, WANG Xiaoyu, WANG Chao, DAI Haipeng, WU Libing. Safe Placement of Multi-antenna Wireless Chargers [J]. Computer Science, 2024, 51(8): 345-353. |
[5] | CHENG Andong, XIE Sijiang, LIU Ang, FENG Yimeng. Efficient Quantum-secure Byzantine Fault Tolerance Consensus Mechanism Based on HotStuff [J]. Computer Science, 2024, 51(8): 429-439. |
[6] | YANG Heng, LIU Qinrang, FAN Wang, PEI Xue, WEI Shuai, WANG Xuan. Study on Deep Learning Automatic Scheduling Optimization Based on Feature Importance [J]. Computer Science, 2024, 51(7): 22-28. |
[7] | HUANG Fei, LI Yongfu, GAO Yang, XIA Lei, LIAO Qinglong, DAI Jian, XIANG Hong. Scheduling Optimization Method for Household Electricity Consumption Based on Improved Genetic Algorithm [J]. Computer Science, 2024, 51(6A): 230600096-6. |
[8] | XU Haitao, CHENG Haiyan, TONG Mingwen. Study on Genetic Algorithm of Course Scheduling Based on Deep Reinforcement Learning [J]. Computer Science, 2024, 51(6A): 230600062-8. |
[9] | LI Danyang, WU Liangji, LIU Hui, JIANG Jingqing. Deep Reinforcement Learning Based Thermal Awareness Energy Consumption OptimizationMethod for Data Centers [J]. Computer Science, 2024, 51(6A): 230500109-8. |
[10] | WANG Tian, SHEN Wei, ZHANG Gongxuan, XU Linli, WANG Zhen, YUN Yu. Soft Real-time Cloud Service Request Scheduling and Multiserver System Configuration for ProfitOptimization [J]. Computer Science, 2024, 51(6A): 230900099-10. |
[11] | DONG Hao, ZHAO Hengtai, WANG Ziyao, YUAN Ye, ZHANG Aoqian. Parallel Transaction Execution Models Under Permissioned Blockchains [J]. Computer Science, 2024, 51(1): 124-132. |
[12] | LIU Chenwei, SUN Jian, LEI Bingbing, XU Tao, WU Zhuiwei. Task Scheduling Strategy for Energy Consumption Optimization of Cloud Data Center Based on Improved Particle Swarm Algorithm [J]. Computer Science, 2023, 50(7): 246-253. |
[13] | WANG Jiaxing, YANG Sijin, ZHUANG Lei, SONG Yu, YANG Xinyu. Multi-objective Online Hybrid Traffic Scheduling Algorithm in Time-sensitive Networks [J]. Computer Science, 2023, 50(7): 286-292. |
[14] | DENG Guanghong, ZHANG Qiheng. Container-based Scheduling Architecture for Mixed-Criticality Systems [J]. Computer Science, 2023, 50(6A): 220800215-5. |
[15] | DENG Shengnan, LUO Taiyu, HUANG Jingcai, REN Yuqing, SONG Wei, SU Chang, LEI Lili, HU Guanghui, XU Hong. Design and Implementation of Natural Gas Intelligent Scheduling Computer Platform System [J]. Computer Science, 2023, 50(6A): 220700258-7. |
|