Computer Science ›› 2025, Vol. 52 ›› Issue (9): 128-136. doi: 10.11896/jsjkx.240700171

• High Performance Computing •

Node Failure and Anomaly Prediction Method for Supercomputing Systems

ZHAO Yining, WANG Xiaoning, NIU Tie, ZHAO Yi, XIAO Haili   

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2024-07-25 Revised: 2024-10-18 Online: 2025-09-15 Published: 2025-09-11
  • About author: ZHAO Yining, born in 1983, Ph.D., senior engineer. His main research interests include data analysis, high-performance computing and distributed systems.
  • Supported by:
    Strategic Priority Research Program of Chinese Academy of Sciences (XDB0500103).

Abstract: As the scale of supercomputing systems continues to expand, the probability of computing node failures and anomalies also increases, seriously affecting system stability. Traditional fault-handling methods mainly rely on post-event response and remedial policies, which can only partially recover the losses. Predicting node failures and anomalies in advance provides more time for response and handling, and has therefore become a research hotspot. This paper proposes a node failure and anomaly prediction method for improving the stability of supercomputing systems and reducing the waste of computing resources. The method analyzes the historical running data of the system and labels anomalies through unsupervised methods plus a small amount of manual assistance. Running features that precede and correlate with these anomalies are then discovered in the original dataset, and prediction models are established using machine learning methods. In cross-validation over the original dataset, the prediction method achieves a precision of over 78% and a recall of around 90%, while also ensuring sufficient lead time. The dataset used in the evaluation comes from the raw running data of a real supercomputing system, which demonstrates the applicability of the proposed method.
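
The abstract reports precision, recall and lead time as its evaluation measures. As a minimal sketch of how such measures are commonly computed, the following Python snippet uses the standard definitions of precision and recall plus a mean alarm-to-anomaly gap. The `Outcome` record layout, the function names and the toy data are assumptions made for illustration; they are not taken from the paper and do not reproduce its evaluation procedure or its results.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One evaluated time window for one node (hypothetical record layout)."""
    predicted: bool       # the model raised an alarm for this window
    actual: bool          # an anomaly really followed
    lead_minutes: float   # alarm-to-anomaly gap; only used for true positives


def precision_recall(outcomes):
    """Return (precision, recall) using the standard definitions."""
    tp = sum(1 for o in outcomes if o.predicted and o.actual)
    fp = sum(1 for o in outcomes if o.predicted and not o.actual)
    fn = sum(1 for o in outcomes if not o.predicted and o.actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_lead_time(outcomes):
    """Average lead time over correctly predicted anomalies only."""
    leads = [o.lead_minutes for o in outcomes if o.predicted and o.actual]
    return sum(leads) / len(leads) if leads else 0.0


# Toy data invented for illustration only (not from the paper's dataset).
toy = [
    Outcome(True, True, 30.0),    # true positive
    Outcome(True, True, 10.0),    # true positive
    Outcome(True, False, 0.0),    # false alarm
    Outcome(False, True, 0.0),    # missed anomaly
    Outcome(False, False, 0.0),   # true negative
]
precision, recall = precision_recall(toy)
lead = mean_lead_time(toy)
```

Under cross-validation, the same counts would typically be accumulated over the held-out fold of each split before the ratios are taken; how the paper aggregates across folds is not stated in the abstract.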

Key words: Data analysis, Anomaly prediction, Running feature, Prediction model

CLC Number: TP311