Computer Science, 2025, Vol. 52, Issue (9): 128-136. doi: 10.11896/jsjkx.240700171

• High Performance Computing •

Node Failure and Anomaly Prediction Method for Supercomputing Systems

ZHAO Yining, WANG Xiaoning, NIU Tie, ZHAO Yi, XIAO Haili

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2024-07-25 Revised: 2024-10-18 Online: 2025-09-15 Published: 2025-09-11
  • Corresponding author: ZHAO Yining (zhaoyn@sccas.cn)
  • About author: ZHAO Yining, born in 1983, Ph.D, senior engineer. His main research interests include data analysis, high-performance computing and distributed systems.
  • Supported by:
    Strategic Priority Research Program of Chinese Academy of Sciences (XDB0500103).

Abstract: As the scale of supercomputing systems continues to expand, the probability of failures and anomalies on their computing nodes rises accordingly, seriously affecting the operational stability of these systems. Traditional fault-handling methods rely mainly on after-the-fact response and remediation, which can recover only part of the loss. Predicting failures and anomalies in advance provides more time to react and handle them, and has therefore become a research hotspot in fault response. This paper proposes a node failure and anomaly prediction method for supercomputing systems, aiming to improve operational stability and reduce the waste of computing resources. The method first analyzes the historical running data of the system and labels anomalies using unsupervised methods combined with a small amount of manual assistance. Based on these anomalies, correlated preceding running features are discovered in the original running data, and prediction models for node failures and anomalies are then built using machine learning methods. In cross-validation on the original dataset, the method achieves a precision of over 78% and a recall of about 90%, while also ensuring sufficient lead time. The dataset used in the evaluation consists of raw running data from a real supercomputing system, demonstrating the applicability of the proposed method.

Key words: Data analysis, Anomaly prediction, Running feature, Prediction model
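
The abstract reports precision, recall, and lead time but does not spell out how a prediction is matched to a failure. The following is a minimal sketch of one plausible way to score alarms against labeled failures, not the authors' evaluation procedure. The names `Event` and `evaluate_alarms`, and the `min_lead` and `horizon` parameters, are hypothetical. The matching rule is also an assumption: an alarm counts as correct when the same node fails within the given time window after it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """A predicted alarm or a labeled failure on one node (hypothetical type)."""
    node: str
    time: float  # seconds since an arbitrary reference point


def evaluate_alarms(
    alarms: List[Event],
    failures: List[Event],
    min_lead: float = 60.0,    # assumed minimum useful warning time, in seconds
    horizon: float = 3600.0,   # assumed maximum look-ahead, in seconds
) -> Tuple[float, float, float]:
    """Return (precision, recall, mean lead time) under the assumed matching rule."""

    def matches(alarm: Event, failure: Event) -> bool:
        # The alarm must precede the failure on the same node by a lead
        # time that falls inside [min_lead, horizon].
        lead = failure.time - alarm.time
        return alarm.node == failure.node and min_lead <= lead <= horizon

    # Alarms that are followed by at least one matching failure.
    true_alarms = [a for a in alarms if any(matches(a, f) for f in failures)]
    # Failures that were announced by at least one matching alarm.
    caught = [f for f in failures if any(matches(a, f) for a in alarms)]

    precision = len(true_alarms) / len(alarms) if alarms else 0.0
    recall = len(caught) / len(failures) if failures else 0.0

    # Lead time of a caught failure: distance to its earliest matching alarm.
    leads = [
        max(f.time - a.time for a in alarms if matches(a, f))
        for f in caught
    ]
    mean_lead = sum(leads) / len(leads) if leads else 0.0
    return precision, recall, mean_lead
```

Under this rule, an alarm raised too close to the failure (less than `min_lead` ahead) counts as a false alarm, which ties the precision figure to the requirement of sufficient lead time.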

CLC Number: TP311