Computer Science (计算机科学), 2025, Vol. 52, Issue (9): 128-136. doi: 10.11896/jsjkx.240700171
赵一宁, 王小宁, 牛铁, 赵毅, 肖海力
ZHAO Yining, WANG Xiaoning, NIU Tie, ZHAO Yi, XIAO Haili
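Abstract: As supercomputing systems continue to grow in scale, the probability that their compute nodes suffer faults and anomalies rises accordingly, seriously undermining the operational stability of these systems. Traditional approaches to handling faults mostly rely on after-the-fact response and remediation, which can recover only part of the loss, whereas predicting faults and anomalies in advance leaves more time to react and respond. Such prediction has therefore gradually become one of the research hotspots in fault response. To this end, this paper proposes a node fault and anomaly prediction method for supercomputing systems, aiming to improve operational stability and reduce wasted computing resources. The method first analyzes the system's historical operation data and labels anomalies using an unsupervised approach combined with a small amount of manual assistance. Based on these anomalies, it discovers the associated precursor operating features in the raw operation data, and then builds a prediction model for node faults and anomalies using machine learning. In cross-validation on the original dataset, the method achieves a precision of 78% and a recall of about 90%, while also ensuring sufficient lead time. The dataset used for validation consists of raw operation data from a real supercomputing system, which demonstrates that the method is applicable in practice.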
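To make the reported metrics concrete, the following is a minimal sketch of how precision, recall, and lead time might be computed for node failure alarms. It is not the authors' evaluation code. The `Event` type, the `evaluate_alarms` function, the `min_lead` and `horizon` parameters, and the toy data are all hypothetical names and values introduced here for illustration. The sketch assumes an alarm counts as correct when the same node fails within a fixed window after the alarm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical record: an alarm or a failure on one node."""
    node: str
    time: float  # seconds since an arbitrary reference point


def evaluate_alarms(alarms, failures, min_lead=0.0, horizon=3600.0):
    """Score alarms against observed failures (illustrative assumption).

    An alarm is a true positive if the same node fails between
    `min_lead` and `horizon` seconds after the alarm. A failure is
    covered if at least one such alarm precedes it.
    """
    def matches(alarm, failure):
        lead = failure.time - alarm.time
        return alarm.node == failure.node and min_lead <= lead <= horizon

    true_alarms = [a for a in alarms if any(matches(a, f) for f in failures)]
    covered = [f for f in failures if any(matches(a, f) for a in alarms)]
    leads = [f.time - a.time
             for a in alarms for f in failures if matches(a, f)]

    precision = len(true_alarms) / len(alarms) if alarms else 0.0
    recall = len(covered) / len(failures) if failures else 0.0
    mean_lead = sum(leads) / len(leads) if leads else 0.0
    return precision, recall, mean_lead


# Toy data, invented for this example only.
alarms = [Event("n1", 100.0), Event("n2", 200.0), Event("n3", 50.0)]
failures = [Event("n1", 700.0), Event("n3", 5000.0)]

precision, recall, mean_lead = evaluate_alarms(
    alarms, failures, min_lead=60.0, horizon=3600.0
)
print(f"precision={precision:.2f} recall={recall:.2f} lead={mean_lead:.0f}s")
```

In this toy run only the alarm on `n1` precedes a failure inside the window, so one of three alarms is correct and one of two failures is covered. Requiring a minimum lead time in the matching rule is one way to reflect the abstract's point that a prediction is only useful if it arrives early enough to act on.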