Computer Science, 2021, Vol. 48, Issue (12): 8-16. doi: 10.11896/jsjkx.210100149
WANG Tao (王焘)1,2, ZHANG Shu-dong (张树东)3, LI An (李安)1, SHAO Ya-ru (邵亚茹)3, ZHANG Wen-bo (张文博)1,2
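Abstract: The microservice architecture splits a large, complex application into multiple independently deployable microservices that cooperate through lightweight communication mechanisms, enabling agile development and continuous delivery. However, an application comprises many microservices with complex invocation relationships, so a fault in one microservice can cause anomalies in the microservices that interact with it, which greatly increases the likelihood of application failure. Given many anomalous microservices and the way anomalies propagate, efficiently and accurately locating the faulty microservice that triggered them is a pressing problem. To address it, this paper proposes an anomaly-propagation-oriented fault diagnosis method for microservices. First, it monitors microservice metrics and the invocation behavior between microservices. Next, it uses regression analysis to build a regression model between metrics and API calls in order to detect anomalous microservices; in parallel, it builds a microservice dependency graph to characterize anomaly propagation between microservices. Finally, it derives a fault propagation subgraph from the service dependency graph and the set of anomalous services, and applies the PageRank algorithm to find the most likely root cause, i.e., the faulty microservice. Experimental results show that the method effectively detects anomalous services and accurately diagnoses faulty microservices while incurring low overhead.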
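The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the service names, the single metric per service, the fixed residual threshold, the caller-to-callee edge direction, and the damping factor are all assumptions made for illustration. It fits a one-variable least-squares line relating API call volume to a metric, flags services whose observed metric deviates from the prediction, restricts the call graph to anomalous services, and ranks them with a plain PageRank iteration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for metric = a * calls + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def is_anomalous(model, calls, metric, threshold):
    """Flag a service when the residual exceeds a fixed threshold (assumed)."""
    a, b = model
    return abs(metric - (a * calls + b)) > threshold


def propagation_subgraph(call_edges, anomalous):
    """Keep only call edges whose caller and callee are both anomalous."""
    return [(u, v) for u, v in call_edges if u in anomalous and v in anomalous]


def pagerank(nodes, edges, damping=0.85, iters=100):
    """Power-iteration PageRank; dangling mass is spread uniformly."""
    nodes = sorted(nodes)
    n = len(nodes)
    out = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank


# Hypothetical training history: (API call counts, metric values) per service.
history = {
    svc: ([10, 20, 30, 40], [25, 45, 65, 85])
    for svc in ["gateway", "order", "user", "db", "cache"]
}
# Hypothetical current observations: (API call count, metric value).
observed = {
    "gateway": (50, 140), "order": (50, 150), "db": (50, 190),
    "user": (50, 106), "cache": (50, 105),
}
# Hypothetical caller -> callee edges. Anomalies are assumed to spread from
# callee to caller, so rank flows along calls toward the likely root cause.
call_edges = [("gateway", "order"), ("gateway", "user"),
              ("order", "db"), ("user", "cache")]

models = {svc: fit_line(xs, ys) for svc, (xs, ys) in history.items()}
anomalous = {svc for svc, (calls, metric) in observed.items()
             if is_anomalous(models[svc], calls, metric, threshold=10.0)}
sub_edges = propagation_subgraph(call_edges, anomalous)
ranks = pagerank(anomalous, sub_edges)
root_cause = max(ranks, key=ranks.get)
print(root_cause)  # db
```

In this toy setup the anomalous services form the chain gateway, order, db, and the highest rank goes to the service at the end of the call chain, which matches the intuition that a faulty downstream service makes its callers look anomalous. A real deployment would need several metrics per service, a statistically derived threshold, and edge weights, none of which the abstract specifies.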