Computer Science ›› 2021, Vol. 48 ›› Issue (12): 8-16. doi: 10.11896/jsjkx.210100149

• Computer Architecture

Original title (Chinese): 一种面向异常传播的微服务故障诊断方法

Authors (Chinese): 王焘, 张树东, 李安, 邵亚茹, 张文博

  • Corresponding author: ZHANG Shu-dong (zsd@cnu.edu.cn)
  • About the author: wangtao@iscas.ac.cn

Anomaly Propagation Based Fault Diagnosis for Microservices

WANG Tao1,2, ZHANG Shu-dong3, LI An1, SHAO Ya-ru3, ZHANG Wen-bo1,2   

  1 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  2 State Key Laboratory of Computer Sciences, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  3 Information Engineering College, Capital Normal University, Beijing 100048, China
  • Received: 2021-01-19 Revised: 2021-05-11 Online: 2021-12-15 Published: 2021-11-26
  • About author: WANG Tao, born in 1982, Ph.D., associate professor, master's supervisor, is a senior member of China Computer Federation. His main research interests include fault diagnosis, software reliability, and microservices.
    ZHANG Shu-dong, born in 1969, Ph.D., professor, Ph.D. supervisor, is a senior member of China Computer Federation. His main research interests include distributed computing and microservices.
  • Supported by:
    National Key Research and Development Project (2017YFB1400804), National Natural Science Foundation of China (61872344), Natural Science Foundation of Beijing (4182070), and Youth Innovation Promotion Association of the Chinese Academy of Sciences (2018144).

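Abstract (Chinese version): A microservice architecture splits a large, complex application into many independently deployable microservices that cooperate through lightweight communication mechanisms, which enables agile development and continuous delivery. However, such an application contains many microservices with complex invocation relationships, so a fault in one microservice causes the microservices that interact with it to behave anomalously as well, greatly increasing the likelihood that the application fails. Faced with many anomalous microservices, and given that anomalies propagate, efficiently and accurately locating the faulty microservice that triggered them has become an urgent problem. To address it, this paper proposes an anomaly-propagation-oriented fault diagnosis method for microservices. First, the method monitors microservice metrics and the invocation behavior between microservices. Next, it uses regression analysis to build a regression model between metrics and API calls in order to detect anomalous microservices; at the same time, it builds a microservice dependency graph to characterize anomaly propagation between microservices. Finally, it derives a fault propagation subgraph from the service dependency graph and the set of anomalous services, and uses the PageRank algorithm to find the most likely root cause, that is, the faulty microservice. Experimental results show that the method detects anomalous services effectively, diagnoses faulty microservices accurately, and incurs low overhead.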
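The detection step (a regression model between metrics and API calls) can be illustrated with a minimal sketch. The abstract does not give the form of the paper's model, so everything below is an assumption made for illustration: a single predictor (API calls per interval), a linear fit by ordinary least squares, a 3-sigma residual threshold, and made-up sample data. The function names `fit_line`, `residual_std`, and `is_anomalous` are hypothetical.

```python
from statistics import mean, pstdev


def fit_line(calls, metric):
    """Ordinary least squares for metric ~ slope * calls + intercept."""
    cx, cy = mean(calls), mean(metric)
    sxx = sum((x - cx) ** 2 for x in calls)
    sxy = sum((x - cx) * (y - cy) for x, y in zip(calls, metric))
    slope = sxy / sxx
    return slope, cy - slope * cx


def residual_std(calls, metric, slope, intercept):
    """Standard deviation of the residuals on the training window."""
    return pstdev(y - (slope * x + intercept) for x, y in zip(calls, metric))


def is_anomalous(calls, metric, slope, intercept, sigma, k=3.0):
    """Flag an observation whose residual exceeds k * sigma (k is an assumption)."""
    return abs(metric - (slope * calls + intercept)) > k * sigma


# Hypothetical training window: API calls per interval and CPU usage (%).
train_calls = [10, 20, 30, 40, 50, 60]
train_cpu = [12.1, 21.8, 32.3, 41.9, 52.2, 61.7]

slope, intercept = fit_line(train_calls, train_cpu)
sigma = residual_std(train_calls, train_cpu, slope, intercept)

print(is_anomalous(35, 37.2, slope, intercept, sigma))  # metric tracks the load
print(is_anomalous(35, 80.0, slope, intercept, sigma))  # metric far above the fit
```

The idea in this sketch is that a healthy service's resource metric follows its workload, so a large residual suggests that something other than the load is driving the metric.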


Abstract: A microservice architecture separates a large, complex application into multiple independent microservices. These microservices, which may use different technology stacks, communicate through lightweight protocols to support agile development and continuous delivery. Because an application built on a microservice architecture has a large number of microservices communicating with each other, a faulty microservice can cause the microservices that interact with it to exhibit anomalies as well. Detecting anomalous microservices and locating the root-cause microservice has therefore become key to ensuring the reliability of a microservice-based application. To address this issue, this paper proposes an anomaly propagation-based fault diagnosis approach for microservices that takes the propagation of faults into account. First, we monitor the interactions between microservices to construct a service dependency graph that characterizes anomaly propagation. Second, we construct a regression model between metrics and API calls to detect anomalous services. Third, we obtain the fault propagation subgraph by combining the service dependency graph with the detected anomalous services. Finally, we calculate the anomaly degree of each microservice with a PageRank algorithm to locate the most likely root cause of the fault. The experimental results show that our approach can locate faulty microservices with low overhead.
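The last two steps (restricting the dependency graph to the anomalous services, then ranking them with PageRank) can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the abstract does not say which PageRank variant is used, so the sketch uses plain power iteration with a damping factor of 0.85, lets rank flow along caller-to-callee edges (on the assumption that a callee's fault propagates back to its callers), and uses an invented five-service call graph. The names `propagation_subgraph` and `pagerank` are hypothetical.

```python
def propagation_subgraph(calls, anomalous):
    """Keep only call edges whose caller and callee are both anomalous."""
    return {s: [t for t in calls.get(s, []) if t in anomalous] for s in anomalous}


def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank flows from each caller to its callees."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for t in graph[v]:
                new_rank[t] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


# Hypothetical call graph: each service maps to the services it invokes.
calls = {
    "gateway": ["orders", "users"],
    "orders": ["inventory", "payments"],
    "payments": ["inventory"],
    "users": [],
    "inventory": [],
}
# Hypothetical output of the detection step.
anomalous = {"gateway", "orders", "payments", "inventory"}

sub = propagation_subgraph(calls, anomalous)
ranks = pagerank(sub)
suspects = sorted(ranks, key=ranks.get, reverse=True)
print(suspects[0])  # the anomalous service that the other anomalous services depend on most
```

In this toy graph every anomalous service depends, directly or transitively, on `inventory`, so it collects the most rank and is reported as the most likely root cause.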

Key words: Fault diagnosis, Microservices, Service invocation, Metric correlation, Anomaly propagation

CLC Number:

  • TP311