Computer Science, 2021, Vol. 48, Issue (12): 8-16. doi: 10.11896/jsjkx.210100149

• Computer Architecture •

Anomaly Propagation Based Fault Diagnosis for Microservices

WANG Tao1,2, ZHANG Shu-dong3, LI An1, SHAO Ya-ru3, ZHANG Wen-bo1,2   

  1 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  2 State Key Laboratory of Computer Sciences, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  3 Information Engineering College, Capital Normal University, Beijing 100048, China
  • Received:2021-01-19 Revised:2021-05-11 Online:2021-12-15 Published:2021-11-26
  • About author: WANG Tao, born in 1982, Ph.D., associate professor, master's supervisor, is a senior member of the China Computer Federation. His main research interests include fault diagnosis, software reliability, and microservices.
    ZHANG Shu-dong, born in 1969, Ph.D., professor, Ph.D. supervisor, is a senior member of the China Computer Federation. His main research interests include distributed computing and microservices.
  • Supported by:
    National Key Research and Development Project (2017YFB1400804), National Natural Science Foundation of China (61872344), Natural Science Foundation of Beijing (4182070), and Youth Innovation Promotion Association of the Chinese Academy of Sciences (2018144).

Abstract: A microservice architecture decomposes a large, complex application into multiple independent microservices. These microservices, built on various technology stacks, communicate over lightweight protocols to support agile development and continuous delivery. Because an application built this way contains a large number of microservices that communicate with one another, a fault in one microservice can cause anomalies in the microservices that interact with it. Detecting anomalous microservices and locating the root-cause microservice has therefore become key to ensuring the reliability of a microservice-based application. To address this issue, this paper proposes an anomaly propagation-based fault diagnosis approach for microservices that takes the propagation of faults into account. First, we monitor the interactions between microservices to construct a service dependency graph that characterizes anomaly propagation. Second, we construct a regression model between metrics and API calls to detect anomalous services. Third, we obtain the fault propagation subgraph by combining the service dependency graph with the detected anomalous services. Finally, we calculate the anomaly degree of each microservice with a PageRank algorithm to locate the most likely root cause of the fault. The experimental results show that our approach can locate faulty microservices with low overhead.
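
The four steps in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the regression form, the anomaly threshold, or the PageRank variant, so the single-variable least-squares fit, the 3-sigma residual test, the damping factor of 0.85, and all function and service names below are assumptions chosen for illustration only.

```python
from collections import defaultdict


def build_dependency_graph(calls):
    """Step 1: build caller -> {callees} from observed (caller, callee) pairs."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
        graph.setdefault(callee, set())  # make sure leaf services appear as nodes
    return dict(graph)


def fit_line(api_calls, metric):
    """Step 2a: least-squares fit metric ~ slope * api_calls + intercept (assumed form)."""
    n = len(api_calls)
    mx, my = sum(api_calls) / n, sum(metric) / n
    sxx = sum((x - mx) ** 2 for x in api_calls)
    slope = sum((x - mx) * (y - my) for x, y in zip(api_calls, metric)) / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(api_calls, metric)]
    sigma = (sum(r * r for r in residuals) / n) ** 0.5
    return slope, intercept, sigma


def is_anomalous(model, api_calls, metric, k=3.0):
    """Step 2b: flag a sample whose residual exceeds k * sigma (assumed threshold)."""
    slope, intercept, sigma = model
    return abs(metric - (slope * api_calls + intercept)) > k * max(sigma, 1e-9)


def propagation_subgraph(graph, anomalous):
    """Step 3: keep only anomalous services and the calls between them."""
    return {s: {t for t in graph[s] if t in anomalous}
            for s in graph if s in anomalous}


def pagerank(graph, damping=0.85, iterations=100):
    """Step 4: plain power-iteration PageRank; rank flows from caller to callee,
    so a service that many anomalous callers depend on scores highest."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing calls is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for t in graph[v]:
                new[t] += damping * rank[v] / len(graph[v])
        rank = new
    return rank


if __name__ == "__main__":
    # Hypothetical services and call records, for illustration only.
    calls = [("gateway", "orders"), ("gateway", "users"),
             ("orders", "db"), ("users", "db")]
    graph = build_dependency_graph(calls)
    sub = propagation_subgraph(graph, {"gateway", "orders", "db"})
    scores = pagerank(sub)
    print(sorted(scores, key=scores.get, reverse=True))
```

In this sketch the edges point from caller to callee, which encodes the assumption that an anomaly in a callee propagates upstream to its callers; ranking the subgraph therefore pushes score toward the service at the end of the anomalous call chain.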

Key words: Anomaly propagation, Fault diagnosis, Metric correlation, Microservices, Service invocation

CLC Number: 

  • TP311