Computer Science, 2013, Vol. 40, Issue (Z6): 302-306.

• Information Security •

Fault Location Technology for Distributed Systems Based on Event Processing

DU Cui-lan, TAN Jian-long, WANG Xiao-yan, ZHANG Yu, LIU Ping, FAN Dong-jin

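  1. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029; Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093; Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093; Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093; Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093; National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029
  • Publication date: 2018-11-16 Release date: 2018-11-16
  • Supported by:
    This work was supported by the National "242" Information Security Program of China (2010A029) and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200).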

Fault Location Technology Based on the Distributed Event Processing System

DU Cui-lan, TAN Jian-long, WANG Xiao-yan, ZHANG Yu, LIU Ping and FAN Dong-jin

  • Online: 2018-11-16 Published: 2018-11-16

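Abstract: In recent years, distributed computing systems have grown ever larger, and their behavior has become ever more complex and difficult to control. The faults arising in these systems have grown exponentially and caused very serious harm and losses, and troubleshooting and locating a fault once a problem occurs has become increasingly difficult. The traditional approach of tracing a program's execution to judge whether it runs correctly incurs excessive overhead in exchanging distributed monitoring information and is highly intrusive to the target program, so it can no longer meet the needs of software behavior analysis. In distributed monitoring environments where events occur in large volumes, rapidly, and without interruption, promptly detecting and locating system faults through complex event processing has become particularly urgent. Complex event processing can use meaningful state-change events to analyze system behavior, determine the system's operating condition, detect and locate faults promptly, and keep the system running healthily. Most existing complex event description languages describe complex events with SQL-based methods. Such data stream query languages are complicated for ordinary users and hard to master. This paper builds a set-based event stream model that defines events formally, represents events as sets, and defines corresponding operations, so that users need to master only a few simple set operations to define complex fault rules.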

Keywords: distributed network, real-time monitoring system, fault location

Abstract: In recent years, distributed computing systems have become larger and more complex to control. System faults are growing exponentially, resulting in serious harm and loss, and troubleshooting and fault location are becoming more difficult. The traditional method of tracing program execution to judge whether a program runs correctly consumes too many resources when exchanging distributed monitoring information and is highly invasive to the target program, so it can hardly meet the demands of software behavior analysis. Finding and locating faults in time through complex event processing is especially urgent in distributed monitoring environments where events occur in large numbers, rapidly, and without interruption. It can use meaningful state-change events to analyze system behavior, judge the system's operating condition, detect and locate faults in time, and ensure healthy operation. Most existing complex event description languages describe complex events with SQL-based methods. Such a data stream query language is complex for ordinary users and difficult to master. By constructing a set-based event stream model, we define events formally, represent them as sets, and define the corresponding operations. The user only needs to master a few simple set operations to define complex fault rules.

Key words: Distributed network, Real-time monitoring system, Fault location
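
The abstract gives no concrete syntax for the set-based event stream model, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes an event is a record with a type, an emitting host, and a timestamp, and that a fault rule is built from a few set operations. The names `Event`, `select`, `hosts`, and `within`, the event types, and the sample stream are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; frozen so instances can be set members."""
    kind: str   # event type name (assumed, e.g. "disk_error")
    host: str   # node that emitted the event
    ts: float   # timestamp in seconds


def select(events, kind):
    """Return the subset of events whose type equals `kind`."""
    return {e for e in events if e.kind == kind}


def hosts(events):
    """Project a set of events onto the set of hosts that emitted them."""
    return {e.host for e in events}


def within(a, b, window):
    """Return pairs (x, y) with x in a, y in b, on the same host,
    where y occurs no earlier than x and at most `window` seconds later."""
    return {(x, y) for x in a for y in b
            if x.host == y.host and 0 <= y.ts - x.ts <= window}


# Hypothetical monitoring stream, modeled as a set of events.
stream = {
    Event("disk_error", "node-1", 10.0),
    Event("heartbeat_lost", "node-1", 14.0),
    Event("disk_error", "node-2", 11.0),
    Event("heartbeat_lost", "node-3", 12.0),
    Event("heartbeat_lost", "node-2", 90.0),
}

disk = select(stream, "disk_error")
lost = select(stream, "heartbeat_lost")

# Rule 1: hosts that reported both event types (plain set intersection).
both = hosts(disk) & hosts(lost)

# Rule 2: hosts where a disk error is followed by a lost heartbeat within 5 s.
faulty = {x.host for x, _ in within(disk, lost, 5.0)}

print(sorted(both))    # ['node-1', 'node-2']
print(sorted(faulty))  # ['node-1']
```

In this sketch a rule is an ordinary set expression, so composing rules needs only union, intersection, difference, and a time-window pairing rather than an SQL-style query.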

