Computer Science ›› 2017, Vol. 44 ›› Issue (Z11): 448-452. doi: 10.11896/j.issn.1002-137X.2017.11A.095

• Big Data and Data Mining •

Research on Template Extraction Based on Large-scale Network Log

CUI Yuan and ZHANG Zhuo

  1. School of Information and Software Engineering, Northeast Normal University, Changchun 130117; Engineering Research Center for Digital Learning Support Technology, Ministry of Education, Changchun 130117
  • Online: 2018-12-01   Published: 2018-12-01

Abstract: Extracting network events directly from large-scale network logs is difficult. To address this problem, a template extraction method for large-scale network logs is proposed. The method automatically converts massive, raw network logs into log templates, providing important groundwork for understanding the root causes of network events and for preventing network failures. First, the structure of the logs is analyzed, and the words in a log are divided into two classes: template words and parameter words. Then, log template extraction is studied from three different angles. Finally, using actual production data from an Internet company, the Rand_index method is used to evaluate the accuracy and effectiveness of the three extraction methods. The results show that, across four different message types collected from a service cluster, the log templates extracted by the signature tree (tag recognition tree) model reach an average accuracy of 99.57%, higher than the accuracy of both the statistical template extraction model and the online template extraction model.

Key words: Word segmentation, Template extraction, Statistical clustering, Signature tree, Online clustering
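
The abstract names Rand_index as the evaluation measure but does not show how it is computed. The following is a minimal sketch of the standard Rand index, assuming each log line carries a ground-truth template label and a label assigned by an extraction method. The function name `rand_index` and the sample labels are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Standard Rand index: fraction of item pairs on which two labelings agree.

    A pair "agrees" when both labelings put the two items in the same group,
    or both put them in different groups.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two items are needed to form a pair")

    agree = 0
    total = 0
    for i, j in combinations(range(n), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
        total += 1
    return agree / total


# Hypothetical data: five log lines, their true templates, and the
# templates assigned by some extraction method (labels are made up).
truth = ["login", "login", "link_down", "link_down", "cpu"]
extracted = ["t1", "t1", "t2", "t3", "t4"]
score = rand_index(truth, extracted)
print(f"Rand index: {score:.2f}")
```

In this made-up example the method splits the two `link_down` lines into separate templates, so one of the ten pairs disagrees. The pairwise loop is quadratic in the number of log lines, so for large logs a contingency-table formulation or sampling would be needed.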

