Computer Science ›› 2024, Vol. 51 ›› Issue (11): 65-72. doi: 10.11896/jsjkx.230900161

• Database & Big Data & Data Science •

Online Log Parsing Method Based on Bert and Adaptive Clustering

LU Jiawei1, LU Shida2, LIU Sisi2, WU Chengrong1

  1 School of Computer Science and Technology, Fudan University, Shanghai 200082, China
  2 Engineering Research Centre of Network Information Security Audit and Monitoring of Ministry of Education, Fudan University, Shanghai 200082, China
  • Received: 2023-09-28 Revised: 2024-03-13 Online: 2024-11-06 Published: 2024-11-15
  • Corresponding author: WU Chengrong (cwu@fudan.edu.cn)
  • About author: LU Jiawei (jwlu22@m.fudan.edu.cn), born in 2000, postgraduate. His main research interests include machine learning and cyberspace security.
    WU Chengrong, born in 1971, Ph.D., associate professor, master's supervisor, named one of Shanghai's Top Ten Young IT Talents in 2004, is a member of CCF (No. 23842M). His main research interest is cyberspace security.
  • Supported by:
    Cooperative Project between the Engineering Research Centre of Network Information Security Audit and Monitoring of Ministry of Education, Fudan University and the State Grid Shanghai Data Centre (09B307-9003001-0014-1).

Abstract: Log parsing is a technique for extracting useful information from raw log files; it can be used for system troubleshooting, performance analysis, and security auditing. The main challenges of log parsing are that log data is unstructured, diverse, and dynamic: different systems and applications may use different log formats, and those formats may change over time. This paper proposes BertLP, an online log parsing method that adapts automatically to different log sources and to changes in log format. It uses the pre-trained language model Bert together with an adaptive clustering algorithm to classify each word in a log as static or dynamic, and then groups the logs to generate log templates. BertLP requires neither manually defined log templates or regular expressions nor word frequency statistics; instead, it identifies log fields and types automatically by learning the semantic and structural features of log messages. Comparative experiments on multiple public log datasets show that BertLP improves log parsing accuracy by 6.1% over the best existing method and performs better on log parsing tasks.

Key words: Log parsing, Bert, Adaptive clustering, Semantic extraction
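
The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the general idea of online, similarity-based log grouping with template generation, not the authors' implementation. Several parts are assumptions made for illustration: `toy_embed` is a bag-of-tokens placeholder standing in for a Bert embedding, the fixed `threshold` stands in for the adaptive clustering rule (which the abstract does not specify), and the names `OnlineLogParser`, `merge_template`, and the `<*>` wildcard are hypothetical.

```python
import math
from collections import Counter

WILDCARD = "<*>"  # hypothetical marker for a dynamic (variable) token


def toy_embed(message):
    """Placeholder for a Bert embedding: a sparse bag-of-tokens vector."""
    return Counter(message.split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_template(template, tokens):
    """Keep tokens that agree position by position; mark the rest dynamic."""
    return [t if t == w else WILDCARD for t, w in zip(template, tokens)]


class OnlineLogParser:
    def __init__(self, embed=toy_embed, threshold=0.6):
        self.embed = embed          # any function: message -> Counter
        self.threshold = threshold  # fixed here; an assumption, not adaptive
        self.clusters = []          # dicts with "centroid", "template", "size"

    def parse(self, message):
        """Assign one message to a cluster and return the cluster template."""
        tokens = message.split()
        vec = self.embed(message)
        best, best_sim = None, -1.0
        for cluster in self.clusters:
            # Assumption: only compare against templates of equal length.
            if len(cluster["template"]) != len(tokens):
                continue
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < self.threshold:
            # No sufficiently similar cluster: start a new one.
            best = {"centroid": Counter(vec), "template": tokens, "size": 1}
            self.clusters.append(best)
        else:
            best["template"] = merge_template(best["template"], tokens)
            best["centroid"].update(vec)  # summed vector; cosine ignores scale
            best["size"] += 1
        return " ".join(best["template"])


if __name__ == "__main__":
    parser = OnlineLogParser()
    for line in [
        "Connection from 10.0.0.1 closed",
        "Connection from 10.0.0.2 closed",
        "Disk usage at 91 percent",
        "Connection from 10.0.0.7 closed",
    ]:
        parser.parse(line)
    print([" ".join(c["template"]) for c in parser.clusters])
    # prints ['Connection from <*> closed', 'Disk usage at 91 percent']
```

Because `embed` is a constructor argument, the placeholder can be swapped for a function that wraps a real language model, provided it returns a vector type that the similarity function accepts. Each message is processed once as it arrives, which is what makes the scheme online.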

CLC Number: 

  • TP181