Computer Science ›› 2024, Vol. 51 ›› Issue (11): 65-72. doi: 10.11896/jsjkx.230900161

• Database & Big Data & Data Science •

Online Log Parsing Method Based on Bert and Adaptive Clustering

LU Jiawei¹, LU Shida², LIU Sisi², WU Chengrong¹

  1. School of Computer Science and Technology, Fudan University, Shanghai 200082, China
  2. Engineering Research Centre of Network Information Security Audit and Monitoring of Ministry of Education, Fudan University, Shanghai 200082, China
  • Received: 2023-09-28 Revised: 2024-03-13 Online: 2024-11-15 Published: 2024-11-06
  • About author: LU Jiawei, born in 2000, postgraduate. His main research interests include machine learning and cyberspace security.
    WU Chengrong, born in 1971, Ph.D., associate professor, master’s supervisor, named a 2004 Shanghai Youth IT Top Ten New Talent, and a member of CCF (No. 23842M). His main research interest is cyberspace security.
  • Supported by:
    Engineering Research Centre of Network Information Security Audit and Monitoring of Ministry of Education, Fudan University and State Grid Shanghai Data Centre’s Cooperative Project (09B307-9003001-0014-1).

Abstract: Log parsing is a technique for extracting valid information from raw log files, with applications in system troubleshooting, performance analysis, and security auditing. Its main challenge is that log data are unstructured, diverse, and dynamic: different systems and applications may use different log formats, and those formats may change over time. This paper therefore proposes BertLP, an online log parsing method that automatically adapts to different log sources and to variations in log format. It combines the pre-trained language model BERT with an adaptive clustering algorithm to classify the words in each log as static or dynamic, and then groups the logs to generate log templates. Rather than relying on manually defined log templates or regular expressions, or on word frequency counts, BertLP identifies log fields and their types automatically by learning the semantic and structural features of log messages. Comparative experiments on public log datasets show that BertLP improves log parsing accuracy by 6.1% over the best available method.
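
As an illustration of what a log template is, the following minimal Python sketch derives one from a group of logs. It is not BertLP: it assumes the logs have already been grouped and have the same number of whitespace-separated tokens, and it decides static versus dynamic by simple position-wise comparison rather than by BERT embeddings and adaptive clustering. The names `extract_template` and `WILDCARD` are hypothetical and chosen only for this example.

```python
from typing import List

WILDCARD = "<*>"  # hypothetical placeholder for a dynamic (variable) token


def extract_template(group: List[str]) -> str:
    """Derive a template from log messages already assigned to one group.

    Assumption for this sketch: a token position is static if every message
    has the same token there; otherwise it is dynamic and becomes WILDCARD.
    """
    if not group:
        raise ValueError("group must contain at least one log message")
    tokenized = [message.split() for message in group]
    length = len(tokenized[0])
    if any(len(tokens) != length for tokens in tokenized):
        raise ValueError("this sketch assumes equal token counts in a group")
    template = []
    # zip(*tokenized) yields one tuple per token position across all messages
    for column in zip(*tokenized):
        template.append(column[0] if len(set(column)) == 1 else WILDCARD)
    return " ".join(template)


logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.7 closed",
    "Connection from 192.168.1.5 closed",
]
print(extract_template(logs))  # Connection from <*> closed
```

The position-wise comparison stands in for the semantic static/dynamic recognition described in the abstract; a learned model could also handle groups whose messages differ in length.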

Key words: Log parsing, Bert, Adaptive clustering, Semantic extraction

CLC Number: 

  • TP181