Computer Science (计算机科学) ›› 2020, Vol. 47 ›› Issue (2): 245-250. doi: 10.11896/jsjkx.190500063

• Information Security •

Malware Name Recognition in Tweets Based on Enhanced BiLSTM-CRF Model

GU Xue-mei1, LIU Jia-yong1, CHENG Peng-sen1,2, HE Xiang1

  1. School of Cybersecurity, Sichuan University, Chengdu 610000, China
  2. Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
  • Received: 2019-05-14 Online: 2020-02-15 Published: 2020-03-18
  • Corresponding author: LIU Jia-yong (ljy@scu.edu.cn)
  • About author: GU Xue-mei, born in 1995, postgraduate, is not a member of the China Computer Federation (CCF). Her main research interests include information content security. LIU Jia-yong, born in 1962, Ph.D., professor, Ph.D. supervisor, is not a member of the China Computer Federation (CCF). His main research interests include information security and network communication and security.
  • Supported by:
    This work was supported by the Open Research Fund of the Key Laboratory of Network Assessment Technology of the Chinese Academy of Sciences (NST-18-001).

Abstract: Malware name recognition in tweets is hampered by short and informal text, a single entity category, and entity ambiguity. To address these problems, this paper proposes an entity recognition method based on BERT-BiLSTM-Self-attention-CRF that automatically recognizes malware names in tweets. Building on the BiLSTM-CRF model, BERT is used to encode the contextual information of each word, which improves the contextual semantic quality of the word embeddings and strengthens the model's semantic disambiguation ability. In addition, a self-attention mechanism learns the relations between words and the structural features of the sentence, and the resulting weighted representation assists the decoding of single-category entities, improving the recognition of malware name entities. For evaluation, a labeled dataset of tweets containing malware name entities was constructed. Experimental results show that the proposed method attains 86.38% precision, 84.73% recall, and an F1 score of 85.55%, and improves the F1 score by 12.61% over the baseline BiLSTM-CRF model.
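
The reported F1 score can be checked against the reported precision and recall, since F1 is the harmonic mean of the two. The following is a minimal Python sketch of that arithmetic; the function name `f1_score` is chosen here for illustration and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, in percent.
reported_precision = 86.38
reported_recall = 84.73

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}%")  # agrees with the reported 85.55%
```

The three reported figures are therefore mutually consistent to two decimal places.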

Key words: Class imbalance, Dynamic word embedding, Entity disambiguation, Importance weighting, Malware name recognition
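
The importance weighting named in the key words refers to the self-attention step, which gives each word a weighted combination of all word representations in the sentence. The sketch below illustrates that idea with plain scaled dot-product attention in pure Python. It is not the paper's implementation: the attention variant, the vector dimensions, and any learned projections used by the authors are not stated on this page, so the toy vectors, the function names, and the omission of learned query/key/value projections are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(hidden_states):
    """Scaled dot-product self-attention without learned projections.

    hidden_states: list of equal-length vectors, one per token
    (standing in for BiLSTM outputs; an assumption of this sketch).
    Returns (weights, outputs): weights[i][j] is how much token i
    attends to token j, and outputs[i] is the weighted sum of all vectors.
    """
    dim = len(hidden_states[0])
    scale = math.sqrt(dim)
    weights, outputs = [], []
    for query in hidden_states:
        # Similarity of this token to every token in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in hidden_states]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of all token vectors, one value per dimension.
        outputs.append([sum(w * vec[d] for w, vec in zip(row, hidden_states))
                        for d in range(dim)])
    return weights, outputs


# Toy 2-dimensional vectors for a three-token sentence (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the input vectors. In a tagging model, such weighted vectors would then be passed to the decoding layer.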

CLC Number: TP391