Computer Science, 2020, Vol. 47, Issue (2): 245-250. doi: 10.11896/jsjkx.190500063

• Information Security •

Malware Name Recognition in Tweets Based on Enhanced BiLSTM-CRF Model

GU Xue-mei¹, LIU Jia-yong¹, CHENG Peng-sen¹,², HE Xiang¹

  1. School of Cybersecurity, Sichuan University, Chengdu 610000, China
  2. Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
  • Received: 2019-05-14  Online: 2020-02-15  Published: 2020-03-18
  • About author: GU Xue-mei, born in 1995, postgraduate, is not a member of the China Computer Federation (CCF). Her main research interests include information content security. LIU Jia-yong, born in 1962, Ph.D., professor, Ph.D. supervisor, is not a member of the China Computer Federation (CCF). His main research interests include information security and network communication and security.
  • Supported by:
    This work was supported by the Open Research Fund of the Key Laboratory of Network Assessment Technology of the Chinese Academy of Sciences (NST-18-001).

Abstract: Malware name recognition on Twitter is difficult because tweets are short and informal, the task involves only a single entity category, and malware names are often ambiguous. To address these problems, this paper proposes an entity recognition method based on BERT-BiLSTM-Self-attention-CRF that automatically recognizes malware names in tweets. Building on the BiLSTM-CRF model, BERT is used to encode context information, which improves the contextual semantic quality of the word embeddings and strengthens semantic disambiguation. In addition, a self-attention mechanism learns a weighted representation that captures the long-term relations between words and the sentence structure, improving recognition performance for the single entity category. To evaluate the proposed method, the authors constructed a labeled dataset of tweets containing malware name entities. Experimental results show that the proposed method attains 86.38% precision, 84.73% recall and an 85.55% F-score, and that it outperforms the baseline model, improving the F-score by 12.61%.
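The reported precision, recall and F-score can be checked against each other with a few lines of arithmetic. The sketch below assumes that "F-score" means the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_score` and its `beta` parameter are illustrative choices, not taken from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1).

    Assumption: the abstract's "F-score" is F1. Inputs may be fractions
    or percentages, as long as both use the same scale.
    """
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as zero.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Values reported in the abstract, in percent.
reported_precision = 86.38
reported_recall = 84.73

f1 = f_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}%")  # prints "F1 = 85.55%"
```

Under that assumption the computed value rounds to the 85.55% given in the abstract, so the three reported figures are mutually consistent.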

Key words: Class imbalance, Dynamic word embedding, Entity disambiguation, Importance weighting, Malware name recognition

CLC Number: 

  • TP391