Computer Science ›› 2023, Vol. 50 ›› Issue (6): 330-337. doi: 10.11896/jsjkx.220700073

• Information Security •

Cybersecurity Threat Intelligence Mining Algorithm for Open Source Heterogeneous Data

WEI Tao, LI Zhihua, WANG Changjie, CHENG Shunhang

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2022-07-07 Revised: 2022-09-06 Online: 2023-06-15 Published: 2023-06-06
  • Corresponding author: LI Zhihua (jswxzhli@aliyun.com)
  • Author e-mail: 6201924168@stu.jiangnan.edu.cn
  • About author: WEI Tao, born in 1998, postgraduate. His main research interests include information system analysis and information security. LI Zhihua, born in 1969, Ph.D., professor, master supervisor. His main research interests include the key technologies and information security of the end-edge-cloud, and its intersection with cutting-edge disciplines such as artificial intelligence.
  • Supported by:
    Intelligent Manufacturing Project of the Ministry of Industry and Information Technology (ZH-XZ-180004) and Fundamental Research Funds for the Central Universities of Ministry of Education of China (JUSRP211A41, JUSRP42003).

Abstract: To address the problem of efficiently mining threat intelligence from open source cybersecurity reports, a threat intelligence mining method based on a threat intelligence named entity recognition (TI-NER) algorithm, called TI-NER-based intelligence mining (TI-NER-IM), is proposed. Firstly, IoT security reports from roughly the past 10 years are collected and annotated to construct a threat intelligence entity recognition dataset. Secondly, to address the shortcomings of traditional entity recognition models in mining threat intelligence indicators of compromise (IoC), a threat intelligence entity identification model based on self-attention mechanism and character embedding (TIEI-SMCE) is proposed. The model fuses character embedding information and then uses a self-attention mechanism to capture features such as the latent dependency weights between words and their context, so that threat intelligence IoC entities are identified accurately. Thirdly, a TI-NER algorithm based on the TIEI-SMCE model is proposed. Finally, the model and algorithm are integrated into the TI-NER-IM method, which automatically mines threat intelligence IoC entities from unstructured and semi-structured cybersecurity reports. Experimental results show that, compared with the BERT-BiLSTM-CRF model, the F1 value of TI-NER-IM increases by 1.43%.

Key words: Threat intelligence mining, Natural language processing, Entity extraction, Indicators of compromise (IoC)
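
The abstract describes fusing character embedding information with word representations and then applying self-attention to capture dependency weights between words. The following is a minimal sketch of that general idea only, not the TIEI-SMCE model itself: the toy `char_embedding` function, the hand-written `word_vecs` values, and the choice of Q = K = V without learned projections are all illustrative assumptions that do not come from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def char_embedding(token, dim=4):
    """Toy character-level vector (assumption): normalized counts of
    characters bucketed by code point. A trained model would learn this."""
    vec = [0.0] * dim
    for ch in token:
        vec[ord(ch) % dim] += 1.0
    return [v / len(token) for v in vec]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors.
    Learned projection matrices are omitted for brevity (assumption)."""
    scale = math.sqrt(len(vectors[0]))
    outputs, weights = [], []
    for q in vectors:
        # Similarity of this token to every token in the sentence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        attn = softmax(scores)
        weights.append(attn)
        # Context vector: attention-weighted sum of all token vectors.
        outputs.append([sum(w * v[i] for w, v in zip(attn, vectors))
                        for i in range(len(q))])
    return outputs, weights

tokens = ["malware", "contacts", "198.51.100.7"]
# Hypothetical 2-d word embeddings; real ones would come from a pretrained model.
word_vecs = {
    "malware": [0.9, 0.1],
    "contacts": [0.2, 0.7],
    "198.51.100.7": [0.1, 0.3],
}
# Fuse word-level and character-level information by concatenation.
fused = [word_vecs[t] + char_embedding(t) for t in tokens]
contextual, attn_weights = self_attention(fused)
```

Each row of `attn_weights` sums to 1 and indicates how strongly one token attends to the others. In a full entity recognizer, the `contextual` vectors would be passed to a tagging layer that assigns IoC labels to each token.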

CLC Number:

  • TP393.08