Computer Science (计算机科学) ›› 2019, Vol. 46 ›› Issue (7): 86-90. doi: 10.11896/j.issn.1002-137X.2019.07.013

• Information Security •

Study on Malicious Program Detection Based on Recurrent Neural Network

WANG Le-le1, WANG Bin-qiang1, LIU Jian-gang2, ZHANG Jian-hui1, MIAO Qi-guang3

  1. National Digital Switching System Engineering and Technological Research Center, Zhengzhou 450000, China
  2. Nanjing Information Technology Institute, Nanjing 210000, China
  3. Department of Computer Science, Xidian University, Xi’an 710071, China
  • Received: 2018-09-08 Online: 2019-07-15 Published: 2019-07-15
  • About the authors: WANG Le-le (born 1985), female, Ph.D. candidate; main research interest: information security; E-mail: 635718080@qq.com. WANG Bin-qiang (born 1963), male, professor and Ph.D. supervisor; main research interest: network security. LIU Jian-gang (born 1968), male, researcher; main research interest: information security. ZHANG Jian-hui (born 1977), male, Ph.D., associate researcher; main research interest: broadband information networks; E-mail: ndsczjh@163.com (corresponding author). MIAO Qi-guang (born 1972), male, Ph.D., professor, CCF member; main research interests: machine learning and high-performance computing.

Abstract: To address the low efficiency of traditional malicious program detection and its limited capacity for automated analysis, this paper studies the use of recursive neural networks for detecting and classifying malicious programs in a deep learning setting. First, the Quick Emulator (QEMU) is used to capture the APIs that a malicious program calls at run time, together with their parameter sequences; after behavior abstraction, these form the feature sequence of the program. Next, a hierarchical log-bilinear language model (HLBL) maps the feature sequence to fixed-length word vectors, which are assembled into the input matrix required by the recursive neural network (RNN). Training the recursive neural network yields a multi-level semantic aggregate model of malicious programs, which performs the classification and detection. Experimental results show that the recursive neural network model detects malicious programs effectively, with a detection rate 17% higher than that of traditional machine learning algorithms. In particular, after tensors are introduced and the Recursive Neural Tensor Network (RNTN) model is adopted, the reduction in the overall number of parameters and amount of computation raises the detection rate by a further 7% over the RNN model. The results indicate that recursive neural network models can handle the detection and classification of malicious programs in big-data environments.

Key words: Hierarchical log-bilinear language model (HLBL), Multi-level semantic aggregate model, Quick emulator (QEMU), Recursive neural network, Word vector

CLC Number: TP393
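The composition step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tree shape, the dimension `d`, the random weights, the `tanh` nonlinearity, and all function names are assumptions made for illustration, and random vectors stand in for HLBL-trained word vectors. The RNTN composition follows the general form used for recursive neural tensor networks, in which a bilinear tensor term is added to the linear term of the plain recursive network.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def compose_rnn(a, b, W, bias):
    """Plain recursive composition: p = tanh(W [a; b] + bias)."""
    ab = a + b  # list concatenation, length 2d
    return [math.tanh(z + c) for z, c in zip(matvec(W, ab), bias)]


def compose_rntn(a, b, W, bias, V):
    """Tensor composition: p_k = tanh([a;b]^T V[k] [a;b] + (W [a;b])_k + bias_k)."""
    ab = a + b
    linear = matvec(W, ab)
    out = []
    for k in range(len(linear)):
        bilinear = sum(x * y for x, y in zip(ab, matvec(V[k], ab)))
        out.append(math.tanh(bilinear + linear[k] + bias[k]))
    return out


def encode(tree, embed, compose):
    """Fold a nested-tuple binary tree of tokens into one vector, bottom-up."""
    if isinstance(tree, str):
        return embed[tree]
    left, right = tree
    return compose(encode(left, embed, compose), encode(right, embed, compose))


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4  # assumed embedding size

# Illustrative abstracted API behaviors; random vectors stand in for HLBL embeddings.
tokens = ["CreateFile", "WriteFile", "RegSetValue", "CreateProcess"]
embed = {t: [rng.uniform(-1, 1) for _ in range(d)] for t in tokens}

W = random_matrix(d, 2 * d, rng)
bias = [0.0] * d
V = [random_matrix(2 * d, 2 * d, rng) for _ in range(d)]  # d slices of size 2d x 2d

# Assumed tree: adjacent calls are paired, then the pairs are merged.
tree = (("CreateFile", "WriteFile"), ("RegSetValue", "CreateProcess"))

root_rnn = encode(tree, embed, lambda a, b: compose_rnn(a, b, W, bias))
root_rntn = encode(tree, embed, lambda a, b: compose_rntn(a, b, W, bias, V))
print(root_rnn)
print(root_rntn)
```

In a full pipeline, the root vector would be passed to a classifier (for example a softmax layer) and all weights would be learned by backpropagation through the tree; both steps are omitted here.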