Computer Science, 2020, Vol. 47, Issue (5): 295-300. doi: 10.11896/jsjkx.190800046

• Information Security •

Vulnerability Detection Using Bidirectional Long Short-term Memory Networks

GONG Kou-lin (龚扣林), ZHOU Yu (周宇), DING Li (丁笠), WANG Yong-chao (王永超)

  1. School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211100, China
    Ministry Key Laboratory for Safety-critical Software Development and Verification, Nanjing 211100, China
  • Received: 2019-08-09 Online: 2020-05-15 Published: 2020-05-19
  • Corresponding author: ZHOU Yu (zhouyu@nuaa.edu.cn)
  • About author: 506531906@qq.com
    GONG Kou-lin, born in 1995, postgraduate, is a member of the China Computer Federation. His main research interests include software evolution analysis and mining software repositories.
    ZHOU Yu, born in 1981, Ph.D., professor, is a member of the China Computer Federation. His main research interests include software evolution analysis, mining software repositories, software architecture, and reliability analysis.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61972197) and the Fundamental Research Funds for the Central Universities (NS2019055).

Abstract: As computer technology is applied ever more widely, the amount of software and the demand for it keep growing, and development becomes increasingly difficult. Code reuse and the complexity of the code itself inevitably introduce a large number of vulnerabilities into software. These vulnerabilities are hidden in massive code bases and are hard to find, but once exploited they can cause irreparable economic losses. To discover software vulnerabilities in time, this paper first extracts method bodies from the source code to form a method set, and builds an abstract syntax tree for each method in the set. The statements of each method are extracted with the help of the abstract syntax tree to form a statement set. Programmer-defined variable names, method names and strings in the statement set are replaced with uniform identifiers, and each statement is assigned a separate node number to form a node set. Second, data flow and control flow analysis are used to extract the data dependencies and control dependencies between nodes. Then, the node set extracted from the method body, the data dependencies and the control dependencies between nodes are combined into a feature representation of the method, which is further processed into a feature matrix using one-hot encoding. Finally, each matrix is labeled according to whether it contains a vulnerability to generate training samples, and a neural network is trained on them to obtain the vulnerability classification model. To better learn the context information of the sequence, a Bidirectional Long Short-Term Memory (BiLSTM) network is chosen, and an Attention layer is added on top of it to further improve model performance. In the experiments, the precision and recall of the vulnerability detection results reach 95.3% and 93.5% respectively, which confirms that the proposed method can detect security vulnerabilities in code fairly accurately.
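To make the preprocessing described in the abstract more concrete, here is a minimal Python sketch of two of its steps: replacing programmer-defined names and strings with uniform placeholders, and one-hot encoding the result into a feature matrix. It is an illustration under stated assumptions, not the authors' implementation. The regex tokenizer stands in for the abstract syntax tree, the `KEEP` set, the placeholder names `VAR1`/`FUN1`/`STR`, and token-level encoding are all hypothetical choices, and the data and control dependency extraction is omitted.

```python
import re

# Assumption: a small set of keywords and library calls that are NOT user-defined.
KEEP = {"int", "char", "if", "else", "for", "while", "return", "void", "strcpy"}
# Tokens: string literals, identifiers, integers, or any single other symbol.
TOKEN_RE = re.compile(r'"[^"\n]*"|[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]')


def normalize(statements):
    """Replace user-defined names and string literals with uniform placeholders."""
    names = {}  # original identifier -> placeholder, shared across the method
    normalized = []
    for stmt in statements:
        tokens = TOKEN_RE.findall(stmt)
        out = []
        for i, tok in enumerate(tokens):
            if tok.startswith('"'):
                out.append("STR")
            elif re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEEP:
                # An identifier followed by "(" is treated as a method name.
                is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
                prefix = "FUN" if is_call else "VAR"
                if tok not in names:
                    count = sum(1 for p in names.values() if p.startswith(prefix))
                    names[tok] = f"{prefix}{count + 1}"
                out.append(names[tok])
            else:
                out.append(tok)
        normalized.append(out)
    return normalized


def one_hot_matrix(token_rows):
    """Build a vocabulary and encode every token as one one-hot row."""
    vocab = sorted({tok for row in token_rows for tok in row})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for row in token_rows:
        for tok in row:
            vec = [0] * len(vocab)
            vec[index[tok]] = 1
            matrix.append(vec)
    return vocab, matrix


statements = ['char buf[10];', 'strcpy(buf, get_input("name"));']
tokens_per_statement = normalize(statements)
# Each statement gets its own node number, starting at 1.
nodes = {number: toks for number, toks in enumerate(tokens_per_statement, start=1)}
vocab, matrix = one_hot_matrix(tokens_per_statement)
```

In this sketch, `buf` becomes `VAR1` in both statements, so the same variable keeps the same placeholder across the method, while `get_input` becomes `FUN1` and the string literal becomes `STR`. A real pipeline would derive these roles from the syntax tree rather than from neighboring tokens.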

Key words: Attention, BiLSTM, Classification model, Feature representation, Vulnerability detection
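
The abstract reports precision and recall without defining them. Assuming the standard binary classification definitions apply (an assumption, since this page does not state them), with TP the vulnerable samples correctly flagged, FP the clean samples wrongly flagged, and FN the vulnerable samples missed, the two metrics are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```

Under these definitions, precision measures how many reported vulnerabilities are real, and recall measures how many real vulnerabilities are reported.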

CLC Number: 

  • TP305