Computer Science ›› 2023, Vol. 50 ›› Issue (6): 283-290. doi: 10.11896/jsjkx.220600131

• Information Security •

Function Level Code Vulnerability Detection Method of Graph Neural Network Based on Extended AST

GU Shouke, CHEN Wen

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
  • Received: 2022-06-13 Revised: 2023-03-17 Online: 2023-06-15 Published: 2023-06-06
  • Corresponding author: CHEN Wen (wenchen@scu.edu.cn)
  • About author: GU Shouke (gushouke@scu.edu.cn), born in 1998, postgraduate. His main research interests include graph neural networks, cyber security and vulnerability mining. CHEN Wen, born in 1983, Ph.D, associate professor, master supervisor, is a member of China Computer Federation. His main research interests include network security, information hiding and data mining.
  • Supported by:
    National Key Research and Development Program of China (020YFB1805405, 2019QY0800), National Natural Science Foundation of China (U1736212, 61872255, U19A2068) and Key Laboratory of Pattern Recognition and Intelligent Information Processing, Institutions of Higher Education of Sichuan Province (MSSB-2020-01).

Abstract: Software vulnerabilities are increasing year by year, and the resulting security problems are becoming more serious. Detecting vulnerabilities in source code at the delivery stage of a software project can effectively prevent security flaws from surfacing later at run time, and such detection depends on an effective code representation. Traditional representations based on software metrics correlate only weakly with vulnerabilities, so they struggle to capture vulnerability information. In recent years, machine learning has offered a new approach to intelligent vulnerability discovery, but it too can miss key code features. To address these problems, this paper adds control flow edges, data flow edges and next token edges to the traditional abstract syntax tree (AST) to generate an extended abstract syntax tree (EXAST) graph structure, which represents the source code so that its structural information can be processed more effectively, and uses the word vector embedding model Word2Vec to initialize the code information as numerical vectors that a machine can recognize and learn. In addition, the gated recurrent unit (GRU) is introduced into the traditional graph neural network (GNN) to build the graph recognition model. This alleviates vanishing gradients and strengthens the propagation of long-term information in the graph structure, which reinforces the temporal relationships of code execution and improves detection accuracy. Finally, the model is compared with other methods on the public SARD dataset for function-level code vulnerability detection. Compared with traditional vulnerability detection methods, accuracy and F1 score improve by up to 32.54% and 44.99, respectively. The experimental results confirm the effectiveness of the proposed method for code vulnerability detection.
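The graph construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Graph` class, the `build_exast` helper, the edge-type names and the hand-tokenized C-like statements are all hypothetical, and the "AST" is flattened to two levels (function body, statements, tokens) purely to show how control flow, data flow and next token edges are layered on top of the syntax edges.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy typed graph: node labels plus (src, dst, edge_type) triples."""
    labels: list = field(default_factory=list)
    edges: set = field(default_factory=set)

    def add_node(self, label):
        self.labels.append(label)
        return len(self.labels) - 1

    def add_edge(self, src, dst, kind):
        self.edges.add((src, dst, kind))


def build_exast(statements):
    """Build a simplified EXAST-like graph (illustrative assumption only).

    statements: list of (defined_variable_or_None, tokens) in source order.
    """
    g = Graph()
    root = g.add_node("FunctionBody")
    last_def = {}       # variable name -> node of its most recent definition
    prev_stmt = None
    prev_token = None
    for defined, tokens in statements:
        stmt = g.add_node("Statement")
        g.add_edge(root, stmt, "ast")
        if prev_stmt is not None:
            # consecutive statements are linked in execution order
            g.add_edge(prev_stmt, stmt, "control_flow")
        prev_stmt = stmt
        stmt_defs = {}
        for tok in tokens:
            node = g.add_node(tok)
            g.add_edge(stmt, node, "ast")
            if prev_token is not None:
                g.add_edge(prev_token, node, "next_token")
            prev_token = node
            if tok == defined and tok not in stmt_defs:
                stmt_defs[tok] = node          # first occurrence defines it
            elif tok in last_def:
                # a use of a variable defined by an earlier statement
                g.add_edge(last_def[tok], node, "data_flow")
        last_def.update(stmt_defs)
    return g


def edges_of(g, kind):
    """Return the (src, dst) pairs of one edge type."""
    return [(s, d) for s, d, k in g.edges if k == kind]


# Hand-tokenized stand-in for: char buf[8]; n = read_len(); memcpy(buf, src, n);
EXAMPLE = [
    ("buf", ["char", "buf", "[", "8", "]"]),
    ("n", ["n", "=", "read_len", "(", ")"]),
    (None, ["memcpy", "(", "buf", ",", "src", ",", "n", ")"]),
]

exast = build_exast(EXAMPLE)
print(sorted(edges_of(exast, "data_flow")))
```

In this toy graph the data flow edges connect the definitions of `buf` and `n` to their uses inside the `memcpy` call, which is the kind of cross-statement relationship a plain AST does not expose. A real pipeline would obtain the tree and the dependence information from a code analysis tool rather than from hand-written token lists.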

Key words: Vulnerability mining, Graph neural network, Deep learning, Abstract syntax tree, Gated recurrent unit
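The role of the gated recurrent unit in graph propagation, as outlined in the abstract, can be sketched as follows. This is a generic gated message-passing step in plain Python, not the paper's model: the function names, the two-dimensional node states, the unweighted sum over incoming neighbours and the random weights are assumptions chosen only to keep the example small.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_cell(a, h, p):
    """Standard GRU update of node state h given aggregated message a."""
    z = [sigmoid(x + y) for x, y in zip(matvec(p["Wz"], a), matvec(p["Uz"], h))]
    r = [sigmoid(x + y) for x, y in zip(matvec(p["Wr"], a), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(x + y)
            for x, y in zip(matvec(p["Wh"], a), matvec(p["Uh"], rh))]
    # the update gate z blends the old state with the candidate state
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def propagate(states, edges, p, steps=1):
    """Run `steps` rounds: sum incoming neighbour states, then GRU-update."""
    for _ in range(steps):
        messages = [[0.0] * len(s) for s in states]
        for src, dst in edges:
            for i, x in enumerate(states[src]):
                messages[dst][i] += x
        states = [gru_cell(a, h, p) for a, h in zip(messages, states)]
    return states


def random_params(dim, seed=0):
    """Hypothetical untrained weights, for illustration only."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    return {name: mat() for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}


params = random_params(dim=2)
chain = [(0, 1), (1, 2)]                      # node 0 -> node 1 -> node 2
start = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # only node 0 carries a signal
after_one = propagate(start, chain, params, steps=1)
after_two = propagate(start, chain, params, steps=2)
print(after_one[2], after_two[2])
```

With a chain of three nodes, the signal from node 0 reaches node 2 only after two propagation rounds, while the gates decide at each round how much of a node's previous state is kept. A trained model would learn the weights and could use separate weights per edge type, which this sketch omits.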

CLC Number: 

  • TP309