Computer Science, 2023, Vol. 50, Issue (5): 64-71. doi: 10.11896/jsjkx.220100094

• Explainable Artificial Intelligence •

Code Embedding Method Based on Neural Network

SUN Xuekai, JIANG Liehui

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, PLA Information Engineering University, Zhengzhou 450001, China
  • Received: 2022-01-11 Revised: 2022-09-19 Online: 2023-05-06 Published: 2023-05-15
  • Corresponding author: JIANG Liehui (jiangliehui@163.com)
  • About author: SUN Xuekai (sunxuekai1226@163.com), born in 1991, Ph.D. candidate. His main research interests include code similarity detection and code vulnerability mining.
    JIANG Liehui, born in 1967, Ph.D., professor. His main research interests include computer architecture, reverse engineering, and cyberspace security.

Abstract: Code analysis has many applications, such as code plagiarism detection and software vulnerability search. With the development of artificial intelligence, neural networks have been widely applied to code analysis. However, existing methods either treat code as ordinary natural-language text or sample it with overly complex rules. The former tends to lose key information in the code, while the latter makes the algorithm overly complicated and model training slow. Alon et al. proposed Code2vec, an algorithm that uses a simple and effective code representation and has significant advantages over earlier code analysis methods, but Code2vec still has some limitations. Building on Code2vec, this paper proposes a neural-network-based code embedding method whose main idea is to represent a code function as an embedding vector. First, a function is decomposed into a series of abstract syntax tree (AST) paths; then a neural network learns how to represent each path; finally, all paths are aggregated into a single embedding vector that represents the function. A prototype system based on this method is implemented. Experimental results show that, compared with Code2vec, the proposed algorithm has a simpler structure and trains faster.

Key words: Neural network, Code embedding, Code analysis, Abstract syntax tree, Code classification
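
The three steps named in the abstract (decompose a function into AST paths, represent each path, aggregate the paths into one vector) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: Python's standard `ast` module stands in for whatever parser the paper uses, the helper names (`terminal_paths`, `path_contexts`, `toy_vector`, `embed_function`) are hypothetical, the per-path vectors are untrained pseudo-random placeholders where a neural network would supply learned representations, and mean pooling is an assumed aggregation rule.

```python
import ast
import itertools
import random

# Nodes treated as terminals (leaves that carry a token). Assumed choice.
TERMINALS = (ast.Name, ast.Constant, ast.arg)


def token(node):
    """Return the surface token carried by a terminal node."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    return repr(node.value)  # ast.Constant


def terminal_paths(tree):
    """Collect the root-to-terminal node list for every terminal in the tree."""
    found = []

    def walk(node, prefix):
        prefix = prefix + [node]
        if isinstance(node, TERMINALS):
            found.append(prefix)
            return  # do not descend below a terminal
        for child in ast.iter_child_nodes(node):
            walk(child, prefix)

    walk(tree, [])
    return found


def path_contexts(source, max_length=8):
    """Step 1: decompose code into (start token, AST path, end token) triples."""
    tree = ast.parse(source)
    contexts = []
    for a, b in itertools.combinations(terminal_paths(tree), 2):
        k = 0
        while a[k] is b[k]:  # walk down the shared prefix of both paths
            k += 1
        up = [type(n).__name__ for n in reversed(a[k:-1])]
        lca = type(a[k - 1]).__name__  # lowest common ancestor
        down = [type(n).__name__ for n in b[k:-1]]
        path = up + [lca] + down
        if len(path) <= max_length:
            contexts.append((token(a[-1]), "|".join(path), token(b[-1])))
    return contexts


def toy_vector(key, dim=8):
    """Step 2 placeholder: a fixed pseudo-random vector per string.

    A trained network would produce these representations instead.
    """
    rng = random.Random(key)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_function(source, dim=8):
    """Step 3: aggregate all path contexts into one vector (mean pooling, assumed)."""
    contexts = path_contexts(source)
    if not contexts:
        return [0.0] * dim
    total = [0.0] * dim
    for start, path, end in contexts:
        parts = (toy_vector(start, dim), toy_vector(path, dim), toy_vector(end, dim))
        for i in range(dim):
            total[i] += sum(p[i] for p in parts) / 3
    return [x / len(contexts) for x in total]


if __name__ == "__main__":
    example = "def add(a, b):\n    return a + b\n"
    for context in path_contexts(example):
        print(context)
    print(embed_function(example))
```

Mean pooling is used here only because it is the simplest aggregation that yields a fixed-size vector regardless of how many paths a function has; the abstract does not state which aggregation the proposed method uses.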

CLC Number: 

  • TP311
[1]ZHANG D,LUO P.Survey of code similarity detection methods and tools[J].Computer Science,2020,47(3):5-10.
[2]CHEN Q Y,LI S P,YAN M,et al.Code clone detection:A literature review[J].Journal of Software,2019,30(4):962-980.
[3]ALON U,ZILBERSTEIN M,LEVY O,et al.Code2vec:learning distributed representations of code[J].Proceedings of the ACM on Programming Languages,2019,3(POPL):1-29.
[4]SHI Z C,ZHOU Y.Method of Code Features Automated Extraction[J].Journal of Frontiers of Computer Science and Technology,2021,15(3):456-467.
[5]KAMIYA T,KUSUMOTO S,INOUE K.CCFinder:a multilinguistic token-based code clone detection system for large scale source code[J].IEEE Transactions on Software Engineering,2002,28(7):654-670.
[6]SAJNANI H,SAINI V,SVAJLENKO J,et al.SourcererCC:scaling code clone detection to big-code[C]//Proceedings of the 38th International Conference on Software Engineering.2016:1157-1168.
[7]JIANG L,MISHERGHI G,SU Z,et al.DECKARD:scalable and accurate tree-based detection of code clones[C]//International Conference on Software Engineering.IEEE,2006.
[8]ZHOU Y,YAN X,YANG W,et al.Augmenting Java method comments generation with context information based on neural networks[J].The Journal of Systems and Software,2019,156(Oct.):328-340.
[9]YAN X,ZHOU Y,HUANG Z Q.Code snippets recommendation based on sequence to sequence model[J].Journal of Frontiers of Computer Science and Technology,2020,14(5):731-739.
[10]HU X,LI G,XIA X,et al.Deep code comment generation[C]//Proceedings of the 26th Conference on Program Comprehension.2018:200-210.
[11]WHITE M,TUFANO M,VENDOME C,et al.Deep learning code fragments for code clone detection[C]//Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering.2016:87-98.
[12]WAN Y,ZHAO Z,YANG M,et al.Improving automatic source code summarization via deep reinforcement learning[C]//Proceedings of the 33rd IEEE/ACM International Conference on Automated Software Engineering.2018:397-407.
[13]MOU L L,LI G,ZHANG L,et al.Convolutional neural networks over tree structures for programming language processing[C]//Proceedings of the 30th AAAI Conference on Artificial Intelligence.2016:1287-1293.
[14]ALLAMANIS M,PENG H,SUTTON C.A convolutional attention network for extreme summarization of source code[C]//Proceedings of the 33rd International Conference on Machine Learning.2016:2091-2100.
[15]IYER S,KONSTAS I,CHEUNG A,et al.Summarizing source code using a neural attention model[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.2016.
[16]XU X J,LIU C,QIAN F,et al.Neural network-based graph embedding for cross-platform binary code similarity detection[C]//Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security.2017:363-376.
[17]XIONG H,YAN H H,GUO T,et al.Code similarity detection:a survey[J].Computer Science,2010,37(8):9-14,76.
[18]DONALDSON J L,LANCASTER A M,SPOSATO P H.A plagiarism detection system[C]//ACM SIGCSE Bulletin.1981:21-25.
[19]ENGELS S,LAKSHMANAN V,CRAIG M.Plagiarism detection using feature-based neural networks[C]//ACM SIGCSE Bulletin.2007:34-38.
[20]RUBINSTEIN R.The cross-entropy method for combinatorial and continuous optimization[J].Methodology and Computing in Applied Probability,1999,1(2):127-190.
[21]ALON U,ZILBERSTEIN M,LEVY O,et al.A general path-based representation for predicting program properties[C]//Proceedings of the 39th ACM SIGPLAN Conference.ACM,2018.
[22]KINGMA D P,BA J.Adam:a method for stochastic optimization[J].arXiv:1412.6980,2017.
[23]SRIVASTAVA N,HINTON G,KRIZHEVSKY A,et al.Dropout:a simple way to prevent neural networks from overfitting[J].Journal of Machine Learning Research,2014,15(1):1929-1958.
[24]GLOROT X,BENGIO Y.Understanding the difficulty of training deep feedforward neural networks[C]//Proceedings of the 13th International Conference on Artificial Intelligence and Statistics.2010:249-256.