Computer Science, 2023, Vol. 50, Issue (5): 64-71. doi: 10.11896/jsjkx.220100094
SUN Xuekai (孙雪凯), JIANG Liehui (蒋烈辉)
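Abstract: Code analysis has many applications, such as code plagiarism detection and software vulnerability search. With the development of artificial intelligence, neural network techniques have been widely applied to code analysis. However, existing methods either treat code as ordinary natural language text or sample code with overly complex rules. The former tends to lose key information in the code, while the latter makes the algorithm overly complex, so that model training takes a long time. Alon et al. proposed an algorithm named Code2vec, which adopts a simple and effective code representation and has significant advantages over earlier code analysis methods, but Code2vec still has some limitations. Building on it, this paper therefore proposes a neural-network-based code embedding method whose main idea is to represent a code function as an embedding vector. A code function is first decomposed into a series of abstract syntax tree (AST) paths; a neural network then learns how to represent each path; finally, all paths are aggregated into a single embedding vector that represents the function. A prototype system based on this method is implemented, and experimental results show that, compared with Code2vec, the proposed algorithm has a simpler structure and trains faster.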
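The three steps named in the abstract (decompose a function into AST paths, represent each path, aggregate the paths into one vector) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: it parses Python source with the standard-library `ast` module, the function names `leaf_paths`, `hashed_vector`, and `embed_function` are hypothetical, and the learned neural representation is replaced by an untrained hash-based vector with plain mean pooling. It is not the authors' network, only the shape of the pipeline.

```python
import ast
import hashlib
from itertools import combinations


def leaf_paths(source):
    """Return (start_token, path, end_token) triples for every pair of AST leaves."""
    leaves = []  # each entry: (token, chain of nodes from the root down to the leaf)

    def walk(node, ancestors):
        chain = ancestors + [node]
        # Identifiers, parameters, and literals are treated as leaves (assumption).
        if isinstance(node, ast.Name):
            leaves.append((node.id, chain))
        elif isinstance(node, ast.arg):
            leaves.append((node.arg, chain))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), chain))
        else:
            for child in ast.iter_child_nodes(node):
                walk(child, chain)

    walk(ast.parse(source), [])

    triples = []
    for (tok_a, chain_a), (tok_b, chain_b) in combinations(leaves, 2):
        # k is the length of the shared prefix; chain[k - 1] is the lowest common ancestor.
        k = 0
        while k < min(len(chain_a), len(chain_b)) and chain_a[k] is chain_b[k]:
            k += 1
        up = [type(n).__name__ for n in reversed(chain_a[k:])]
        top = type(chain_a[k - 1]).__name__
        down = [type(n).__name__ for n in chain_b[k:]]
        # "^" marks upward steps and "_" downward steps.
        path = "^".join(up + [top]) + "_" + "_".join(down)
        triples.append((tok_a, path, tok_b))
    return triples


def hashed_vector(text, dim=8):
    """Deterministic stand-in for a learned path embedding (untrained, dim <= 32)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:dim]]


def embed_function(source, dim=8):
    """Aggregate all path vectors of a function into one vector by mean pooling."""
    vectors = [hashed_vector("|".join(triple), dim) for triple in leaf_paths(source)]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


example = "def add(a, b):\n    return a + b\n"
for triple in leaf_paths(example):
    print(triple)
print(embed_function(example))
```

In a trained system the hash would be replaced by embeddings learned end to end, and the aggregation step could weight paths rather than average them; the abstract does not specify either detail, so the sketch leaves both at their simplest form.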
[1] ZHANG D, LUO P. Survey of code similarity detection methods and tools[J]. Computer Science, 2020, 47(3): 5-10.
[2] CHEN Q Y, LI S P, YAN M, et al. Code clone detection: A literature review[J]. Journal of Software, 2019, 30(4): 962-980.
[3] ALON U, ZILBERSTEIN M, LEVY O, et al. Code2vec: learning distributed representations of code[J]. Proceedings of the Programming Languages, 2019, 3(POPL): 1-29.
[4] SHI Z C, ZHOU Y. Method of Code Features Automated Extraction[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(3): 456-467.
[5] KAMIYA T, KUSUMOTO S, INOUE K. CCFinder: a multilinguistic token-based code clone detection system for large scale source code[J]. IEEE Transactions on Software Engineering, 2002, 28(7): 654-670.
[6] SAJNANI H, SAINI V, SVAJLENKO J, et al. SourcererCC: scaling code clone detection to big-code[C]//Proceedings of the 38th International Conference on Software Engineering. 2016: 1157-1168.
[7] JIANG L, MISHERGHI G, SU Z, et al. DECKARD: scalable and accurate tree-based detection of code clones[C]//International Conference on Software Engineering. IEEE, 2006.
[8] ZHOU Y, YAN X, YANG W, et al. Augmenting Java method comments generation with context information based on neural networks[J]. The Journal of Systems and Software, 2019, 156(Oct.): 328-340.
[9] YAN X, ZHOU Y, HUANG Z Q. Code snippets recommendation based on sequence to sequence model[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(5): 731-739.
[10] HU X, LI G, XIA X, et al. Deep code comment generation[C]//Proceedings of the 26th Conference on Program Comprehension. 2018: 200-210.
[11] WHITE M, TUFANO M, VENDOME C, et al. Deep learning code fragments for code clone detection[C]//Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. 2016: 87-98.
[12] WAN Y, ZHAO Z, YANG M, et al. Improving automatic source code summarization via deep reinforcement learning[C]//Proceedings of the 33rd IEEE/ACM International Conference on Automated Software Engineering. 2018: 397-407.
[13] MOU L L, LI G, ZHANG L, et al. Convolutional neural networks over tree structures for programming language processing[C]//Proceedings of the 30th AAAI Conference on Artificial Intelligence. 2016: 1287-1293.
[14] ALLAMANIS M, PENG H, SUTTON C. A convolutional attention network for extreme summarization of source code[C]//Proceedings of the 33rd International Conference on Machine Learning. 2016: 2091-2100.
[15] IYER S, KONSTAS I, CHEUNG A, et al. Summarizing source code using a neural attention model[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016.
[16] XU X J, LIU C, QIAN F, et al. Neural network-based graph embedding for cross-platform binary code similarity detection[C]//Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017: 363-376.
[17] XIONG H, YAN H H, GUO T, et al. Code similarity detection: a survey[J]. Computer Science, 2010, 37(8): 9-14, 76.
[18] DONALDSON J L, LANCASTER A M, SPOSATO P H. A plagiarism detection system[C]//ACM SIGCSE Bulletin. 1981: 21-25.
[19] ENGELS S, LAKSHMANAN V, CRAIG M. Plagiarism detection using feature-based neural networks[C]//ACM SIGCSE Bulletin. 2007: 34-38.
[20] RUBINSTEIN R. The cross-entropy method for combinatorial and continuous optimization[J]. Methodology and Computing in Applied Probability, 1999, 1(2): 127-190.
[21] ALON U, ZILBERSTEIN M, LEVY O, et al. A general path-based representation for predicting program properties[C]//Proceedings of the 39th ACM SIGPLAN Conference. ACM, 2018.
[22] KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv:1412.6980, 2017.
[23] SRIVASTAVA N, HINTON G, KRIZHEVSKY A, et al. Dropout: a simple way to prevent neural networks from overfitting[J]. Journal of Machine Learning Research, 2014, 15(1): 1929-1958.
[24] BENGIO Y, GLOROT X. Understanding the difficulty of training deep feedforward neural networks[C]//Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. 2010: 249-256.