计算机科学 ›› 2022, Vol. 49 ›› Issue (11A): 211000186-6.doi: 10.11896/jsjkx.211000186
梁瑶, 谢春丽, 王文捷
LIANG Yao, XIE Chun-li, WANG Wen-jie
摘要: 近年来,代码相似性检测一直是软件工程领域的热点问题,它可以帮助代码克隆检测、代码缺陷预测等,降低软件维护成本。目前流行的代码相似性检测方法大多是借用自然语言处理方法从符号(Token)、抽象语法树(Abstract Syntax Tree,AST)等代码表征中提取源代码的文本、语法、结构等特征信息,将其映射为连续空间的实值向量,然后通过直接计算提取特征的欧氏距离、余弦值,或通过浅层神经网络模型获得代码的相似值,这些方法取得了优于传统程序静态分析的效果。但这些方法大多数是基于源代码语法层面的检测技术,未充分利用源代码的语义信息。Doc2Vec和Word2Vec虽然能够挖掘代码的词汇语义信息,但对代码的执行语义信息无能为力,针对这一问题,提出了使用控制流程图(Control Flow Graph,CFG)来表示代码的执行语义,并使用基于随机游走(Random Walk)的图嵌入方法来学习和推理代码的语义信息,进而判断源代码的功能相似性。实验结果表明,和Doc2Vec以及Word2Vec方法相比,该模型能够较精确地检测出源代码的功能相似性,其F1值相较于Doc2Vec和Word2Vec方法分别提高了16.01%和18.72%。
中图分类号:
[1]BAKER B S.On finding duplication and near-duplication in large software systems[C]//Proceedings of 2nd Working Conference on Reverse Engineering.IEEE,1995:86-95. [2]DUCASSE S,RIEGER M,DEMEYER S.A language independent approach for detecting duplicated code[C]//Proceedings IEEE International Conference on Software Maintenance-1999(ICSM’99)[C]//Software Maintenance for Business Change.IEEE,1999:109-118. [3]ROY C K,CORDY J R.NICAD:Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization[C]//2008 16th iEEE International Conference on Program Comprehension.IEEE,2008:172-181. [4]KAMIYA T,KUSUMOTO S,INOUE K.CCFinder:A multilinguistic token-based code clone detection system for large scale source code[J].IEEE Transactions on Software Engineering,2002,28(7):654-670. [5]LIVIERI S,HIGO Y,MATUSHITA M,et al.Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder:D-CCFinder[C]//29th International Conference on Software Engineering(ICSE’07).IEEE,2007:106-115. [6]LI Z,LU S,MYAGMAR S,et al.CP-Miner:Finding copy-paste and related bugs in large-scale software code[J].IEEE Transactions on Software Engineering,2006,32(3):176-192. [7]LI L,FENG H,ZHUANG W,et al.Cclearner:A deep learning-based clone detection approach[C]//2017 IEEE International Conference on Software Maintenance and Evolution(ICSME).IEEE,2017:249-260. [8]BAXTER I D,YAHIN A,MOURA L,et al.Clone detection using abstract syntax trees[C]//Proceedings.International Conference on Software Maintenance.IEEE,1998:368-377. [9]JIANG L,MISHERGHI G,SU Z,et al.Deckard:Scalable and accurate tree-based detection of code clones[C]//29th International Conference on Software Engineering(ICSE’07).IEEE,2007:96-105. [10]WEI H,LI M.Supervised Deep Features for Software Func-tional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code[C]//IJCAI.2017:3034-3040. [11]MAYRAND J,LEBLANC C,MERLO E M.Experiment on the automatic detection of function clones in a software system using metrics[C]//International Conference on Software Maintenance.IEEE,1996:244-253. [12]KONTOGIANNIS K,GALLER M,DEMORI R.Detecting code similarity using patterns[C]//Working Notes of 3rd Workshop on AI and Software Engineering.1995:68-73. [13]KOMONDOOR R,HORWITZ S.Using slicing to identify duplication in source code[C]//International static analysis symposium.Springer,Berlin,Heidelberg,2001:40-56. [14]KRINKE J.Identifying similar code with program dependence graphs[C]//Proceedings Eighth Working Conference on Reverse Engineering.IEEE,2001:301-309. [15]HUMMEL B,JUERGENS E,HEINEMANN L,et al.Index-based code clone detection:incremental,distributed,scalable[C]//2010 IEEE International Conference on Software Maintenance.IEEE,2010:1-9. [16]ALON U,ZILBERSTEIN M,LEVY O,et al.code2vec:Learning distributed representations of code[J].arXiv:1803,09473,2019. [17]ROY D,PANDA P,ROY K.Tree-cnn:A deep convolutional neural network for lifelong learning[J].arXiv:1802.05800,2018. [18]DEFREEZ D,THAKUR A V,RUBIO-GONZÁLEZ C.Path-based function embedding and its application to specification mining[J].arXiv:1802.07779,2018. [19]PEROZZI B,AL-RFOU R,SKIENA S.Deepwalk:Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2014:701-710. [20]ZHANG J,WANG X,ZHANG H,et al.A novel neural source code representation based on abstract syntax tree[C]//2019 IEEE/ACM 41st International Conference on Software Engineering(ICSE).IEEE,2019:783-794. [21]WANG W,LI G,SHEN S,et al.Modular tree network forsource code representation learning[J].ACM Transactions on Software Engineering and Methodology(TOSEM),2020,29(4):1-23. [22]BUI N D Q,YU Y,JIANG L.Infercode:Self-supervised learning of code representations by predicting subtrees[C]//2021 IEEE/ACM 43rd International Conference on Software Engineering(ICSE).IEEE,2021:1186-1197. [23]LE Q,MIKOLOV T.Distributed representations of sentences and documents[C]//International Conference on Machine Learning.PMLR,2014:1188-1196. [24]MIKOLOV T,SUTSKEVER I,CHEN K,et al.DistributedRepresentations of Words and Phrases and their Compositiona-lity[C]//Advances in Neural Information Processing Systems.2013:3111-3119. [25]ZHAO G,HUANG J.Deepsim:deep learning code functionalsimilarity[C]//Proceedings of the 2018 26th ACM Joint Mee-ting on European Software Engineering Conference and Sympo-sium on the Foundations of Software Engineering.2018:141-151. [26]WANG W,LI G,MA B,et al.Detecting code clones with graph neural network and flow-augmented abstract syntax tree[C]//2020 IEEE 27th International Conference on Software Analysis,Evolution and Reengineering(SANER).IEEE,2020:261-271. [27]KARNALIM O.Syntax trees and information retrieval to improve code similarity detection[C]//Proceedings of the Twenty-Second Australasian Computing Education Conference.2020:48-55. [28]HAAS R,NIEDERMAYR R,RÖHM T,et al.RecommendingUnnecessary Source Code Based on Static Analysis[C]//2019 IEEE/ACM 41st International Conference on Software Engineering:Companion Proceedings(ICSE-Companion).IEEE,2019:274-275. [29]ALON U,ZILBERSTEIN M,LEVY O,et al.A general path-based representation for predicting program properties[J].ACM SIGPLAN Notices,2018,53(4):404-419. [30]ZHANG J,WANG X,ZHANG H,et al.A novel neural source code representation based on abstract syntax tree[C]//2019 IEEE/ACM 41st International Conference on Software Engineering(ICSE).IEEE,2019:783-794. [31]CHEN Q Y,LI S P,YAN M,et al.Code clone detection:A literature review[J].Ruan Jian Xue Bao/Journal of Software,2019,30(4):962-980. |
[1] | 李勇, 吴京鹏, 张钟颖, 张强. 融合快速注意力机制的节点无特征网络链路预测算法 Link Prediction for Node Featureless Networks Based on Faster Attention Mechanism 计算机科学, 2022, 49(4): 43-48. https://doi.org/10.11896/jsjkx.210800276 |
[2] | 杨辉, 陶力宏, 朱建勇, 聂飞平. 基于锚点的快速无监督图嵌入 Fast Unsupervised Graph Embedding Based on Anchors 计算机科学, 2022, 49(4): 116-123. https://doi.org/10.11896/jsjkx.210200098 |
[3] | 富坤, 郭云朋, 禚佳明, 李佳宁, 刘琪. 语义增强的完全不平衡标签网络表示学习算法 Semantic Information Enhanced Network Embedding with Completely Imbalanced Labels 计算机科学, 2022, 49(11): 109-116. https://doi.org/10.11896/jsjkx.210900101 |
[4] | 刘扬, 郑文萍, 张川, 王文剑. 一种基于局部随机游走的标签传播算法 Local Random Walk Based Label Propagation Algorithm 计算机科学, 2022, 49(10): 103-110. https://doi.org/10.11896/jsjkx.220400145 |
[5] | 方磊, 武泽慧, 魏强. 二进制代码相似性检测技术综述 Summary of Binary Code Similarity Detection Techniques 计算机科学, 2021, 48(5): 1-8. https://doi.org/10.11896/jsjkx.200400085 |
[6] | 邢长征, 朱金侠, 孟祥福, 齐雪月, 朱尧, 张峰, 杨一鸣. 兴趣点推荐方法研究综述 Point-of-interest Recommendation:A Survey 计算机科学, 2021, 48(11A): 176-183. https://doi.org/10.11896/jsjkx.201100021 |
[7] | 刘丹, 赵森, 颜志良, 赵静, 王会青. 基于堆叠自动编码器的miRNA-疾病关联预测方法 miRNA-disease Association Prediction Model Based on Stacked Autoencoder 计算机科学, 2021, 48(10): 114-120. https://doi.org/10.11896/jsjkx.200900169 |
[8] | 张虎, 周晶晶, 高海慧, 王鑫. 融合节点结构和内容的网络表示学习方法 Network Representation Learning Method on Fusing Node Structure and Content 计算机科学, 2020, 47(12): 119-124. https://doi.org/10.11896/jsjkx.190900027 |
[9] | 唐家琪, 吴璟莉, 廖元秀, 王金艳. 基于双加权投票的蛋白质功能预测 Prediction of Protein Functions Based on Bi-weighted Vote 计算机科学, 2019, 46(4): 222-227. https://doi.org/10.11896/j.issn.1002-137X.2019.04.035 |
[10] | 赵倩倩,吕敏,许胤龙. 基于两种子结构感知的社交网络Graphlets采样估计算法 Estimating Graphlets via Two Common Substructures Aware Sampling in Social Networks 计算机科学, 2019, 46(3): 314-320. https://doi.org/10.11896/j.issn.1002-137X.2019.03.046 |
[11] | 尹欣红, 赵世燕, 陈晓云. 带偏置的信号传播的随机游走的社团检测算法 Community Detection Algorithm Based on Random Walk of Signal Propagation with Bias 计算机科学, 2019, 46(12): 45-55. https://doi.org/10.11896/jsjkx.190700051 |
[12] | 刘庆烽, 刘哲, 宋余庆, 朱彦. 基于约束随机游走的肿瘤图像分割方法 Tumor Image Segmentation Method Based on Random Walk with Constraint 计算机科学, 2018, 45(7): 243-247. https://doi.org/10.11896/j.issn.1002-137X.2018.07.042 |
[13] | 肖迎元,张红玉. 基于用户潜在特征的社交网络好友推荐方法 Friend Recommendation Method Based on Users’ Latent Features in Social Networks 计算机科学, 2018, 45(3): 218-222. https://doi.org/10.11896/j.issn.1002-137X.2018.03.034 |
[14] | 卿勇,刘梦娟,银盈,李杨曦. SMART:一种面向电商平台快速消费品的图推荐算法 SMART:A Graph-based Recommendation Algorithm for Fast Moving Consumer Goods in E-commerce Platform 计算机科学, 2017, 44(Z11): 464-469. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.099 |
[15] | 卞梦阳,杨青,张敬伟,张会兵,钱俊彦. 基于FP-Growth的图上随机游走推荐方法 Recommendation Method Based on Random Walk on Graph Integrated with FP-Growth 计算机科学, 2017, 44(6): 232-236. https://doi.org/10.11896/j.issn.1002-137X.2017.06.039 |
|