计算机科学 ›› 2022, Vol. 49 ›› Issue (11A): 211000186-6.doi: 10.11896/jsjkx.211000186

• 软件工程 • 上一篇    下一篇

基于图嵌入的代码相似性度量

梁瑶, 谢春丽, 王文捷   

  1. 江苏师范大学计算机科学与技术学院 江苏 徐州 221116
  • 出版日期:2022-11-10 发布日期:2022-11-21
  • 通讯作者: 谢春丽(6020030132@jsnu.edu.cn)
  • 作者简介:(2396769801@qq.com)
  • 基金资助:
    国家自然科学基金(61773185,61877030);江苏省研究生科研与实践创新计划项目(2021XKT1392)

Code Similarity Measurement Based on Graph Embedding

LIANG Yao, XIE Chun-li, WANG Wen-jie   

  1. School of Computer Science and Technology,Jiangsu Normal University,Xuzhou,Jiangsu 221116,China
  • Online:2022-11-10 Published:2022-11-21
  • About author:LIANG Yao,born in 1997,postgra-duate,is a member of China Computer Federation.Her main research interests include code analysis and so on.
    XIE Chun-li,born in 1979,Ph.D,asso-ciate professor,is a member of China Computer Federation.Her main research interests include software reliability analysis and deep learning.
  • Supported by:
    National Natural Science Foundation of China(61773185,61877030) and Postgraduate Research & Practice Innovation Program of Jiangsu Province(2021XKT1392).

摘要: 近年来,代码相似性检测一直是软件工程领域的热点问题,它可以帮助代码克隆检测、代码缺陷预测等,降低软件维护成本。目前流行的代码相似性检测方法大多是借用自然语言处理方法从符号(Token)、抽象语法树(Abstract Syntax Tree,AST)等代码表征中提取源代码的文本、语法、结构等特征信息,将其映射为连续空间的实值向量,然后通过直接计算提取特征的欧氏距离、余弦值,或通过浅层神经网络模型获得代码的相似值,这些方法取得了优于传统程序静态分析的效果。但这些方法大多数是基于源代码语法层面的检测技术,未充分利用源代码的语义信息。Doc2Vec和Word2Vec虽然能够挖掘代码的词汇语义信息,但对代码的执行语义信息无能为力,针对这一问题,提出了使用控制流程图(Control Flow Graph,CFG)来表示代码的执行语义,并使用基于随机游走(Random Walk)的图嵌入方法来学习和推理代码的语义信息,进而判断源代码的功能相似性。实验结果表明,和Doc2Vec以及Word2Vec方法相比,该模型能够较精确地检测出源代码的功能相似性,其F1值相较于Doc2Vec和Word2Vec方法分别提高了16.01%和18.72%。

关键词: 控制流程图, 图嵌入, 随机游走, 代码相似性检测

Abstract: In recent years,code similarity detection has been a hot topic in the field of software engineering,which can help code clone detection,code defect prediction,and reduce the cost of software maintenance.At present,most popular code similarity detection methods build language processing model to extract the text,syntax,structure and other feature information of source code from tokens,AST and other code representations,and map them to real value vectors in continuous space.Then,obtain the similar value of the code comparison by calculating the Euclidean distance and cosine value of the extracted features or by the shallow neural network model.These methods have achieved better results than the traditional static analysis program.However,most of these methods are based on the grammar level of source code,which can not make full use of the semantic information of source code.Although Doc2Vec and Word2Vec can extract the lexical semantic information of code,they are powerless to handle the execution semantic information of code.To solve this problem,control flow graph(CFG) is proposed to represent the execution semantics of code,and the graph embedding method based on random walk is used to learn and reason the semantic information of the code,and then judge the functional similarity of the source code.Compared with Doc2Vec and Word2Vec methods,experimental results show that the model can accurately detect the functional similarity of source code,and its F1 value improves by 16.01% and 18.72% compared with Doc2Vec and Word2Vec methods,respectively.

Key words: Control flow graph, Graph embedding, Random walk, Code similarity detection

中图分类号: 

  • TP311
[1]BAKER B S.On finding duplication and near-duplication in large software systems[C]//Proceedings of 2nd Working Conference on Reverse Engineering.IEEE,1995:86-95.
[2]DUCASSE S,RIEGER M,DEMEYER S.A language independent approach for detecting duplicated code[C]//Proceedings IEEE International Conference on Software Maintenance-1999(ICSM’99)[C]//Software Maintenance for Business Change.IEEE,1999:109-118.
[3]ROY C K,CORDY J R.NICAD:Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization[C]//2008 16th iEEE International Conference on Program Comprehension.IEEE,2008:172-181.
[4]KAMIYA T,KUSUMOTO S,INOUE K.CCFinder:A multilinguistic token-based code clone detection system for large scale source code[J].IEEE Transactions on Software Engineering,2002,28(7):654-670.
[5]LIVIERI S,HIGO Y,MATUSHITA M,et al.Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder:D-CCFinder[C]//29th International Conference on Software Engineering(ICSE’07).IEEE,2007:106-115.
[6]LI Z,LU S,MYAGMAR S,et al.CP-Miner:Finding copy-paste and related bugs in large-scale software code[J].IEEE Transactions on Software Engineering,2006,32(3):176-192.
[7]LI L,FENG H,ZHUANG W,et al.Cclearner:A deep learning-based clone detection approach[C]//2017 IEEE International Conference on Software Maintenance and Evolution(ICSME).IEEE,2017:249-260.
[8]BAXTER I D,YAHIN A,MOURA L,et al.Clone detection using abstract syntax trees[C]//Proceedings.International Conference on Software Maintenance.IEEE,1998:368-377.
[9]JIANG L,MISHERGHI G,SU Z,et al.Deckard:Scalable and accurate tree-based detection of code clones[C]//29th International Conference on Software Engineering(ICSE’07).IEEE,2007:96-105.
[10]WEI H,LI M.Supervised Deep Features for Software Func-tional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code[C]//IJCAI.2017:3034-3040.
[11]MAYRAND J,LEBLANC C,MERLO E M.Experiment on the automatic detection of function clones in a software system using metrics[C]//International Conference on Software Maintenance.IEEE,1996:244-253.
[12]KONTOGIANNIS K,GALLER M,DEMORI R.Detecting code similarity using patterns[C]//Working Notes of 3rd Workshop on AI and Software Engineering.1995:68-73.
[13]KOMONDOOR R,HORWITZ S.Using slicing to identify duplication in source code[C]//International static analysis symposium.Springer,Berlin,Heidelberg,2001:40-56.
[14]KRINKE J.Identifying similar code with program dependence graphs[C]//Proceedings Eighth Working Conference on Reverse Engineering.IEEE,2001:301-309.
[15]HUMMEL B,JUERGENS E,HEINEMANN L,et al.Index-based code clone detection:incremental,distributed,scalable[C]//2010 IEEE International Conference on Software Maintenance.IEEE,2010:1-9.
[16]ALON U,ZILBERSTEIN M,LEVY O,et al.code2vec:Learning distributed representations of code[J].arXiv:1803,09473,2019.
[17]ROY D,PANDA P,ROY K.Tree-cnn:A deep convolutional neural network for lifelong learning[J].arXiv:1802.05800,2018.
[18]DEFREEZ D,THAKUR A V,RUBIO-GONZÁLEZ C.Path-based function embedding and its application to specification mining[J].arXiv:1802.07779,2018.
[19]PEROZZI B,AL-RFOU R,SKIENA S.Deepwalk:Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2014:701-710.
[20]ZHANG J,WANG X,ZHANG H,et al.A novel neural source code representation based on abstract syntax tree[C]//2019 IEEE/ACM 41st International Conference on Software Engineering(ICSE).IEEE,2019:783-794.
[21]WANG W,LI G,SHEN S,et al.Modular tree network forsource code representation learning[J].ACM Transactions on Software Engineering and Methodology(TOSEM),2020,29(4):1-23.
[22]BUI N D Q,YU Y,JIANG L.Infercode:Self-supervised learning of code representations by predicting subtrees[C]//2021 IEEE/ACM 43rd International Conference on Software Engineering(ICSE).IEEE,2021:1186-1197.
[23]LE Q,MIKOLOV T.Distributed representations of sentences and documents[C]//International Conference on Machine Learning.PMLR,2014:1188-1196.
[24]MIKOLOV T,SUTSKEVER I,CHEN K,et al.DistributedRepresentations of Words and Phrases and their Compositiona-lity[C]//Advances in Neural Information Processing Systems.2013:3111-3119.
[25]ZHAO G,HUANG J.Deepsim:deep learning code functionalsimilarity[C]//Proceedings of the 2018 26th ACM Joint Mee-ting on European Software Engineering Conference and Sympo-sium on the Foundations of Software Engineering.2018:141-151.
[26]WANG W,LI G,MA B,et al.Detecting code clones with graph neural network and flow-augmented abstract syntax tree[C]//2020 IEEE 27th International Conference on Software Analysis,Evolution and Reengineering(SANER).IEEE,2020:261-271.
[27]KARNALIM O.Syntax trees and information retrieval to improve code similarity detection[C]//Proceedings of the Twenty-Second Australasian Computing Education Conference.2020:48-55.
[28]HAAS R,NIEDERMAYR R,RÖHM T,et al.RecommendingUnnecessary Source Code Based on Static Analysis[C]//2019 IEEE/ACM 41st International Conference on Software Engineering:Companion Proceedings(ICSE-Companion).IEEE,2019:274-275.
[29]ALON U,ZILBERSTEIN M,LEVY O,et al.A general path-based representation for predicting program properties[J].ACM SIGPLAN Notices,2018,53(4):404-419.
[30]ZHANG J,WANG X,ZHANG H,et al.A novel neural source code representation based on abstract syntax tree[C]//2019 IEEE/ACM 41st International Conference on Software Engineering(ICSE).IEEE,2019:783-794.
[31]CHEN Q Y,LI S P,YAN M,et al.Code clone detection:A literature review[J].Ruan Jian Xue Bao/Journal of Software,2019,30(4):962-980.
[1] 李勇, 吴京鹏, 张钟颖, 张强.
融合快速注意力机制的节点无特征网络链路预测算法
Link Prediction for Node Featureless Networks Based on Faster Attention Mechanism
计算机科学, 2022, 49(4): 43-48. https://doi.org/10.11896/jsjkx.210800276
[2] 杨辉, 陶力宏, 朱建勇, 聂飞平.
基于锚点的快速无监督图嵌入
Fast Unsupervised Graph Embedding Based on Anchors
计算机科学, 2022, 49(4): 116-123. https://doi.org/10.11896/jsjkx.210200098
[3] 富坤, 郭云朋, 禚佳明, 李佳宁, 刘琪.
语义增强的完全不平衡标签网络表示学习算法
Semantic Information Enhanced Network Embedding with Completely Imbalanced Labels
计算机科学, 2022, 49(11): 109-116. https://doi.org/10.11896/jsjkx.210900101
[4] 刘扬, 郑文萍, 张川, 王文剑.
一种基于局部随机游走的标签传播算法
Local Random Walk Based Label Propagation Algorithm
计算机科学, 2022, 49(10): 103-110. https://doi.org/10.11896/jsjkx.220400145
[5] 方磊, 武泽慧, 魏强.
二进制代码相似性检测技术综述
Summary of Binary Code Similarity Detection Techniques
计算机科学, 2021, 48(5): 1-8. https://doi.org/10.11896/jsjkx.200400085
[6] 邢长征, 朱金侠, 孟祥福, 齐雪月, 朱尧, 张峰, 杨一鸣.
兴趣点推荐方法研究综述
Point-of-interest Recommendation:A Survey
计算机科学, 2021, 48(11A): 176-183. https://doi.org/10.11896/jsjkx.201100021
[7] 刘丹, 赵森, 颜志良, 赵静, 王会青.
基于堆叠自动编码器的miRNA-疾病关联预测方法
miRNA-disease Association Prediction Model Based on Stacked Autoencoder
计算机科学, 2021, 48(10): 114-120. https://doi.org/10.11896/jsjkx.200900169
[8] 张虎, 周晶晶, 高海慧, 王鑫.
融合节点结构和内容的网络表示学习方法
Network Representation Learning Method on Fusing Node Structure and Content
计算机科学, 2020, 47(12): 119-124. https://doi.org/10.11896/jsjkx.190900027
[9] 唐家琪, 吴璟莉, 廖元秀, 王金艳.
基于双加权投票的蛋白质功能预测
Prediction of Protein Functions Based on Bi-weighted Vote
计算机科学, 2019, 46(4): 222-227. https://doi.org/10.11896/j.issn.1002-137X.2019.04.035
[10] 赵倩倩,吕敏,许胤龙.
基于两种子结构感知的社交网络Graphlets采样估计算法
Estimating Graphlets via Two Common Substructures Aware Sampling in Social Networks
计算机科学, 2019, 46(3): 314-320. https://doi.org/10.11896/j.issn.1002-137X.2019.03.046
[11] 尹欣红, 赵世燕, 陈晓云.
带偏置的信号传播的随机游走的社团检测算法
Community Detection Algorithm Based on Random Walk of Signal Propagation with Bias
计算机科学, 2019, 46(12): 45-55. https://doi.org/10.11896/jsjkx.190700051
[12] 刘庆烽, 刘哲, 宋余庆, 朱彦.
基于约束随机游走的肿瘤图像分割方法
Tumor Image Segmentation Method Based on Random Walk with Constraint
计算机科学, 2018, 45(7): 243-247. https://doi.org/10.11896/j.issn.1002-137X.2018.07.042
[13] 肖迎元,张红玉.
基于用户潜在特征的社交网络好友推荐方法
Friend Recommendation Method Based on Users’ Latent Features in Social Networks
计算机科学, 2018, 45(3): 218-222. https://doi.org/10.11896/j.issn.1002-137X.2018.03.034
[14] 卿勇,刘梦娟,银盈,李杨曦.
SMART:一种面向电商平台快速消费品的图推荐算法
SMART:A Graph-based Recommendation Algorithm for Fast Moving Consumer Goods in E-commerce Platform
计算机科学, 2017, 44(Z11): 464-469. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.099
[15] 卞梦阳,杨青,张敬伟,张会兵,钱俊彦.
基于FP-Growth的图上随机游走推荐方法
Recommendation Method Based on Random Walk on Graph Integrated with FP-Growth
计算机科学, 2017, 44(6): 232-236. https://doi.org/10.11896/j.issn.1002-137X.2017.06.039
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!