Computer Science, 2022, Vol. 49, Issue (11A): 211000186-6. doi: 10.11896/jsjkx.211000186

• Software Engineering •

Code Similarity Measurement Based on Graph Embedding

LIANG Yao, XIE Chun-li, WANG Wen-jie

  1. School of Computer Science and Technology, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China
  • Online: 2022-11-10 Published: 2022-11-21
  • About author: LIANG Yao, born in 1997, postgraduate, is a member of China Computer Federation. Her main research interests include code analysis and related topics.
    XIE Chun-li, born in 1979, Ph.D, associate professor, is a member of China Computer Federation. Her main research interests include software reliability analysis and deep learning.
  • Supported by:
    National Natural Science Foundation of China (61773185, 61877030) and Postgraduate Research & Practice Innovation Program of Jiangsu Province (2021XKT1392).

Abstract: In recent years, code similarity detection has been an active topic in software engineering. It supports code clone detection and code defect prediction, and helps reduce the cost of software maintenance. Most current code similarity detection methods build language-processing models that extract textual, syntactic, structural and other features of source code from tokens, abstract syntax trees (ASTs) and other code representations, and map them to real-valued vectors in a continuous space. The similarity of two code fragments is then obtained either by computing the Euclidean distance or cosine similarity of the extracted features, or by a shallow neural network model. These methods achieve better results than traditional static program analysis. However, most of them work at the syntactic level of source code and cannot make full use of its semantic information. Although Doc2Vec and Word2Vec can extract the lexical semantics of code, they cannot capture its execution semantics. To address this problem, the control flow graph (CFG) is used to represent the execution semantics of code, a graph embedding method based on random walk is used to learn and reason about the semantic information of the code, and the functional similarity of the source code is then judged. Experimental results show that the model can accurately detect the functional similarity of source code, and that its F1 value improves by 16.01% and 18.72% compared with the Doc2Vec and Word2Vec methods, respectively.
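
The pipeline outlined in the abstract (CFG, random walks, vector comparison) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graphs, the node labels, and the names `random_walks`, `graph_vector` and `cosine` are all assumptions made for illustration. Random-walk embedding methods usually train a skip-gram model on the walks; here plain co-occurrence counts stand in for the learned embedding so that the example needs only the standard library.

```python
import math
import random
from collections import Counter

def random_walks(cfg, walks_per_node=10, walk_length=6, seed=0):
    """Generate uniform random walks over a CFG given as {node: [successors]}."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(cfg):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                successors = cfg.get(walk[-1], [])
                if not successors:  # exit block reached: the walk ends early
                    break
                walk.append(rng.choice(successors))
            walks.append(walk)
    return walks

def graph_vector(walks, window=2):
    """Count ordered label pairs that occur within `window` steps of a walk.

    Assumption: this sparse count vector replaces a trained embedding.
    """
    counts = Counter()
    for walk in walks:
        for i, src in enumerate(walk):
            for dst in walk[i + 1 : i + 1 + window]:
                counts[(src, dst)] += 1
    return counts

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical CFGs; nodes are labelled by the role of each basic block.
cfg_loop_a = {"entry": ["init"], "init": ["cond"],
              "cond": ["body", "return"], "body": ["cond"], "return": []}
cfg_loop_b = {"entry": ["init"], "init": ["cond"],
              "cond": ["body", "return"], "body": ["update"],
              "update": ["cond"], "return": []}
cfg_linear = {"entry": ["assign"], "assign": ["return"], "return": []}

vec_a = graph_vector(random_walks(cfg_loop_a))
vec_b = graph_vector(random_walks(cfg_loop_b))
vec_c = graph_vector(random_walks(cfg_linear))

print("loop A vs loop B:", cosine(vec_a, vec_b))
print("loop A vs linear:", cosine(vec_a, vec_c))
```

In this toy setting the two loop graphs share many walk patterns and therefore score higher than the loop and the straight-line graph, which share none. The comparison relies on both graphs using the same block labels, which is itself an assumption of the sketch.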

Key words: Control flow graph, Graph embedding, Random walk, Code similarity detection

CLC Number: TP311