Computer Science, 2020, Vol. 47, Issue (3): 5-10. doi: 10.11896/jsjkx.190500148

• Intelligent Software Engineering •

Survey of Code Similarity Detection Methods and Tools

ZHANG Dan, LUO Ping

  1. (School of Software, Tsinghua University, Beijing 100084, China)
    (Key Laboratory of Information System Security(Tsinghua University), Ministry of Education, Beijing 100084, China)
  • Received: 2019-05-27 Online: 2020-03-15 Published: 2020-03-30
  • About author: ZHANG Dan, master. Her main research interests include information security and software analysis. LUO Ping, born in 1959, Ph.D., professor. His main research interests include information security and code detection.
  • Supported by:
    This work was supported by the National Key R&D Program of China (2018YFF0215901).

Abstract: Open-sourcing code has become a new trend in the information technology field. While code cloning improves code quality and reduces software development cost to some extent, it also affects the stability, robustness and maintainability of a software system. Code similarity detection therefore plays an important role in the development of computing and information security. To counter the hazards brought by code cloning, academia and industry have developed many code similarity detection methods and corresponding tools. According to how they process source code, these detection methods can be roughly divided into five categories: text analysis based, lexical analysis based, grammar analysis based, semantics analysis based and metrics based. The detection tools perform well in many application scenarios, but they also face a series of challenges brought by ever-increasing data volumes in the big data era. This paper first introduces the code cloning problem and compares in detail the code similarity detection methods in the five categories. It then classifies and organizes the currently available code similarity detection tools. Finally, it evaluates the detection performance of these tools against various evaluation criteria and discusses future research directions for code similarity detection.
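
As a rough illustration of the lexical analysis based category named in the abstract, the sketch below compares two Python fragments by their normalized token streams. It is not taken from the paper or from any surveyed tool: the function names `normalized_tokens` and `token_similarity`, the placeholder symbols `ID`, `NUM` and `STR`, the n-gram size, and the choice of Jaccard similarity are all assumptions made for this example.

```python
import io
import keyword
import tokenize

# Token types that carry layout only and are ignored in the comparison.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, replacing identifiers and literals with placeholders."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")    # hypothetical placeholder for any identifier
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")   # hypothetical placeholder for numeric literals
        elif tok.type == tokenize.STRING:
            tokens.append("STR")   # hypothetical placeholder for string literals
        else:
            tokens.append(tok.string)  # keywords and operators are kept as-is
    return tokens


def token_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the sets of token n-grams of two fragments."""
    def ngrams(seq: list[str]) -> set[tuple[str, ...]]:
        return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

    grams_a = ngrams(normalized_tokens(a))
    grams_b = ngrams(normalized_tokens(b))
    if not grams_a and not grams_b:
        return 1.0  # two fragments too short to form any n-gram are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Because identifiers are replaced by a placeholder before comparison, two fragments that differ only in variable and function names receive the same token stream. A purely text analysis based comparison would score such a pair lower, which is one reason the two categories are treated separately.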

Key words: Code clone, Clone detection, Clone evaluation

CLC Number: 

  • TP311