计算机科学 (Computer Science), 2020, Vol. 47, Issue (3): 5-10. doi: 10.11896/jsjkx.190500148

Special topic: Intelligent Software Engineering


代码相似性检测方法与工具综述

张丹,罗平   

  1. (清华大学软件学院 北京100084)
    (信息系统安全教育部重点实验室(清华大学) 北京100084)
  • 收稿日期:2019-05-27 出版日期:2020-03-15 发布日期:2020-03-30
  • 通讯作者: 罗平(luop@mail.tsinghua.edu.cn)
  • 基金资助:
    国家重点研发计划项目(2018YFF0215901)

Survey of Code Similarity Detection Methods and Tools

ZHANG Dan, LUO Ping

  1. (School of Software, Tsinghua University, Beijing 100084, China)
    (Key Laboratory of Information System Security (Tsinghua University), Ministry of Education, Beijing 100084, China)
  • Received: 2019-05-27 Online: 2020-03-15 Published: 2020-03-30
  • About the authors: ZHANG Dan, master's degree. Her main research interests include information security and software analysis. LUO Ping, born in 1959, Ph.D., professor. His main research interests include information security and code detection.
  • Supported by:
    This work was supported by the National Key R&D Program of China (2018YFF0215901).

摘要: 在代码开源的潮流下,代码克隆在提高代码质量和降低开发成本的同时,一定程度地影响了软件系统的稳定性、健壮性与可维护性。代码相似性检测在计算机与信息安全发展方面具有重要的意义。为应对代码克隆带来的各种危害,目前学术界和工业界提出了很多代码相似性检测的方法,这些方法按照源代码信息处理程度可分为基于文本、词法、语法、语义和度量值5类;并开发了相应的检测工具,这些工具实现了很好的检测效果,但在大数据时代背景下也面临着数据规模不断扩大带来的一系列挑战。文中综述了代码相似性检测的方法,对5类检测方法做了详细比较;结合传统方法与机器学习技术,归类了不同检测方法对应的检测工具;按照不同评价标准评估了检测工具的检测效果,总结了每种检测方法的首选检测工具,并对未来代码相似性检测的研究方向做出了展望。

关键词: 代码克隆, 克隆检测, 克隆评估

Abstract: Open-sourcing code has become a trend in the information technology field. While code cloning improves code quality and reduces software development cost to some extent, it also affects the stability, robustness and maintainability of software systems. Code similarity detection therefore plays an important role in the development of computing and information security. To counter the hazards of code cloning, academia and industry have proposed many code similarity detection methods and developed corresponding tools. According to how much of the source code's information they process, these methods can be roughly divided into five categories: text-based, lexical (token)-based, syntax-based, semantics-based and metrics-based. The tools achieve good detection performance in many application scenarios, but they also face a series of challenges brought by ever-growing data volumes in the big data era. This paper first introduces the code cloning problem and compares the five categories of detection methods in detail. It then classifies the currently available detection tools by the detection method they implement, covering both traditional methods and machine learning techniques. Finally, it evaluates the detection performance of these tools under various evaluation criteria, summarizes the preferred tool for each category of method, and outlines future research directions for code similarity detection.

Key words: Clone detection, Clone evaluation, Code clone

CLC number:

  • TP311
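
To make the difference between the first two categories in the abstract concrete, the following is a minimal sketch, not taken from the paper or from any tool it surveys. It contrasts a text-based comparison (raw character sequences) with a lexical, token-based comparison in which identifiers and numbers are replaced by placeholders before matching. The function names, the keyword set, the regular-expression tokenizer and the choice of Jaccard similarity over token 3-grams are all illustrative assumptions.

```python
import re
from difflib import SequenceMatcher

# Assumption: a tiny keyword set that is enough for the example fragments.
KEYWORDS = {"def", "return", "for", "in", "if", "else", "while"}
# Assumption: a simplistic tokenizer (identifiers, integers, single symbols).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def text_similarity(a, b):
    """Text-based: compare the raw character sequences directly."""
    return SequenceMatcher(None, a, b).ratio()


def normalized_tokens(code):
    """Lexical step: tokenize, then abstract identifiers and numbers away."""
    out = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            out.append(tok)          # keep language keywords as they are
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("ID")         # any identifier becomes a placeholder
        elif tok.isdigit():
            out.append("NUM")        # any integer literal becomes a placeholder
        else:
            out.append(tok)          # punctuation and operators are kept
    return out


def token_similarity(a, b, n=3):
    """Token-based: Jaccard similarity over sets of normalized token n-grams."""
    def grams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    ga, gb = grams(normalized_tokens(a)), grams(normalized_tokens(b))
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


# Two fragments that differ only in identifier names.
ORIGINAL = "def total(xs):\n    s = 0\n    for x in xs:\n        s = s + x\n    return s\n"
RENAMED = "def accumulate(values):\n    acc = 0\n    for v in values:\n        acc = acc + v\n    return acc\n"

if __name__ == "__main__":
    print("text-based :", round(text_similarity(ORIGINAL, RENAMED), 3))
    print("token-based:", round(token_similarity(ORIGINAL, RENAMED), 3))
```

In this sketch the renamed fragment gets a token-based score of exactly 1.0, while the text-based score stays below 1.0 because the raw characters differ. Syntax-based, semantics-based and metrics-based methods operate on richer representations and are not covered by this example.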