Computer Science ›› 2024, Vol. 51 ›› Issue (6): 12-22. doi: 10.11896/jsjkx.230400117

• Computer Software •

Summary of Token-based Source Code Clone Detection Techniques

LIU Chunling, QI Xuyan, TANG Yonghe, SUN Xuekai, LI Qinghao, ZHANG Yu   

  1. School of Network and Cybersecurity, Information Engineering University, Zhengzhou 450001, China
  • Received: 2023-04-17 Revised: 2023-09-27 Online: 2024-06-15 Published: 2024-06-05
  • About author: LIU Chunling, born in 1981, master, lecturer. Her main research interests include code vulnerability detection and code similarity detection.
    TANG Yonghe, born in 1983, Ph.D., lecturer. His main research interests include malware detection and classification, code similarity detection and computer security.
  • Supported by:
    Key R&D Program of Henan Province (221111210300).

Abstract: Code cloning refers to the creation of similar or identical code during software development through the reuse, modification, and refactoring of source code. Code cloning can improve software development efficiency and reduce development costs, but it can also harm the development and maintenance of software systems, for example by reducing stability and propagating software defects. Clone detection techniques for source code have important research and application value in plagiarism detection, vulnerability detection, copyright infringement, and other fields. Although some excellent detection tools and techniques have emerged, detecting syntactic and semantic clones effectively and at large scale remains challenging. Among existing approaches, lexical-based (token-based) clone detection can quickly detect Type-1 to Type-3 clones and can be extended to other programming languages and large-scale projects, so it is commonly used for clone detection in large-scale databases. This paper reviews the research status of lexical-based clone detection over the past decade, analyzes and summarizes 16 selected publications along 10 characteristics, and finally proposes possible future research directions for lexical-based clone detection in light of new technological developments.
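
As a rough illustration of the general idea behind lexical (token-based) detection, the following Python sketch compares two C-like fragments by their normalized token bags. It is a minimal sketch under stated assumptions, not the method of any tool surveyed in the paper: the names `KEYWORDS`, `tokenize`, and `similarity` are hypothetical, the keyword set is deliberately tiny, and comment stripping ignores string literals. Dropping whitespace and comments covers Type-1 differences, mapping identifiers and literals to placeholders covers Type-2 renaming, and a similarity score below 1.0 but above a chosen threshold can flag Type-3 candidates.

```python
import re
from collections import Counter

# Assumption: a tiny keyword set for a C-like language, for illustration only.
KEYWORDS = {"int", "float", "void", "return", "if", "else", "for", "while"}

# Identifiers/keywords, numeric literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def strip_comments(code):
    """Remove /* ... */ and // comments (naive: ignores string literals)."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)
    return re.sub(r"//[^\n]*", " ", code)


def tokenize(code, normalize=True):
    """Split code into tokens; optionally map identifiers and numbers to placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(strip_comments(code)):
        if normalize and tok not in KEYWORDS:
            if tok[0].isalpha() or tok[0] == "_":
                tok = "ID"    # identifier renaming no longer matters (Type-2)
            elif tok[0].isdigit():
                tok = "LIT"   # literal values no longer matter (Type-2)
        tokens.append(tok)
    return tokens


def similarity(a, b):
    """Overlap of the two token bags, divided by the size of the larger bag."""
    bag_a, bag_b = Counter(tokenize(a)), Counter(tokenize(b))
    shared = sum((bag_a & bag_b).values())
    return shared / max(sum(bag_a.values()), sum(bag_b.values()), 1)


original = "int sum(int a, int b) { return a + b; } // add"
renamed = "int add(int x,\n int y) {\n return x + y; /* same */ }"
extended = "int sum3(int a, int b, int c) { return a + b + c; }"

print(similarity(original, renamed))   # renamed copy with different layout
print(similarity(original, extended))  # copy with an added parameter and term
```

A bag-of-tokens comparison ignores token order, which keeps it cheap but can produce false positives; sequence-based matching is a common alternative when order matters.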

Key words: Software security, Source code clone detection, Code representation, Deep learning

CLC Number: TP311