Computer Science ›› 2025, Vol. 52 ›› Issue (6): 365-380. doi: 10.11896/jsjkx.240400003

• Information Security •

Survey of Binary Code Similarity Detection Method

WEI Youyuan1, SONG Jianhua1,3,4, ZHANG Yan2,3   

    1 School of Cyber Science and Technology,Hubei University,Wuhan 430062,China
    2 School of Computer Science and Information Engineering,Hubei University,Wuhan 430062,China
    3 Key Laboratory of Intelligent Sensing System and Security(Hubei University),Ministry of Education,Wuhan 430062,China
    4 Hubei Provincial Engineering Research Center of Intelligent Connected Vehicle Network Security,Wuhan 430062,China
  • Received:2024-04-01 Revised:2024-10-03 Online:2025-06-15 Published:2025-06-11
  • About author:WEI Youyuan,born in 2000,postgraduate,is a member of CCF(No.T7853G).His main research interests include binary code similarity detection and malicious code identification.
    SONG Jianhua,born in 1973,Ph.D,professor,postgraduate supervisor,is a member of CCF(No.27785M).Her main research interests include network and information security.
  • Supported by:
    National Natural Science Foundation of China(62377009),Major Program(JD) of Hubei Province(2023BAA018),Key R&D program of Hubei Province(2021BAA184,2021BAA188) and Hubei Province Project of Key Research Institute of Humanities and Social Sciences at Universities(Research Center of Information Management for Performance Evaluation)(2020JX01).

Abstract: Code similarity detection can be divided into two types according to the research object: source code similarity detection and binary code similarity detection. Both are commonly used in scenarios such as malicious code identification, vulnerability search, and copyright protection. In the current domestic Internet environment, programs are usually released as binary files, and the source code of most programs is not directly available. In software security research, binary code similarity detection therefore has the wider scope of application. Starting from the definition and implementation process of binary code similarity detection, this survey divides existing methods into three categories according to the form of code representation: text character-based, code embedding-based, and graph embedding-based. Classic binary code similarity detection methods are compared with the research of the past five years. A total of 19 papers on new methods are reviewed, and the methods are analyzed and summarized in terms of multi-architecture support, baselines, benchmark datasets, and detection performance. Finally, current problems and possible future research directions are analyzed in light of the development of the new methods.

Key words: Binary code similarity detection, Code representation, Software security, Malicious code identification, Vulnerability search

CLC Number: 

  • TP311.5