Computer Science ›› 2025, Vol. 52 ›› Issue (6): 365-380. doi: 10.11896/jsjkx.240400003

• Information Security •

  • Corresponding author: SONG Jianhua (sjhhubu@126.com)
  • Author email: 202221116012664@stu.hubu.edu.cn

Survey of Binary Code Similarity Detection Method

WEI Youyuan1, SONG Jianhua1,3,4, ZHANG Yan2,3   

    1 School of Cyber Science and Technology, Hubei University, Wuhan 430062, China
    2 School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
    3 Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China
    4 Hubei Provincial Engineering Research Center of Intelligent Connected Vehicle Network Security, Wuhan 430062, China
  • Received: 2024-04-01  Revised: 2024-10-03  Published: 2025-06-15  Online: 2025-06-11
  • About author: WEI Youyuan, born in 2000, postgraduate, is a member of CCF (No. T7853G). His main research interests include binary code similarity detection and malicious code identification.
    SONG Jianhua, born in 1973, Ph.D., professor, postgraduate supervisor, is a member of CCF (No. 27785M). Her main research interests include network and information security.
  • Supported by:
    National Natural Science Foundation of China (62377009), Major Program (JD) of Hubei Province (2023BAA018), Key R&D Program of Hubei Province (2021BAA184, 2021BAA188) and Hubei Province Project of Key Research Institute of Humanities and Social Sciences at Universities (Research Center of Information Management for Performance Evaluation) (2020JX01).

Abstract: Code similarity detection can be divided by research object into source code similarity detection and binary code similarity detection, and is commonly used in scenarios such as malicious code identification, vulnerability search, and copyright protection. In the current domestic Internet environment, programs are usually released as binary files, and the source code of most programs cannot be obtained directly. Therefore, in software security research, binary code similarity detection has a relatively wider range of application. Starting from the definition and implementation process of binary code similarity detection, this survey divides existing methods by code representation into three categories: text character-based, code embedding-based, and graph embedding-based. It reviews 19 papers in total, covering classic binary code similarity detection methods and new methods from the past five years, and analyzes and summarizes each category in terms of multi-architecture support, baselines, benchmark datasets, and detection performance. Finally, it discusses current problems and possible future research directions in light of how the new methods have developed.

Key words: Binary code similarity detection, Code representation, Software security, Malicious code identification, Vulnerability search

CLC Number: 

  • TP311.5