Computer Science, 2024, Vol. 51, Issue (5): 355-362. doi: 10.11896/jsjkx.230400011

• Information Security •

  • Corresponding author: PAN Zulie (panzulie17@nudt.edu.cn)
  • Author e-mail: yanyintong.edu@nudt.edu.cn

Study on Binary Code Similarity Detection Based on Jump-SBERT

YAN Yintong, YU Lu, WANG Taiyan, LI Yuwei, PAN Zulie   

  1. College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
    Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
  • Received: 2023-04-03 Revised: 2023-07-28 Online: 2024-05-15 Published: 2024-05-08
  • About author: YAN Yintong, born in 1997, postgraduate. His main research interests include network security and binary code similarity detection.
    PAN Zulie, born in 1976, Ph.D., professor. His main research interests include network security, vulnerability discovery and computer science.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China(62202484).


Abstract: Binary code similarity detection plays an important role in many security fields. Existing methods suffer from high computational cost combined with low accuracy, incomplete recognition of the semantic information of binary functions, and evaluation on a single dataset. To address these problems, a binary code similarity detection technique based on Jump-SBERT is proposed. Jump-SBERT has two main innovations. First, it uses a Siamese network to build the SBERT network structure, which reduces the computational cost of the model while keeping its accuracy unchanged. Second, it introduces a jump recognition mechanism that enables Jump-SBERT to learn the graph structure information of binary functions and thus capture their semantic information more comprehensively. Experimental results show that the recognition accuracy of Jump-SBERT can reach 96.3% in a small function pool (32 functions) and 85.1% in a large function pool (10 000 functions), which is 36.13% higher than state-of-the-art (SOTA) methods, and that Jump-SBERT performs more stably in large-scale binary code similarity detection. Ablation experiments show that both innovations contribute positively to Jump-SBERT, with the jump recognition mechanism contributing up to 9.11%.

Key words: Binary code, Similarity detection, Semantic information, SBERT network structure, Jump recognition mechanism
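
The function-pool evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a bi-encoder setting in which every function is embedded once and candidates are ranked by cosine similarity, and it assumes that "recognition accuracy" means the fraction of queries whose true match ranks first; the abstract does not state the exact metric. The random vectors stand in for model embeddings, and all names (`rank_pool`, `recall_at_1`, `toy_embeddings`) are hypothetical. This is not the Jump-SBERT implementation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_pool(query, pool):
    """Return pool indices sorted from most to least similar to the query."""
    scores = [cosine(query, emb) for emb in pool]
    return sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)


def recall_at_1(queries, pool):
    """Fraction of queries whose true match (same index in the pool) ranks first."""
    hits = sum(1 for i, q in enumerate(queries) if rank_pool(q, pool)[0] == i)
    return hits / len(queries)


def toy_embeddings(pool_size, dim=16, noise=0.1, seed=0):
    """Hypothetical stand-in for an encoder: each query is a noisy copy of its match."""
    rng = random.Random(seed)
    pool = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pool_size)]
    queries = [[x + rng.gauss(0, noise) for x in emb] for emb in pool]
    return queries, pool


# Pool of 32 functions, mirroring the small-pool size named in the abstract.
queries, pool = toy_embeddings(pool_size=32)
print(recall_at_1(queries, pool))
```

Because each function is encoded independently, a pool of N functions needs N encoder passes plus cheap vector comparisons, rather than one joint model pass per candidate pair. This is the usual cost argument for Siamese (bi-encoder) designs.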

CLC Number: 

  • TP312