Computer Science (计算机科学), 2024, Vol. 51, Issue (5): 355-362. doi: 10.11896/jsjkx.230400011
严尹彤, 于璐, 王泰彦, 李宇薇, 潘祖烈
YAN Yintong, YU Lu, WANG Taiyan, LI Yuwei, PAN Zulie
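Abstract: Binary code similarity detection plays an important role in many security domains. Existing binary code similarity detection methods suffer from high computational overhead combined with low accuracy, incomplete recognition of the semantic information of binary functions, and evaluation on a single dataset. To address these problems, this paper proposes a binary code similarity detection technique based on Jump-SBERT. Jump-SBERT has two main innovations. First, it uses a Siamese network to build the SBERT network structure, which reduces the model's computational overhead while keeping its accuracy unchanged. Second, it introduces a jump recognition mechanism that lets Jump-SBERT learn the graph-structure information of binary functions and thus capture their semantic information more comprehensively. Experimental results show that Jump-SBERT reaches a recognition accuracy of 96.3% in a small function pool (32 functions) and 85.1% in a large function pool (10 000 functions), which is 36.13% higher than state-of-the-art (SOTA) methods, and that Jump-SBERT performs more stably in large-scale binary code similarity detection. Ablation experiments show that both innovations contribute positively to Jump-SBERT, with the jump recognition mechanism contributing up to 9.11%.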
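The function-pool evaluation described in the abstract can be illustrated with a minimal sketch, assuming a bi-encoder (Siamese) setup in which every function is embedded once and candidates are ranked by cosine similarity. The encoder below is a bag-of-tokens stand-in, not Jump-SBERT, and it ignores jump structure entirely. The names `embed`, `rank_pool`, and `recall_at_1`, the toy assembly tokens, and the choice of top-1 accuracy as the metric are illustrative assumptions rather than details taken from the paper.

```python
import math
from collections import Counter


def embed(tokens):
    """Stand-in encoder (NOT Jump-SBERT): L2-normalised bag-of-tokens vector."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0.0:
        return {}
    return {tok: c / norm for tok, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def rank_pool(query_vec, pool_vecs):
    """Return pool indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, vec) for vec in pool_vecs]
    return sorted(range(len(pool_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_1(queries, pool, truth):
    """Fraction of queries whose true match is ranked first in the pool."""
    pool_vecs = [embed(func) for func in pool]  # each pool function is encoded once
    hits = 0
    for query, true_index in zip(queries, truth):
        if rank_pool(embed(query), pool_vecs)[0] == true_index:
            hits += 1
    return hits / len(queries)


# Toy pool: three "functions" written as instruction-token lists (hypothetical data).
pool = [
    ["push", "rbp", "mov", "rbp", "rsp", "call", "ret"],
    ["xor", "eax", "eax", "cmp", "edi", "esi", "jle", "ret"],
    ["mov", "eax", "edi", "imul", "eax", "esi", "ret"],
]
# Queries: slightly different variants of pool functions 0 and 2.
queries = [
    ["push", "rbp", "mov", "rbp", "rsp", "call", "pop", "rbp", "ret"],
    ["mov", "eax", "edi", "imul", "eax", "esi", "add", "eax", "1", "ret"],
]
truth = [0, 2]

print(recall_at_1(queries, pool, truth))
```

In a setup like this, a pool of N functions costs N encoder passes up front plus one pass per query, whereas a model that scores each query-candidate pair jointly needs one pass per pair. That is the general reason a Siamese arrangement tends to scale better as the pool grows.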