Computer Science ›› 2024, Vol. 51 ›› Issue (5): 355-362.doi: 10.11896/jsjkx.230400011

• Information Security •

Study on Binary Code Similarity Detection Based on Jump-SBERT

YAN Yintong, YU Lu, WANG Taiyan, LI Yuwei, PAN Zulie   

  1. College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China; Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
  • Received:2023-04-03 Revised:2023-07-28 Online:2024-05-15 Published:2024-05-08
  • About author: YAN Yintong, born in 1997, postgraduate. His main research interests include network security and binary code similarity detection.
    PAN Zulie, born in 1976, Ph.D., professor. His main research interests include network security, vulnerability discovery and computer science.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (62202484).

Abstract: Binary code similarity detection plays an important role in many security fields. Existing methods suffer from high computational cost, low accuracy, incomplete capture of the semantic information of binary functions, and evaluation on a single data set. To address these problems, a binary code similarity detection technique based on Jump-SBERT is proposed. Jump-SBERT has two main innovations. First, it uses a siamese (twin) network to build the SBERT network structure, which reduces the computational cost of the model while keeping its accuracy unchanged. Second, it introduces a jump recognition mechanism, which enables Jump-SBERT to learn the graph structure information of binary functions and thus capture their semantic information more comprehensively. Experimental results show that the recognition accuracy of Jump-SBERT reaches 96.3% in the small function pool (32 functions) and 85.1% in the large function pool (10 000 functions), which is 36.13% higher than state-of-the-art (SOTA) methods. Jump-SBERT is more stable in large-scale binary code similarity detection. Ablation experiments show that both innovations have positive effects on Jump-SBERT, and the jump recognition mechanism contributes up to 9.11%.
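
The function-pool evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "recognition accuracy" means top-1 retrieval: each query function is embedded once, compared by cosine similarity against every embedding in the pool, and counted as correct when its true counterpart ranks first. The embeddings here are random toy vectors, not outputs of Jump-SBERT, and the names `cosine`, `recall_at_1` and `toy_embeddings` are hypothetical helpers introduced only for this example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(queries, pool):
    """Fraction of queries whose most similar pool entry is the true match.

    Assumption: queries[i] and pool[i] are embeddings of the same function
    compiled in two different ways.
    """
    hits = 0
    for i, query in enumerate(queries):
        best = max(range(len(pool)), key=lambda j: cosine(query, pool[j]))
        hits += int(best == i)
    return hits / len(queries)


def toy_embeddings(pool_size, dim=16, noise=0.1, seed=0):
    """Stand-in for a siamese encoder: pool vectors plus noisy copies as queries."""
    rng = random.Random(seed)
    pool = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pool_size)]
    queries = [[x + rng.gauss(0, noise) for x in vec] for vec in pool]
    return queries, pool


queries, pool = toy_embeddings(pool_size=32)
score = recall_at_1(queries, pool)
print(f"Top-1 accuracy over a pool of 32 toy functions: {score:.3f}")
```

In a siamese (bi-encoder) setup of this kind, each function is encoded independently, so pool embeddings can be computed once and reused for every query. This is one plausible reading of the reduced computational cost the abstract attributes to the SBERT structure.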

Key words: Binary code, Similarity detection, Semantic information, SBERT network structure, Jump recognition mechanism

CLC Number: 

  • TP312