Computer Science, 2023, Vol. 50, Issue (4): 288-297. doi: 10.11896/jsjkx.220300271

• Information Security

Binary Code Similarity Detection Method Based on Pre-training Assembly Instruction Representation

WANG Taiyan1,2, PAN Zulie1,2, YU Lu1,2, SONG Jingbin3   

  1 College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
  2 Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
  3 PLA 31401, Changchun 130022, China
  • Received: 2022-03-29; Revised: 2022-07-22; Online: 2023-04-15; Published: 2023-04-06
  • About author: WANG Taiyan, born in 1998, postgraduate. His main research interests include network security and binary code similarity detection.
    PAN Zulie, born in 1976, Ph.D., professor. His main research interests include network security, vulnerability discovery, and computer science.
  • Supported by:
    National Key R&D Program of China (2021YFB3100500).

Abstract: Binary code similarity detection has been widely used in recent years in vulnerability search, malware detection, advanced program analysis, and other fields. Because program code resembles natural language to some degree, researchers have begun applying pre-training and other natural language processing techniques to improve accuracy. This paper proposes a binary code similarity detection method based on pre-trained assembly instruction representation, to address the accuracy bottleneck caused by insufficient consideration of instruction probability features. The method includes a tokenization method for multi-architecture assembly instructions, and pre-training tasks that consider control flow, data flow, instruction logic, and probability of occurrence, in order to obtain a better vectorized representation of instructions. The downstream binary code similarity detection task is then combined with the pre-training method to improve accuracy. Experiments show that, compared with existing methods, the proposed method improves instruction representation performance by up to 23.7%, and improves block search ability and similarity detection performance by up to 33.97% and 400%, respectively.
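
To make the two ideas named in the abstract concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a simple regex-based tokenizer that splits an assembly instruction into opcode, register, and punctuation tokens and replaces numeric immediates with a placeholder, and it compares two already-computed embedding vectors by cosine similarity. The names `tokenize`, `cosine_similarity`, and the `<IMM>` placeholder are hypothetical; the abstract does not specify the paper's actual tokenization rules or similarity measure.

```python
import math
import re
from typing import List, Sequence

# Hypothetical token pattern: hex/decimal immediates, identifiers
# (opcodes, registers, symbols), and a few punctuation marks.
# This is an assumption for illustration, not the paper's tokenizer.
_TOKEN_RE = re.compile(r"0x[0-9a-fA-F]+|\d+|[A-Za-z_.$][\w.$]*|[\[\]+\-*,:#!{}]")
_IMM_RE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def tokenize(instruction: str) -> List[str]:
    """Split one assembly instruction into normalized tokens."""
    tokens = []
    for tok in _TOKEN_RE.findall(instruction):
        if tok == ",":
            continue  # operand separators carry no meaning here
        if _IMM_RE.fullmatch(tok):
            tokens.append("<IMM>")  # collapse all numeric constants
        else:
            tokens.append(tok.lower())
    return tokens


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(tokenize("mov eax, [rbp-0x10]"))  # x86-style operand syntax
    print(tokenize("ldr r0, [sp, #8]"))     # ARM-style operand syntax
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

Collapsing immediates into one placeholder is a common way to keep the vocabulary small across architectures; a real pre-trained model would map the resulting tokens to learned vectors before any similarity is computed.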

Key words: Binary code, Similarity detection, Instruction representation, Tokenization, Pre-training task

CLC Number: 

  • TP313