Computer Science, 2023, Vol. 50, Issue (4): 288-297. doi: 10.11896/jsjkx.220300271

• Information Security •

Binary Code Similarity Detection Method Based on Pre-training Assembly Instruction Representation

WANG Taiyan1,2, PAN Zulie1,2, YU Lu1,2, SONG Jingbin3

  1 College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
  2 Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
  3 PLA 31401, Changchun 130022, China
  • Received: 2022-03-29 Revised: 2022-07-22 Online: 2023-04-15 Published: 2023-04-06
  • Corresponding author: PAN Zulie (panzulie17@nudt.edu.cn)
  • About author: WANG Taiyan (wangty@nudt.edu.cn), born in 1998, postgraduate. His main research interests include network security and binary code similarity detection.
    PAN Zulie, born in 1976, Ph.D., professor. His main research interests include network security, vulnerability discovery, and computer science.
  • Supported by:
    National Key R&D Program of China (2021YFB3100500).

Abstract: Binary code similarity detection has been widely used in recent years for vulnerable function search, malware detection, and advanced program analysis. Because program code resembles natural language to some degree, researchers have begun to apply pre-training and other natural language processing techniques to improve detection accuracy. Existing methods do not consider the probability features of program instructions, which creates a bottleneck for further accuracy gains. To address this, a binary code similarity detection method based on pre-trained assembly instruction representation is proposed. First, a tokenization method for multi-architecture assembly instructions is designed. Pre-training tasks are then designed that build on control-flow and data-flow relations and also consider the probability that instructions appear in sequence and the usage frequency of each instruction unit, so as to obtain better vectorized representations of instructions. Second, the downstream binary code similarity detection task is improved by combining it with the pre-trained instruction representation: representation vectors replace statistical features as the representation of instructions and basic blocks, which raises detection accuracy. Experimental results show that, compared with existing methods, the proposed method improves instruction representation capability by up to 23.7%, improves basic block search accuracy by up to 33.97%, and increases the number of detections in binary code similarity detection by up to 400%.

Key words: Binary code, Similarity detection, Instruction representation, Tokenization, Pre-training task
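The pipeline outlined in the abstract (tokenize assembly instructions, map tokens to vectors, represent a basic block by those vectors, then compare blocks) can be pictured with the minimal Python sketch below. It is not the authors' implementation. The tokenization rules, the `<imm>` placeholder, the averaging step, and the hash-based `toy_embedding` are all assumptions that stand in for the paper's tokenizer and pre-trained model, which the abstract does not specify.

```python
import hashlib
import math
import re

# Hypothetical token pattern: hex/decimal immediates, identifiers, punctuation.
TOKEN_RE = re.compile(r"0x[0-9a-fA-F]+|\d+|[A-Za-z_.][\w.]*|[\[\]+\-*,#!:]")
IMM_RE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def tokenize(instruction: str) -> list[str]:
    """Split one assembly instruction into opcode/operand tokens.

    Assumption: immediates are normalized to a single "<imm>" token so that
    concrete constants do not inflate the vocabulary.
    """
    tokens = []
    for tok in TOKEN_RE.findall(instruction):
        if tok == ",":
            continue  # operand separators carry no meaning here
        tokens.append("<imm>" if IMM_RE.fullmatch(tok) else tok.lower())
    return tokens


def toy_embedding(token: str, dim: int = 16) -> list[float]:
    """Deterministic stand-in for a learned embedding (NOT a trained model)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()  # 32 bytes
    return [(b - 127.5) / 127.5 for b in digest[:dim]]


def block_vector(block: list[str], dim: int = 16) -> list[float]:
    """Represent a basic block as the mean of its token vectors."""
    vectors = [toy_embedding(t, dim) for ins in block for t in tokenize(ins)]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


if __name__ == "__main__":
    block_a = ["mov eax, [ebp+0x8]", "add eax, 1"]
    block_b = ["mov eax, [ebp+0xc]", "add eax, 2"]
    print(tokenize(block_a[0]))
    print(cosine(block_vector(block_a), block_vector(block_b)))
```

In this toy setup, two blocks that differ only in their immediate values tokenize identically and therefore receive the same vector. A pre-trained model would instead supply embeddings learned from large instruction corpora in place of `toy_embedding`.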

CLC Number: TP313