Computer Science, 2023, Vol. 50, Issue 4: 288-297. doi: 10.11896/jsjkx.220300271
WANG Taiyan (王泰彦)1,2, PAN Zulie (潘祖烈)1,2, YU Lu (于璐)1,2, SONG Jingbin (宋景彬)3
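Abstract: Binary code similarity detection has been widely used in recent years for vulnerable function search, malware detection, and advanced program analysis. Because program code resembles natural language to some degree, researchers have begun to apply natural language processing techniques such as pre-training to improve detection accuracy. Existing methods do not consider the probabilistic features of program instructions, which limits further accuracy gains. To address this bottleneck, this paper proposes a binary code similarity detection method based on pre-trained assembly instruction representation. First, a tokenization method for multi-architecture assembly instructions is designed. On top of control-flow and data-flow relations, pre-training tasks are then designed around features such as the probability that instructions appear in sequence and the usage frequency of each instruction unit, yielding better vector representations of instructions. Second, the pre-trained instruction representation is used to improve the downstream binary code similarity detection task: representation vectors replace statistical features as the representations of instructions and basic blocks, raising detection accuracy. Experimental results show that, compared with existing methods, the proposed method improves instruction representation capability by up to 23.7%, improves basic block search accuracy by up to 33.97%, and increases the number of detections in binary code similarity detection by up to 4 times.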
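The two ideas in the abstract, tokenizing assembly instructions across architectures and comparing basic blocks through representation vectors rather than statistical features, can be illustrated with a minimal sketch. This is not the authors' implementation: the token pattern, the `<CONST>` placeholder, mean pooling, and the function names `tokenize_instruction`, `mean_pool`, and `cosine` are all assumptions made for illustration, and the pre-trained model that would supply the instruction vectors is not shown.

```python
import math
import re
from typing import List

# Hypothetical token pattern (an assumption, not the paper's tokenizer):
# hex/decimal literals, identifiers (opcodes, registers), and punctuation.
_TOKEN_RE = re.compile(r"0x[0-9a-fA-F]+|\d+|[A-Za-z_.$][\w.$]*|[\[\]+\-*,#!{}:]")
_CONST_RE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def tokenize_instruction(ins: str) -> List[str]:
    """Split one assembly instruction into tokens.

    Numeric literals are replaced by a single <CONST> placeholder so the
    vocabulary stays small; this normalization is an illustrative choice.
    """
    tokens = []
    for tok in _TOKEN_RE.findall(ins):
        if _CONST_RE.fullmatch(tok):
            tokens.append("<CONST>")
        else:
            tokens.append(tok.lower())
    return tokens


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average instruction vectors into one basic-block vector (assumed pooling)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(tokenize_instruction("mov eax, [rbp-0x8]"))   # x86-style operand
    print(tokenize_instruction("ldr r0, [sp, #4]"))     # ARM-style operand
```

In a full pipeline, each token sequence would be fed to the pre-trained model to obtain an instruction vector, and basic blocks would be compared by a similarity measure such as `cosine` over their pooled vectors.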