Semantic-based Multi-architecture Binary Function Name Prediction Method

Computer Science, 2023, Vol. 50, Issue (10): 369-376. doi: 10.11896/jsjkx.220800175

SHAO Wenqiang¹, CAI Ruijie¹, SONG Enzhou², GUO Xixi¹, LIU Shengli¹

1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China
2 National Digital Switching System Engineering Technological R&D Center, Zhengzhou 450001, China

Received: 2022-08-17 Revised: 2022-11-26 Online: 2023-10-10 Published: 2023-10-10
Corresponding author: LIU Shengli (mr_liushengli@163.com)
Author e-mail: zzuswq@126.com
About author: SHAO Wenqiang, born in 1996, postgraduate. His main research interests include binary code analysis and machine learning. LIU Shengli, born in 1973, Ph.D., professor. His main research interests include network device security and network attack detection.
Supported by:
Foundation Strengthening Key Project of Science & Technology Commission (2019-JCJQ-ZD-113).

Abstract

Abstract: Rich, readable source-level information is important for reverse engineering, and high-quality function names in particular are important for program comprehension. However, software publishers often release executables stripped of source-level debugging information, either to hinder reverse engineering or to reduce the size of the software, and the loss of readable information makes reverse analysis more difficult. To address this, a Multi-architecture Function Name Prediction (MFNP) method is proposed. It uses the LLVM-based decompiler RetDec to lift X86, ARM, and MIPS binaries into intermediate representation (IR) files, which resolves the differences between architectures. Function names in the IR .ll files are compared for morphological and semantic similarity, and similar names are fused to reduce the sparsity of the function name data. Basic blocks carrying the semantic information of sequential instructions, together with the control flow graph of the function body built from those basic blocks, are used as the semantic features of a function body and are combined with a neural network to predict function names in stripped binaries on three architectures: X86, MIPS, and ARM. Compared with DEBIN, the proposed method additionally supports function name prediction for stripped MIPS binaries. Compared with NERO, it improves Precision and F1 by 13.86% and 11.93%, respectively. Finally, the effectiveness of using sequential instruction sequences and control flow graphs extracted at basic-block granularity as semantic features is verified.

Key words: Static binary analysis, Intermediate representation, Function name prediction, Natural language processing, Neural networks
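As a rough illustration of the two feature types named in the abstract (per-block instruction sequences and a control flow graph over basic blocks), the following minimal sketch splits one function from textual LLVM IR into basic blocks and collects edges from `br` instructions. It is not the authors' implementation: the sample IR, the function name `blocks_and_cfg`, and the regex-based parsing are assumptions made for illustration, and `switch`, `invoke`, and other multi-target terminators are deliberately ignored.

```python
import re
from collections import defaultdict

# Hypothetical IR snippet for illustration; not taken from the paper's dataset.
SAMPLE_IR = """
define i32 @abs_val(i32 %x) {
entry:
  %neg = icmp slt i32 %x, 0
  br i1 %neg, label %flip, label %done
flip:
  %y = sub i32 0, %x
  br label %done
done:
  %r = phi i32 [ %y, %flip ], [ %x, %entry ]
  ret i32 %r
}
"""

LABEL_RE = re.compile(r"^([A-Za-z0-9_.$-]+):")        # a basic-block label line
TARGET_RE = re.compile(r"label %([A-Za-z0-9_.$-]+)")  # branch targets


def blocks_and_cfg(ir_text):
    """Split one function body into basic blocks and collect CFG edges.

    Returns (blocks, edges): blocks maps a label to its instruction list,
    edges maps a label to the labels its `br` instruction can jump to.
    """
    blocks = {}
    edges = defaultdict(list)
    current = None
    for raw in ir_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop trailing IR comments
        if not line or line.startswith("define") or line == "}":
            continue
        match = LABEL_RE.match(line)
        if match:
            current = match.group(1)
            blocks[current] = []
            continue
        if current is None:
            # Assumption: an unlabeled first block is named "entry" here.
            current = "entry"
            blocks[current] = []
        blocks[current].append(line)
        if line.split()[0] == "br":  # only `br` terminators are handled
            edges[current].extend(TARGET_RE.findall(line))
    return blocks, dict(edges)


blocks, edges = blocks_and_cfg(SAMPLE_IR)
```

Here `blocks` holds the sequential instruction view of each block and `edges` the graph view; a real pipeline would more likely rely on an LLVM binding or on the decompiler's own analysis than on regular expressions.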

CLC number: TP311

Cite this article: SHAO Wenqiang, CAI Ruijie, SONG Enzhou, GUO Xixi, LIU Shengli. Semantic-based Multi-architecture Binary Function Name Prediction Method[J]. Computer Science, 2023, 50(10): 369-376. https://doi.org/10.11896/jsjkx.220800175

