Computer Science, 2023, Vol. 50, Issue (10): 369-376. doi: 10.11896/jsjkx.220800175
邵文强1, 蔡瑞杰1, 宋恩舟2, 郭茜茜1, 刘胜利1
SHAO Wenqiang1, CAI Ruijie1, SONG Enzhou2, GUO Xixi1, LIU Shengli1
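Abstract: Rich, readable source-level information is of great value in reverse engineering, and high-quality function names in particular are essential to program comprehension. However, whether to hinder reverse engineering or to reduce file size, software vendors often release executables stripped of source-level debugging information, and the loss of this readable information makes reverse analysis harder. This paper therefore proposes a Multi-architecture Function Name Prediction (MFNP) method. MFNP uses the LLVM-based decompiler RetDec to lift X86, ARM, and MIPS binaries to intermediate representation (IR) files, which removes the differences among architectures. Function names in the resulting .ll IR files are compared for morphological and semantic similarity, and similar names are merged to reduce the sparsity of the function-name data. Basic blocks, which carry the semantics of sequential instructions, and the function-body control flow graph, whose nodes are those basic blocks, serve as the semantic features of a function body. Combined with a neural network, these features enable function name prediction for stripped binaries on the three architectures X86, MIPS, and ARM. Compared with DEBIN, the proposed method additionally supports function name prediction for stripped MIPS binaries, and it improves on NERO by 13.86% in Precision and 11.93% in F1. Finally, experiments verify the effectiveness of MFNP's choice of basic-block-level sequential instruction sequences and control flow graphs as semantic features.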
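The name-merging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the morphological side of the comparison (the semantic comparison is omitted), it assumes names are split on underscores and camelCase boundaries, it uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure, and the helper names and the 0.85 threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher


def subtokens(name):
    """Split a function name into lowercase subtokens (snake_case and camelCase)."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        # Break camelCase, acronym runs, and digit runs into separate tokens.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [t.lower() for t in tokens]


def morphological_similarity(a, b):
    """Similarity in [0, 1] between the normalized subtoken strings of two names."""
    return SequenceMatcher(None, " ".join(subtokens(a)), " ".join(subtokens(b))).ratio()


def merge_similar_names(names, threshold=0.85):
    """Greedily map each name to the first earlier representative that is similar enough.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    representatives = []
    mapping = {}
    for name in names:
        for rep in representatives:
            if morphological_similarity(name, rep) >= threshold:
                mapping[name] = rep
                break
        else:
            # No existing representative is close enough: start a new group.
            representatives.append(name)
            mapping[name] = name
    return mapping


if __name__ == "__main__":
    labels = merge_similar_names(["read_file", "readFile", "send_packet"])
    for original, merged in labels.items():
        print(f"{original} -> {merged}")
```

Mapping several spellings of one name onto a single label shrinks the output vocabulary, which is the sparsity reduction the abstract refers to. A fuller version would also need a semantic measure so that names with different spellings but the same meaning are merged.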
[1]VOTIPKA D,RABIN S,MICINSKI K,et al.An Observational Investigation of Reverse Engineers' Processes[C]//29th USENIX Security Symposium(USENIX Security 20).2020:1875-1892.
[2]GELLENBECK E M,COOK C R.An investigation of procedure and variable names as beacons during program comprehension[C]//Empirical Studies of Programmers:Fourth Workshop.Ablex Publishing,Norwood,NJ,1991:65-81.
[3]ALLAMANIS M,BARR E T,BIRD C,et al.Suggesting accurate method and class names[C]//Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering.2015:38-49.
[4]ALON U,SADAKA R,LEVY O,et al.Structural language models for any-code generation[J].arXiv:1910.00577,2019.
[5]FOWLER M.Refactoring:improving the design of existing code[M].Hoboken:Addison-Wesley Professional,2018.
[6]HØST E W,ØSTVOLD B M.Debugging method names[C]//European Conference on Object-Oriented Programming.Berlin:Springer,2009:294-317.
[7]JACOBSON E R,ROSENBLUM N,MILLER B P.Labeling library functions in stripped binaries[C]//Proceedings of the 10th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools.2011:1-8.
[8]LIU Z,WANG S.How far we have come:testing decompilation correctness of C decompilers[C]//Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis.2020:475-487.
[9]AHMAD W U,CHAKRABORTY S,RAY B,et al.A transformer-based approach for source code summarization[J].arXiv:2005.00653,2020.
[10]LECLAIR A,JIANG S,MCMILLAN C.A neural model for generating natural language summaries of program subroutines[C]//2019 IEEE/ACM 41st International Conference on Software Engineering(ICSE).IEEE,2019:795-806.
[11]HE J,IVANOV P,TSANKOV P,et al.Debin:Predicting debug information in stripped binaries[C]//Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security.2018:1667-1680.
[12]DAVID Y,ALON U,YAHAV E.Neural reverse engineering of stripped binaries using augmented control flow graphs[C]//Proceedings of the ACM on Programming Languages.2020:1-28.
[13]VENKATAKEERTHY S,AGGARWAL R,JAIN S,et al.IR2VEC:LLVM IR Based Scalable Program Embeddings[J].ACM Transactions on Architecture and Code Optimization,2020,17(4):1-27.
[14]DING S H H,FUNG B C M,CHARLAND P.Asm2vec:Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization[C]//2019 IEEE Symposium on Security and Privacy(SP).IEEE,2019:472-489.
[15]ALON U,ZILBERSTEIN M,LEVY O,et al.code2vec:Learning distributed representations of code[C]//Proceedings of the ACM on Programming Languages.2019:1-29.
[16]PATRICK-EVANS J,CAVALLARO L,KINDER J.Probabilistic naming of functions in stripped binaries[C]//Annual Computer Security Applications Conference.2020:373-385.
[17]GAO H,CHENG S,XUE Y,et al.A lightweight framework for function name reassignment based on large-scale stripped binaries[C]//Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis.2021:607-619.
[18]BAN G,XU L,XIAO Y,et al.B2SMatcher:fine-grained version identification of open-source software in binary files[J].Cybersecurity,2021,4(1):1-21.
[19]DUAN Y,LI X,WANG J,et al.Deepbindiff:Learning program-wide code representations for binary diffing[C]//Network and Distributed System Security Symposium.2020.
[20]XU X,LIU C,FENG Q,et al.Neural network-based graph embedding for cross-platform binary code similarity detection[C]//Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security.2017:363-376.
[21]PEROZZI B,AL-RFOU R,SKIENA S.Deepwalk:Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2014:701-710.
[22]ZHAO J,NAGARAKATTE S,MARTIN M M K,et al.Formal verification of SSA-based optimizations for LLVM[C]//Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation.2013:175-186.
[23]BELYAEV M,TSESKO V.LLVM-based static analysis tool using type and effect systems[J].Automatic Control and Computer Sciences,2012,46(7):324-330.
[24]LI Z,ZOU D,XU S,et al.Vuldeepecker:A deep learning-based system for vulnerability detection[J].arXiv:1801.01681,2018.
[25]KIDANE L,KUMAR S,TSVETKOV Y.An Exploration of Data Augmentation Techniques for Improving English to Tigrinya Translation[J].arXiv:2103.16789,2021.
[26]ZHANG Q,LU H,SAK H,et al.Transformer transducer:A streamable speech recognition model with transformer encoders and rnn-t loss[C]//2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2020).IEEE,2020:7829-7833.
[27]NEELIMA D,KARTHIK J,ARAVIND JOHN K,et al.Soft Computing-Based Intrusion Detection Approaches:An Analytical Study[M]//Soft Computing in Data Analytics.Singapore:Springer,2019:635-651.
[28]HUANG Z,ZHU Z,MA W,et al.B2GAN:Bidirectional-branch generative adversarial network for text image super-resolution with structure preservation[J].Optik,2022,261:169093.
[29]SUTSKEVER I,VINYALS O,LE Q V.Sequence to sequence learning with neural networks[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2(NIPS'14).MIT Press,Cambridge,MA,USA,2014:3104-3112.
[30]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems(NIPS'17).Red Hook,NY,USA,2017:6000-6010.
[31]CHO K,VAN MERRIËNBOER B,GULCEHRE C,et al.Learning phrase representations using RNN encoder-decoder for statistical machine translation[J].arXiv:1406.1078,2014.
[32]KLEIN G,KIM Y,DENG Y,et al.Opennmt:Open-source toolkit for neural machine translation[J].arXiv:1701.02810,2017.
[33]ALLAMANIS M,PENG H,SUTTON C.A convolutional attention network for extreme summarization of source code[C]//International Conference on Machine Learning.PMLR,2016:2091-2100.