Semantic-based Multi-architecture Binary Function Name Prediction Method

doi:10.11896/jsjkx.220800175

Abstract

Rich, readable source-level information is important for reverse engineering; high-quality function names in particular are essential for program understanding. However, software publishers often release executables stripped of source-level debugging information, either to hinder reverse engineering or to reduce the size of the software, and the resulting lack of readable information makes reverse analysis more difficult. To this end, a multi-architecture function name prediction (MFNP) method is proposed. To resolve the differences between architectures, it uses the LLVM-based decompiler RetDec to decompile X86, ARM, and MIPS binaries into intermediate representation (IR) files. Function names in the readable .ll IR files are compared for morphological and semantic similarity, and similar names are fused to reduce the sparsity of function name data. Basic blocks, which carry the semantic information of sequential instructions, and the control flow graph of the function body, which takes basic blocks as its units, serve as the semantic features of function bodies; combined with neural networks, they enable function name prediction for stripped binaries on the three architectures X86, MIPS, and ARM. Compared with DEBIN, MFNP additionally supports function name prediction for stripped binaries on the MIPS architecture. Compared with NERO, Precision and F1 improve by 13.86% and 11.93%, respectively. These results verify the effectiveness of MFNP's choice of semantic features: sequential instruction sequences and control flow graphs extracted with basic blocks as the basic unit.
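
The name-fusion step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which similarity measures are used, so `difflib.SequenceMatcher` stands in here for morphological similarity, semantic similarity is left out entirely, and the helper names (`subtokens`, `morphological_similarity`, `fuse_names`) and the 0.85 threshold are hypothetical.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def subtokens(name: str) -> list[str]:
    """Split a function name into lowercase subtokens (snake_case, camelCase)."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [t.lower() for t in tokens if t]


def morphological_similarity(a: str, b: str) -> float:
    """Character-level similarity of the normalized names, in [0, 1].

    Assumption: a stand-in measure, not the one used in the paper.
    """
    norm_a = " ".join(subtokens(a))
    norm_b = " ".join(subtokens(b))
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def fuse_names(names: list[str], threshold: float = 0.85) -> dict[str, str]:
    """Map every name to a canonical representative.

    Names are visited from most to least frequent. A name joins the first
    representative whose similarity reaches the (hypothetical) threshold;
    otherwise it becomes a new representative.
    """
    representatives: list[str] = []
    mapping: dict[str, str] = {}
    for name, _count in Counter(names).most_common():
        for rep in representatives:
            if morphological_similarity(name, rep) >= threshold:
                mapping[name] = rep
                break
        else:
            representatives.append(name)
            mapping[name] = name
    return mapping


if __name__ == "__main__":
    labels = ["read_file", "readFile", "read_files", "close_socket"]
    for original, fused in fuse_names(labels).items():
        print(f"{original} -> {fused}")
```

Collapsing near-duplicate labels in this way shrinks the set of distinct target names, which is the sparsity reduction the abstract refers to. A fuller version would also need a semantic measure (for example, one based on word embeddings) to merge synonyms that share few characters.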

Key words: Static binary analysis, Intermediate representations, Function name prediction, Natural language processing, Neural networks

CLC Number: TP311

SHAO Wenqiang, CAI Ruijie, SONG Enzhou, GUO Xixi, LIU Shengli. Semantic-based Multi-architecture Binary Function Name Prediction Method[J].Computer Science, 2023, 50(10): 369-376.
