Computer Science ›› 2021, Vol. 48 ›› Issue (10): 286-293. doi: 10.11896/jsjkx.200900185

• Information Security •

Neural Network-based Binary Function Similarity Detection

FANG Lei1, WEI Qiang1, WU Ze-hui1, DU Jiang1, ZHANG Xing-ming2   

  1 State Key Laboratory of Mathematical Engineering and Advanced Computing, PLA Information Engineering University, Zhengzhou 450001, China
    2 Zhejiang Lab, Hangzhou 310001, China
  • Received: 2020-09-25  Revised: 2020-12-14  Online: 2021-10-15  Published: 2021-10-18
  • About author: FANG Lei, born in 1989, postgraduate, assistant engineer. His main research interests include network and information security.
    WEI Qiang, born in 1979, Ph.D, professor, Ph.D supervisor. His main research interests include network and information security.
  • Supported by:
    National Key Research and Development Program of China (2016QY07X1404, 2017YFB0802901) and the Advanced Industrial Internet Security Platform Project (2018FD0ZX01).

Abstract: Binary code similarity detection has extensive and important applications in program traceability and security auditing. In recent years, the application of neural network techniques to binary code similarity detection has broken through the performance bottleneck that traditional detection techniques encounter in large-scale detection tasks, and code similarity detection based on neural network embeddings has gradually become a research hotspot. This paper proposes a neural network-based binary function similarity detection technique. It first uses a uniform intermediate representation to eliminate the differences between the instruction architectures of assembly code. Secondly, at the basic block level, it uses a word embedding model from natural language processing to learn the intermediate representation code and obtain basic block semantic embeddings. Then, at the function level, it uses an improved graph neural network model to learn the control flow information of the function while taking the basic block semantics into account, yielding the final function embedding. Finally, the similarity between two functions is measured by the cosine distance between their embedding vectors. This paper also implements a prototype system based on this technique. Experiments show that the code representation learning process of this technique avoids introducing human bias, that the improved graph neural network is better suited to learning the control flow information of functions, and that both the scalability and the detection accuracy of the system are improved compared with existing schemes.
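To make the pipeline described above concrete, the following is a minimal sketch, not the authors' implementation: the toy intermediate-representation vocabulary, the random stand-in for a trained word2vec-style token table, the untrained message-passing weights, the averaging and sum-readout choices, and all dimensions are illustrative assumptions. It only shows the overall shape of the approach: basic-block semantic embeddings, a graph-neural-network pass over the control flow graph, and a cosine-distance comparison of the resulting function embeddings.

# A minimal sketch, NOT the paper's implementation: all names, dimensions,
# weights and the token vocabulary are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 64      # dimension of token / basic-block / function embeddings
N_ROUNDS = 3        # rounds of message passing over the control flow graph

# Stand-in for a word2vec-style model trained on intermediate-representation
# tokens; in a real system this table would be learned from code, not random.
VOCAB = ["t_load", "t_store", "t_add", "t_cmp", "t_jmp", "t_call", "t_ret"]
token_table = {tok: rng.standard_normal(EMBED_DIM) for tok in VOCAB}

def block_embedding(tokens):
    # Average the token vectors of one basic block (a simple aggregation
    # choice used here only for illustration).
    vecs = [token_table[t] for t in tokens if t in token_table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(EMBED_DIM)

# Untrained weights standing in for learned graph-neural-network parameters.
W_self = 0.1 * rng.standard_normal((EMBED_DIM, EMBED_DIM))
W_neigh = 0.1 * rng.standard_normal((EMBED_DIM, EMBED_DIM))

def function_embedding(blocks, edges):
    # blocks: list of token lists, one per basic block.
    # edges:  list of (src, dst) index pairs from the control flow graph.
    x = np.stack([block_embedding(b) for b in blocks])   # block semantics
    preds = {i: [] for i in range(len(blocks))}
    for s, d in edges:
        preds[d].append(s)                               # messages flow along CFG edges
    h = x.copy()
    for _ in range(N_ROUNDS):
        msg = np.stack([h[preds[i]].sum(axis=0) if preds[i] else np.zeros(EMBED_DIM)
                        for i in range(len(blocks))])
        h = np.tanh(x @ W_self + msg @ W_neigh)          # combine semantics and structure
    return h.sum(axis=0)                                 # graph-level readout

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Toy comparison of two nearly identical functions.
f1 = function_embedding([["t_load", "t_add"], ["t_cmp", "t_jmp"], ["t_ret"]],
                        [(0, 1), (1, 2)])
f2 = function_embedding([["t_load", "t_add"], ["t_cmp", "t_jmp"], ["t_call", "t_ret"]],
                        [(0, 1), (1, 2)])
print("similarity:", cosine_similarity(f1, f2))

In a trained system the token table and the message-passing weights would be learned jointly (or in stages) on labeled function pairs; the sketch only illustrates how block semantics and control flow structure combine into a single vector that can be compared by cosine distance.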

Key words: Binary function, Graph neural network, Representation learning, Similarity detection

CLC Number: TP313