Computer Science (计算机科学), 2021, Vol. 48, Issue (10): 286-293. doi: 10.11896/jsjkx.200900185

• Information Security •

Neural Network-based Binary Function Similarity Detection Technique

FANG Lei1, WEI Qiang1, WU Ze-hui1, DU Jiang1, ZHANG Xing-ming2

  1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Information Engineering University, Zhengzhou 450001, China
    2 Zhejiang Lab, Hangzhou 310001, China
  • Received: 2020-09-25  Revised: 2020-12-14  Online: 2021-10-15  Published: 2021-10-18
  • Corresponding author: WEI Qiang (prof_weiqiang@163.com)
  • About author: nanbeiyouzi@live.com
  • Supported by:
    National Key Research and Development Program of China (2016QY07X1404, 2017YFB0802901) and Zhejiang Lab "Advanced Industrial Internet Security Platform" Project (2018FD0ZX01).

Neural Network-based Binary Function Similarity Detection

FANG Lei1, WEI Qiang1, WU Ze-hui1, DU Jiang1, ZHANG Xing-ming2   

  1 State Key Laboratory of Mathematical Engineering and Advanced Computing, PLA Information Engineering University, Zhengzhou 450001, China
    2 Zhejiang Lab,Hangzhou 310001,China
  • Received:2020-09-25 Revised:2020-12-14 Online:2021-10-15 Published:2021-10-18
  • About author: FANG Lei, born in 1989, postgraduate, assistant engineer. His main research interests include network information security.
    WEI Qiang, born in 1979, Ph.D., professor, Ph.D. supervisor. His main research interests include network information security.
  • Supported by:
National Key Research and Development Program of China (2016QY07X1404, 2017YFB0802901) and Zhejiang Lab "Advanced Industrial Internet Security Platform" Project (2018FD0ZX01).

Abstract: Binary code similarity detection has extensive and important applications in program traceability and security auditing. In recent years, neural network techniques have been applied to binary code similarity detection, breaking through the performance bottleneck that traditional detection techniques encounter in large-scale detection tasks, so code similarity detection based on neural network embeddings has gradually become a research hotspot. This paper proposes a neural network-based binary function similarity detection technique. First, a uniform intermediate representation is used to eliminate the differences in instruction architecture among assembly codes. Second, at the basic block level, a word embedding model from natural language processing is used to learn the intermediate representation code and obtain basic block semantic embeddings. Then, at the function level, an improved graph neural network model is used to learn the control flow information of the function while taking the basic block semantics into account, yielding the final function embedding. Finally, the similarity between two functions is measured by the cosine distance between their embedding vectors. A prototype system based on this technique is implemented. Experiments show that the code representation learning process avoids the introduction of human bias, that the improved graph neural network is better suited to learning the control flow information of functions, and that both the scalability and the detection accuracy of the system improve on existing schemes.
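The basic-block semantic embedding step described in the abstract can be illustrated with a minimal sketch: given token vectors produced by a word2vec-style model over the intermediate-representation corpus, a block embedding is formed by averaging its token vectors. The token strings, vector table, and two-dimensional vectors below are hypothetical illustrations, not the paper's actual model:

```python
import numpy as np

# Hypothetical word-vector table; a real system would train these with a
# word2vec-style model over the intermediate-representation corpus.
word_vectors = {
    "t0 = LOAD(sp)": np.array([1.0, 0.0]),
    "t1 = ADD(t0, 4)": np.array([0.0, 1.0]),
}

def block_embedding(tokens, word_vectors, dim=2):
    """Embed a basic block as the mean of its token vectors.

    tokens       -- list of IR token strings for one basic block
    word_vectors -- dict mapping token -> vector; unknown tokens are skipped
    """
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)  # empty or fully-unknown block -> zero vector
    return np.mean(vecs, axis=0)
```

Averaging is only one possible pooling choice; the point is that each basic block is reduced to a fixed-length semantic vector before any graph-level learning happens.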

Key words: Representational learning, Binary function, Graph neural network, Similarity detection

Abstract: Binary code similarity detection has extensive and important applications in program traceability and security audit. In recent years, the application of neural network technology in binary code similarity detection has broken through the performance bottleneck encountered by traditional detection technology in large-scale detection tasks, making code similarity detection based on neural network embedding gradually become a research hotspot. This paper proposes a neural network-based binary function similarity detection technique. It first uses a uniform intermediate representation to eliminate the differences in instruction architecture among assembly codes. Secondly, at the basic block level, it uses a word embedding model from natural language processing to learn the intermediate representation code and obtain basic block semantic embeddings. Then, at the function level, it uses an improved graph neural network model to learn the control flow information of the function, taking the basic block semantics into consideration at the same time, to obtain the final function embedding. Finally, the similarity between two functions is measured by calculating the cosine distance between the two function embedding vectors. This paper also implements a prototype system based on this technique. Experiments show that the program code representation learning process of this technique avoids the introduction of human bias, that the improved graph neural network is better suited to learning the control flow information of functions, and that both the scalability and the detection accuracy of the system are improved compared with existing schemes.
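The function-level aggregation and the final cosine-distance measurement can likewise be sketched. The sum-and-tanh message passing below is a generic stand-in assuming only numpy: each basic block's state is repeatedly combined with its CFG predecessors' states, and the block states are then pooled into one function vector. The paper's improved graph neural network is more elaborate than this; all names here are illustrative:

```python
import numpy as np

def function_embedding(block_vecs, edges, rounds=2):
    """Aggregate basic-block embeddings over the CFG by simple message passing.

    block_vecs -- (n, d) array, one semantic embedding per basic block
    edges      -- list of (src, dst) control-flow edges between block indices
    """
    x = np.asarray(block_vecs, dtype=float)
    h = x.copy()
    for _ in range(rounds):
        msg = np.zeros_like(h)
        for s, d in edges:           # each block receives its predecessors' states
            msg[d] += h[s]
        h = np.tanh(x + msg)         # combine own semantics with neighborhood
    return h.sum(axis=0)             # pool block states into one function vector

def cosine_similarity(u, v):
    """Similarity score in [-1, 1]; identical directions score 1."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In use, two binaries' functions would each be mapped through this pipeline and ranked by `cosine_similarity` of their embeddings, so a large repository can be searched with cheap vector comparisons instead of pairwise graph matching.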

Key words: Binary function, Graph neural network, Representational learning, Similarity detection

CLC number: TP313