Computer Science ›› 2021, Vol. 48 ›› Issue (10): 286-293. doi: 10.11896/jsjkx.200900185
FANG Lei¹, WEI Qiang¹, WU Ze-hui¹, DU Jiang¹, ZHANG Xing-ming²
Abstract: Binary code similarity detection has wide and important applications in program provenance tracing and security auditing. In recent years, neural network techniques have been applied to binary code similarity detection, overcoming the performance bottleneck that traditional techniques hit in large-scale detection tasks; code similarity detection based on neural-network embeddings has therefore become an active research topic. This paper proposes a neural-network-based binary function similarity detection technique. First, a unified intermediate representation is used to eliminate the differences between assembly code from different instruction-set architectures. Second, at the basic-block level, a word-embedding model from natural language processing is used to learn the intermediate-representation code and obtain a semantic embedding for each basic block. Then, at the function level, an improved graph neural network model learns the function's control-flow information while taking the basic-block semantics into account, yielding the final function embedding. Finally, the similarity between two functions is measured by the cosine distance between their embedding vectors. A prototype system based on this technique is implemented. Experiments show that its representation-learning process avoids introducing human bias, that the improved graph neural network is better suited to learning a function's control-flow information, and that both the scalability of the system and its detection accuracy improve on existing schemes.
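The pipeline the abstract describes (basic-block embeddings from IR tokens, message passing over the control-flow graph, cosine similarity between function vectors) can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the IR vocabulary, the random embedding and weight matrices, and the mean-of-token-vectors block encoder are all simplified stand-ins for the learned models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy IR "vocabulary" with randomly initialised embeddings; in the paper
# these would be learned by a word-embedding model over the unified
# intermediate representation. All names here are illustrative.
EMB_DIM = 8
VOCAB = ["t0 = GET(eax)", "t1 = Add32(t0,0x1)", "PUT(eax) = t1", "STle(addr) = t1"]
token_vec = {tok: rng.standard_normal(EMB_DIM) for tok in VOCAB}

# Shared (untrained) message-passing weights, so every function is embedded
# with the same parameters and the resulting vectors are comparable.
W1 = 0.1 * rng.standard_normal((EMB_DIM, EMB_DIM))
W2 = 0.1 * rng.standard_normal((EMB_DIM, EMB_DIM))

def block_embedding(ir_statements):
    """Basic-block semantic embedding: here simply the mean of the block's
    IR token vectors (a stand-in for the learned word-embedding model)."""
    vecs = [token_vec[s] for s in ir_statements if s in token_vec]
    return np.mean(vecs, axis=0) if vecs else np.zeros(EMB_DIM)

def function_embedding(blocks, cfg_edges, rounds=3):
    """Structure2vec-style message passing over the control-flow graph:
    each node repeatedly aggregates its predecessors' states, then the
    final node states are summed into one function-level vector."""
    x = np.stack([block_embedding(b) for b in blocks])  # node features
    mu = np.zeros_like(x)                               # node states
    for _ in range(rounds):
        agg = np.zeros_like(mu)
        for u, v in cfg_edges:                          # CFG edge u -> v
            agg[v] += mu[u]
        mu = np.tanh(x @ W1 + agg @ W2)
    return mu.sum(axis=0)

def cosine_similarity(a, b):
    """Score in [-1, 1]; the final function-similarity metric."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Comparing a function's embedding with itself yields a similarity of 1.0; structurally and semantically different functions score lower.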
[1] 周芳泉, 成卫青. Sequence Recommendation Based on Global Enhanced Graph Neural Network. Computer Science, 2022, 49(9): 55-63. https://doi.org/10.11896/jsjkx.210700085
[2] 黄丽, 朱焱, 李春平. Author's Academic Behavior Prediction Based on Heterogeneous Network Representation Learning. Computer Science, 2022, 49(9): 76-82. https://doi.org/10.11896/jsjkx.210900078
[3] 李宗民, 张玉鹏, 刘玉杰, 李华. Deformable Graph Convolutional Networks Based Point Cloud Representation Learning. Computer Science, 2022, 49(8): 273-278. https://doi.org/10.11896/jsjkx.210900023
[4] 闫佳丹, 贾彩燕. Text Classification Method Based on Information Fusion of Dual-graph Neural Network. Computer Science, 2022, 49(8): 230-236. https://doi.org/10.11896/jsjkx.210600042
[5] 齐秀秀, 王佳昊, 李文雄, 周帆. Fusion Algorithm for Matrix Completion Prediction Based on Probabilistic Meta-learning. Computer Science, 2022, 49(7): 18-24. https://doi.org/10.11896/jsjkx.210600126
[6] 杨炳新, 郭艳蓉, 郝世杰, 洪日昌. Application of Graph Neural Network Based on Data Augmentation and Model Ensemble in Depression Recognition. Computer Science, 2022, 49(7): 57-63. https://doi.org/10.11896/jsjkx.210800070
[7] 熊中敏, 舒贵文, 郭怀宇. Graph Neural Network Recommendation Model Integrating User Preferences. Computer Science, 2022, 49(6): 165-171. https://doi.org/10.11896/jsjkx.210400276
[8] 邓朝阳, 仲国强, 王栋. Text Classification Based on Attention Gated Graph Neural Network. Computer Science, 2022, 49(6): 326-334. https://doi.org/10.11896/jsjkx.210400218
[9] 彭云聪, 秦小林, 张力戈, 顾勇翔. Survey on Few-shot Learning Algorithms for Image Classification. Computer Science, 2022, 49(5): 1-9. https://doi.org/10.11896/jsjkx.210500128
[10] 余皑欣, 冯秀芳, 孙静宇. Social Trust Recommendation Algorithm Combining Item Similarity. Computer Science, 2022, 49(5): 144-151. https://doi.org/10.11896/jsjkx.210300217
[11] 李勇, 吴京鹏, 张钟颖, 张强. Link Prediction for Node Featureless Networks Based on Faster Attention Mechanism. Computer Science, 2022, 49(4): 43-48. https://doi.org/10.11896/jsjkx.210800276
[12] 曹合心, 赵亮, 李雪峰. Technical Research of Graph Neural Network for Text-to-SQL Parsing. Computer Science, 2022, 49(4): 110-115. https://doi.org/10.11896/jsjkx.210200173
[13] 苗旭鹏, 周跃, 邵蓥侠, 崔斌. GSO: A GNN-based Deep Learning Computation Graph Substitutions Optimization Framework. Computer Science, 2022, 49(3): 86-91. https://doi.org/10.11896/jsjkx.210700199
[14] 叶洪良, 朱皖宁, 洪蕾. Music Style Transfer Method with Human Voice Based on CQT and Mel-spectrum. Computer Science, 2021, 48(6A): 326-330. https://doi.org/10.11896/jsjkx.200900104
[15] 张人之, 朱焱. Malicious User Detection Method for Social Network Based on Active Learning. Computer Science, 2021, 48(6): 332-337. https://doi.org/10.11896/jsjkx.200700151