Computer Science, 2021, Vol. 48, Issue (5): 1-8. doi: 10.11896/jsjkx.200400085

• Computer Software

Summary of Binary Code Similarity Detection Techniques

FANG Lei, WU Ze-hui, WEI Qiang

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Information Engineering University, Zhengzhou 450001, China
  • Received: 2020-04-20 Revised: 2020-07-30 Online: 2021-05-15 Published: 2021-05-09
  • Corresponding author: WEI Qiang (prof_weiqiang@163.com)
  • About author: FANG Lei, born in 1989, postgraduate, assistant engineer. His main research interests include network information security. (nanbeiyouzi@qq.com)
    WEI Qiang, born in 1979, Ph.D., professor, Ph.D. supervisor. His main research interests include network information security.
  • Supported by:
    National Key Research and Development Project (2017YFB0803202), Zhejiang Lab "Advanced Industrial Internet Security Platform" Project (2018FD0ZX01), and Henan Soft Science Research Program Project (192102210128).

Abstract: Code similarity detection is commonly used in code prediction, intellectual property protection, and vulnerability search. It can be divided into source code similarity detection and binary code similarity detection. Because software source code is usually difficult to obtain, binary code similarity detection applies to a wider range of scenarios, and academia has proposed a variety of detection techniques. This paper reviews recent research in the field. First, it summarizes the basic process of code similarity detection and the challenges it must address, such as cross-compiler, cross-optimization-level, and cross-architecture detection. Then, according to the kind of code information each technique relies on, it classifies current binary code similarity detection techniques into four categories: text-based, attribute-metric-based, program-logic-based, and semantics-based. It also lists representative methods and tools, such as Karta, discovRE, Genius, Gemini, and SAFE. Finally, based on the field's development and the latest research results, it analyzes and discusses future directions.

Key words: Binary program, Code similarity detection, Software security

CLC Number:

  • TP311