Computer Science, 2021, Vol. 48, Issue (5): 1-8. doi: 10.11896/jsjkx.200400085

• Computer Software •

Summary of Binary Code Similarity Detection Techniques

FANG Lei, WU Ze-hui, WEI Qiang   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Information Engineering University, Zhengzhou 450001, China
  • Received: 2020-04-20 Revised: 2020-07-30 Online: 2021-05-15 Published: 2021-05-09
  • About author: FANG Lei, born in 1989, postgraduate, assistant engineer. His main research interests include network information security. (nanbeiyouzi@qq.com)
    WEI Qiang, born in 1979, Ph.D., professor, Ph.D. supervisor. His main research interests include network information security.
  • Supported by:
    National Key Research and Development Project (2017YFB0803202), Advanced Industrial Internet Security Platform Project (2018FD0ZX01), and Henan Soft Science Research Program Project (192102210128).

Abstract: Code similarity detection is commonly used in code prediction, intellectual property protection, and vulnerability scanning. It covers both source code similarity detection and binary code similarity detection. Because source code is often unavailable, binary code similarity detection is more widely applicable, and a variety of detection techniques have been proposed in academia. This paper reviews recent research in the field. First, we summarize the basic process of code similarity detection and the challenges it faces, including cross-compiler, cross-optimization, and cross-architecture detection. Then, according to the kind of code information each technique relies on, we classify current binary code similarity detection techniques into four categories: text-based, attribute-measurement-based, program-logic-based, and semantic-based. We list representative methods and tools, such as Karta, discovRE, Genius, Gemini, and SAFE. Finally, drawing on the development history of the field and the latest research, we analyze and discuss its future directions.
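As a rough illustration of the text-based category named in the abstract, the sketch below scores two functions by the Jaccard similarity of their instruction-mnemonic n-grams. The function names, the toy instruction listings, and the choice of 2-grams are all assumptions made for illustration; this is not the method of any tool covered by the survey.

```python
def mnemonic_ngrams(instructions, n=2):
    """Return the set of n-grams over instruction mnemonics (operands are dropped)."""
    mnemonics = [ins.split()[0] for ins in instructions if ins.strip()]
    return {tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Jaccard similarity of the mnemonic n-gram sets of two instruction lists."""
    grams_a, grams_b = mnemonic_ngrams(a, n), mnemonic_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty functions are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical disassembly of one small function under two register allocations.
func_a = ["push rbp", "mov rbp, rsp", "mov eax, edi",
          "add eax, esi", "pop rbp", "ret"]
func_b = ["push rbp", "mov rbp, rsp", "mov ecx, edi",
          "add ecx, esi", "mov eax, ecx", "pop rbp", "ret"]

score = jaccard_similarity(func_a, func_b)
print(f"similarity = {score:.2f}")
```

A surface comparison of this kind tends to degrade when the compiler, optimization level, or target architecture changes the emitted instructions, which are the challenges the abstract lists; the other three categories rely on different code information.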

Key words: Binary program, Code similarity detection, Software security

CLC Number: TP311