Computer Science, 2024, Vol. 51, Issue (7): 389-396. doi: 10.11896/jsjkx.230300117

• Information Security •

PDF Malicious Indicators Extraction Technique Based on Improved Symbolic Execution

SONG Enzhou, HU Tao, YI Peng, WANG Wenbo

  1. National Digital Switching System Engineering Technological R&D Center, Zhengzhou 450001, China
  • Received: 2023-03-14 Revised: 2023-07-03 Online: 2024-07-15 Published: 2024-07-10
  • Corresponding author: HU Tao (hutaondsc@163.com)
  • About author (391032473@qq.com): SONG Enzhou, born in 1998, postgraduate. His main research interests include malicious document analysis, binary code analysis and vulnerability mining.
    HU Tao, born in 1993, Ph.D., assistant researcher. His main research interests include intrinsic security, intrusion detection and SDN.
  • Supported by:
    National Natural Science Foundation of China, General Program (62176264).

Abstract: Malicious PDF documents are a common attack vector used by APT groups. Extracting and analyzing indicators from their embedded JavaScript code is an important means of determining whether a document is malicious. However, attackers can evade analysis through heavy obfuscation and through virtual machine and sandbox detection. This paper therefore applies symbolic execution to PDF indicator extraction. We propose a PDF malicious indicator extraction technique based on improved symbolic execution and implement SYMBPDF, an indicator extraction system consisting of three modules: code parsing, symbolic execution and indicator extraction. The code parsing module extracts and reassembles the embedded JavaScript code. The symbolic execution module uses a code rewriting method that forces branch transfers to raise the code coverage of symbolic execution, together with a concurrency strategy and two constraint solving optimizations to improve execution efficiency. The indicator extraction module consolidates and records the malicious indicators. In an evaluation on 1,271 malicious samples, the indicator extraction success rate is 92.2% and the indicator effectiveness is 91.7%; code coverage is 8.5% higher and system performance is 32.3% higher than before optimization.

Key words: Malicious documents, JavaScript code, Indicator extraction, Symbolic execution, Code rewriting, Constraint solving optimization
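
The abstract's idea of forcing branch transfers to raise coverage can be illustrated with a small sketch. It is not SYMBPDF's implementation: the abstract does not describe how the rewriting is done, so the toy syntax tree, the helper names (`run_concrete`, `run_forced`, `indicators`) and the sample script are all assumptions made for illustration. The sketch only shows why an environment check hides an indicator from a single concrete run, and how visiting both arms of each conditional recovers it; it does no constraint solving.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Call:
    """A call with one literal argument, e.g. app.launchURL("http://...")."""
    api: str
    arg: str


@dataclass
class If:
    """A conditional whose test depends on the runtime environment."""
    test: str
    then: List["Node"] = field(default_factory=list)
    orelse: List["Node"] = field(default_factory=list)


Node = Union[Call, If]


def run_concrete(nodes: List[Node], env: Dict[str, bool]) -> List[Call]:
    """Follow only the arm selected by the concrete environment."""
    calls: List[Call] = []
    for node in nodes:
        if isinstance(node, Call):
            calls.append(node)
        else:
            arm = node.then if env.get(node.test, False) else node.orelse
            calls.extend(run_concrete(arm, env))
    return calls


def run_forced(nodes: List[Node]) -> List[Call]:
    """Visit both arms of every conditional, ignoring the test outcome."""
    calls: List[Call] = []
    for node in nodes:
        if isinstance(node, Call):
            calls.append(node)
        else:
            calls.extend(run_forced(node.then))
            calls.extend(run_forced(node.orelse))
    return calls


def indicators(calls: List[Call]) -> List[str]:
    """Keep URL-like arguments as candidate indicators (hypothetical rule)."""
    return sorted({c.arg for c in calls if c.arg.startswith(("http://", "https://"))})


# Hypothetical script: the payload URL is reached only if an environment check passes.
SCRIPT: List[Node] = [
    If(
        test="viewer_looks_like_real_target",
        then=[Call("app.launchURL", "http://malicious.example/payload")],
        orelse=[Call("app.alert", "Nothing to see")],
    )
]

if __name__ == "__main__":
    print(indicators(run_concrete(SCRIPT, {})))  # analysis environment fails the check
    print(indicators(run_forced(SCRIPT)))        # both arms are explored
```

In this toy setting the concrete run in an analysis environment yields no URL, while the forced run reports the payload URL. A real symbolic executor would additionally track path constraints for each arm, which is where constraint solving optimizations like those mentioned in the abstract would apply.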

CLC Number:

  • TP311