Computer Science, 2024, Vol. 51, Issue (12): 53-62. doi: 10.11896/jsjkx.231100179

• Computer Software •

DeepGenFuzz: An Efficient PDF Application Fuzzing Test Case Generation Framework Based on Deep Learning

LIU Jiahao, JIANG He   

  1. School of Software, Dalian University of Technology, Dalian, Liaoning 116600, China
  • Received: 2023-11-27 Revised: 2024-04-28 Online: 2024-12-15 Published: 2024-12-10
  • About author: LIU Jiahao, born in 1999, postgraduate. His main research interests include software testing and deep learning.
    JIANG He, born in 1980, professor, Ph.D. supervisor, is a member of CCF (No.08846D). His main research interests include system software and intelligent software engineering.

Abstract: PDF is a widely used and important document format. Because PDF files are complex, defects in PDF-related applications can lead to serious consequences such as malicious attacks and incorrect rendering of information. Testing PDF-related applications has therefore become an active research topic. The most effective method at present is grammar-based fuzz testing, but it often requires substantial manual effort to summarize and write complex grammar rules, which seriously hinders efficient, automated test case generation. Deep learning techniques offer a feasible solution to this problem. However, the test cases generated by current deep learning methods are generally of low quality and have poor bug-finding ability. Improving on this requires addressing three main challenges: filtering the data set, balancing gains in test case coverage against growth in test case size, and mutating test cases efficiently. This paper therefore proposes DeepGenFuzz, an efficient deep learning-based framework for generating fuzz test cases for PDF applications. It uses models such as CNN, Seq2Seq, and Transformer to generate high-quality PDF test cases through four steps: data filtering, object generation, object appending, and efficient mutation. Evaluations on PDF applications such as MuPDF show that the test cases generated by DeepGenFuzz achieve significantly higher average code coverage than those of state-of-the-art tools such as Learn&Fuzz and IUST-DeepFuzz, with improvements ranging from 8.12% to 61.03%. Its bug-finding capability is also far superior to that of Learn&Fuzz and IUST-DeepFuzz. To date, DeepGenFuzz has found 31 previously unreported bugs in the seven most popular PDF applications, 25 of which have been confirmed or fixed, covering all tested programs.
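
The four steps named in the abstract (data filtering, object generation, object appending, efficient mutation) can be pictured as a simple pipeline. The Python sketch below is only an illustration of how such stages might be chained and is not the authors' implementation: every function name is hypothetical, the learned models (CNN, Seq2Seq, Transformer) are replaced by plain callables passed in by the caller, object appending is naive byte concatenation rather than a structurally valid PDF update, and the mutation is a plain random bit flip rather than the paper's mutation strategy.

```python
import random
from typing import Callable, List


def filter_seeds(seeds: List[bytes],
                 score: Callable[[bytes], float],
                 threshold: float) -> List[bytes]:
    """Data filtering: keep seeds whose score reaches the threshold.
    `score` is a stand-in for a learned filter (assumption)."""
    return [s for s in seeds if score(s) >= threshold]


def append_objects(seed: bytes, objects: List[bytes]) -> bytes:
    """Object appending: naive concatenation, one object per line.
    A real tool would also have to keep the PDF structure consistent."""
    return seed + b"".join(b"\n" + obj for obj in objects)


def mutate(data: bytes, rng: random.Random, n_flips: int = 1) -> bytes:
    """Mutation: flip `n_flips` randomly chosen bits (illustrative only)."""
    if not data:
        return data
    buf = bytearray(data)
    for _ in range(n_flips):
        index = rng.randrange(len(buf))
        buf[index] ^= 1 << rng.randrange(8)
    return bytes(buf)


def generate_test_cases(seeds: List[bytes],
                        score: Callable[[bytes], float],
                        threshold: float,
                        generate_objects: Callable[[bytes], List[bytes]],
                        rng: random.Random) -> List[bytes]:
    """Chain the four stages; `generate_objects` stands in for a
    generative model that proposes new PDF objects (assumption)."""
    cases = []
    for seed in filter_seeds(seeds, score, threshold):
        new_objects = generate_objects(seed)  # object generation
        cases.append(mutate(append_objects(seed, new_objects), rng))
    return cases
```

Passing the scoring and generation steps in as callables keeps the sketch independent of any particular model; in practice each would wrap a trained network.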

Key words: PDF application, Deep learning, Fuzz testing, Test cases, Code coverage

CLC Number: 

  • TP311