Computer Science (计算机科学), 2024, Vol. 51, Issue 12: 53-62. doi: 10.11896/jsjkx.231100179
刘家豪, 江贺
LIU Jiahao, JIANG He
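Abstract: PDF is a widely used and important document format. Because the format is complex, defects in PDF-related applications can have serious consequences, such as exposure to malicious attacks or incorrect rendering of information. Testing PDF-related applications has therefore become an active research topic. The most effective approach at present is grammar-based fuzzing. However, grammar-based fuzzing usually requires substantial manual effort to summarize and write complex grammar rules, which severely hinders efficient, automated test case generation. Deep learning offers a feasible way past this obstacle, but the test cases generated by existing methods are generally of low quality and have poor bug-finding ability. Improving them further requires addressing three main challenges: filtering the dataset, balancing gains in test case coverage against growth in test case size, and mutating test cases efficiently. This paper therefore proposes DeepGenFuzz, an efficient deep-learning-based framework for generating fuzzing test cases for PDF applications. It uses models such as CNN, Seq2Seq, and Transformer to generate high-quality PDF test cases through four steps: data filtering, object generation, object attachment, and efficient mutation. Evaluation on MuPDF and other PDF applications shows that the test cases generated by DeepGenFuzz achieve markedly higher average code coverage than state-of-the-art tools such as Learn&Fuzz and IUST-DeepFuzz, with maximum gains of 8.12% to 61.03%. Its bug-finding ability is also far better than that of these tools: 31 previously unreported bugs found in 7 of the most popular PDF applications have been reported so far, of which 25 have been confirmed or fixed, covering all tested programs.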