Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 241100144-9. doi: 10.11896/jsjkx.241100144
BAO Shenghong, YAO Youjian, LI Xiaoya, CHEN Wen
Abstract: Compared with traditional vulnerability detection methods, AI-based approaches reduce the reliance on expert knowledge and improve detection efficiency. Training a vulnerability detection model typically requires a large number of labeled samples, yet in practice labeled vulnerability samples are difficult to collect. How to effectively exploit abundant unlabeled source-code samples to improve model training when only a few labeled vulnerability samples are available has therefore become an important problem in vulnerability detection. Positive-Unlabeled learning (PU learning) is a semi-supervised paradigm that trains models such as random forests from a small set of positive samples together with a large set of unlabeled samples: by scoring the class membership of unlabeled samples, it produces abundant pseudo-labeled training data that boost model performance. However, when the class scores of some unlabeled samples fall near the decision threshold, PU learning is prone to generating incorrect pseudo-labels. To enable PU-learning-based source-code vulnerability detection, this paper proposes an ensemble PU learning method named PUEVD. PUEVD first builds a random forest via PU risk minimization, computes class scores for the unlabeled source-code samples, and selects a set of typical code features. It then randomly draws a subset of these typical features and, on that subset, compares how similar an easily misclassified sample is to the reliable positive versus the reliable negative source-code samples. Following the ensemble learning principle, PUEVD treats the similarity difference computed for an easily misclassified sample x on a given feature subset as the output of one weak classifier, and then fuses the difference values obtained over multiple randomly drawn feature subsets to adjust the class score of x, pushing the score into a reliable interval and reducing the risk of misclassification. PUEVD is applied to software vulnerability mining, achieving vulnerability detection with only a few labeled source-code samples. Experiments on the standard source-code vulnerability datasets CWE399, libtiff, and asterisk show that PUEVD outperforms conventional methods in both AUC and F1 score, demonstrating the effectiveness of PUEVD for vulnerability detection.
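The ensemble score-adjustment step described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `adjust_score`, the distance-based similarity measure, the sign-vote weak classifier, and the subset-size and step parameters are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def adjust_score(x, score, pos, neg, feature_idx,
                 n_subsets=10, subset_size=3, step=0.1):
    """Adjust the PU class score of an easily misclassified sample x.

    For each randomly drawn feature subset, compare x's similarity to
    reliable positives vs. reliable negatives; each comparison acts as
    one weak classifier, and the votes are fused to shift the score.
    """
    votes = []
    for _ in range(n_subsets):
        subset = rng.choice(feature_idx, size=subset_size, replace=False)
        # Similarity = negative mean Euclidean distance on the chosen subset
        sim_pos = -np.mean(np.linalg.norm(pos[:, subset] - x[subset], axis=1))
        sim_neg = -np.mean(np.linalg.norm(neg[:, subset] - x[subset], axis=1))
        votes.append(np.sign(sim_pos - sim_neg))  # one weak-classifier vote
    fused = np.mean(votes)  # in [-1, 1]: +1 leans positive, -1 leans negative
    return float(np.clip(score + step * fused, 0.0, 1.0))
```

With reliable positives clustered near 1, reliable negatives near 0, and x close to the positive cluster, every subset votes positive and a borderline score of 0.5 is nudged up to 0.6, moving it away from the threshold region where pseudo-labeling errors occur.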