Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 241100144-9. doi: 10.11896/jsjkx.241100144

• Computer Software & Architecture •

Integrated PU Learning Method PUEVD and Its Application in Software Source Code Vulnerability Detection

BAO Shenghong, YAO Youjian, LI Xiaoya, CHEN Wen

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu 610207, China
  • Online: 2025-06-16  Published: 2025-06-12
  • Corresponding author: CHEN Wen (wenchen@scu.edu.cn)
  • About author: BAO Shenghong (heitan@stu.scu.edu.cn), born in 2000, postgraduate. His main research interests include machine learning and anomaly detection.
    CHEN Wen, born in 1983, Ph.D., associate professor, Ph.D. supervisor. His main research interests include cyber security and data mining.
  • Supported by:
    National Key Research and Development Program of China (020YFB1805405).

Abstract: Compared with traditional approaches, AI-based vulnerability detection reduces reliance on expert knowledge and improves detection efficiency. However, training a vulnerability detection model typically requires a large number of labeled samples, which are difficult to collect in practice. Therefore, how to effectively exploit large volumes of unlabeled source-code samples to improve model training when only a few labeled vulnerability samples are available has become an important problem in vulnerability detection. Positive-Unlabeled (PU) learning is a semi-supervised paradigm that combines a small set of positive samples with a large number of unlabeled samples to train models such as random forests; by assigning class scores to the unlabeled samples, it generates abundant pseudo-labeled training data that improve training performance. However, when the class scores of some unlabeled samples fall near the decision threshold, PU learning is prone to producing incorrect pseudo-labels. To enable PU-learning-based source-code vulnerability detection, this paper proposes an integrated PU learning method, PUEVD. PUEVD first builds a random forest via PU risk minimization, computes class scores for the unlabeled source-code samples, and selects a set of representative code features. It then repeatedly draws random subsets of these features, and on each subset compares the similarity of a misclassification-prone sample with the reliable positive and reliable negative source-code samples. Following the ensemble-learning principle, PUEVD treats the similarity difference computed for a misclassification-prone sample x on a given feature subset as the output of one weak classifier, then fuses the differences obtained over multiple randomly selected subsets to adjust x's class score into a trustworthy interval, thereby reducing the risk of misclassification. PUEVD is applied to software vulnerability mining, achieving detection with only a few labeled source-code samples. Experiments on the standard source-code vulnerability datasets CWE399, libtiff, and asterisk show that PUEVD outperforms traditional methods in both AUC and F1 score, demonstrating its effectiveness for vulnerability detection.

Key words: PU learning, Vulnerability detection, Ensemble learning, Score adjustment, Semi-supervised learning
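The first stage described in the abstract — scoring unlabeled samples with a PU-trained random forest and splitting them into reliable positives, reliable negatives, and misclassification-prone borderline samples — can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the treat-unlabeled-as-negative training shortcut and the 0.7/0.3 thresholds are simplifying assumptions standing in for the paper's PU risk minimization.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy feature vectors standing in for extracted source-code features.
X_pos = rng.normal(loc=1.0, size=(30, 8))                 # labeled vulnerable samples (P)
X_unl = np.vstack([rng.normal(loc=1.0, size=(40, 8)),     # hidden positives in U
                   rng.normal(loc=-1.0, size=(160, 8))])  # hidden negatives in U

# Stage 1: train a random forest treating unlabeled samples as provisional negatives.
X = np.vstack([X_pos, X_unl])
y = np.r_[np.ones(len(X_pos)), np.zeros(len(X_unl))]
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Class scores for the unlabeled pool.
scores = forest.predict_proba(X_unl)[:, 1]

# Split by illustrative thresholds around the decision boundary; the middle
# band is where PU pseudo-labels are most error-prone.
hi, lo = 0.7, 0.3
reliable_pos = X_unl[scores >= hi]
reliable_neg = X_unl[scores <= lo]
borderline   = X_unl[(scores > lo) & (scores < hi)]
```

In this sketch the borderline set is exactly the group whose scores PUEVD's second stage would adjust; the reliable sets supply the comparison anchors for that adjustment.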

CLC number: TP181
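The score-adjustment stage — treating each random feature subset as a weak classifier that compares a borderline sample's similarity to reliable positives versus reliable negatives, then fusing the votes — might be sketched as below. The Euclidean distance as the similarity measure, the sign-vote fusion, and the step size `alpha` are illustrative assumptions; the paper's actual similarity computation and fusion rule may differ.

```python
import numpy as np

def adjust_scores(borderline, scores, rel_pos, rel_neg,
                  n_subsets=25, subset_size=3, alpha=0.1, seed=0):
    """Nudge class scores of borderline samples toward a trustworthy interval.

    Each random feature subset acts as one weak classifier: a sample that is
    closer (on that subset) to the reliable positives than to the reliable
    negatives votes +1, otherwise -1. The averaged vote adjusts the score.
    """
    rng = np.random.default_rng(seed)
    d = borderline.shape[1]
    votes = np.zeros(len(borderline))
    for _ in range(n_subsets):
        f = rng.choice(d, size=subset_size, replace=False)   # one feature subset
        # Mean distance from each borderline sample to the two reliable groups.
        dp = np.linalg.norm(borderline[:, None, f] - rel_pos[None, :, f],
                            axis=2).mean(axis=1)
        dn = np.linalg.norm(borderline[:, None, f] - rel_neg[None, :, f],
                            axis=2).mean(axis=1)
        votes += np.sign(dn - dp)        # closer to positives -> vote +1
    return np.clip(scores + alpha * votes / n_subsets, 0.0, 1.0)

# Tiny demo: one borderline sample sits in the positive cluster, one in the
# negative cluster; both start at the ambiguous score 0.5.
rel_pos = np.full((10, 4), 1.0)
rel_neg = np.full((10, 4), -1.0)
borderline = np.array([[1.0, 1.0, 1.0, 1.0],
                       [-1.0, -1.0, -1.0, -1.0]])
adjusted = adjust_scores(borderline, np.array([0.5, 0.5]), rel_pos, rel_neg)
```

After adjustment the first sample's score rises above 0.5 and the second's falls below it, i.e. both leave the ambiguous band around the threshold — the effect the abstract attributes to PUEVD's ensemble fusion.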