Computer Science, 2025, Vol. 52, Issue (6A): 241100144-9. doi: 10.11896/jsjkx.241100144

• Computer Software & Architecture •

Integrated PU Learning Method PUEVD and Its Application in Software Source Code Vulnerability Detection

BAO Shenghong, YAO Youjian, LI Xiaoya, CHEN Wen   

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu 610207, China
  • Online: 2025-06-16 Published: 2025-06-12
  • About author: BAO Shenghong, born in 2000, postgraduate. His main research interests include machine learning and anomaly detection.
    CHEN Wen, born in 1983, Ph.D, associate professor, Ph.D supervisor. His main research interests include cyber security and data mining.
  • Supported by:
    National Key Research and Development Program of China (020YFB1805405).

Abstract: Compared with traditional methods, AI-based vulnerability detection reduces reliance on expert knowledge and improves detection efficiency. However, training such models typically requires a large number of labeled samples, which are difficult to obtain in practice. Effectively exploiting the large volume of unlabeled code samples to improve model performance under limited labeled data has therefore become a critical issue in automatic software vulnerability detection. Positive-Unlabeled (PU) learning, a semi-supervised approach, combines a small set of positive samples with a large number of unlabeled samples to train models such as random forests. By assigning class scores to unlabeled samples, PU learning generates pseudo-labeled data that improve training. However, PU learning may assign incorrect labels when sample scores lie close to the decision threshold. This paper proposes an integrated PU learning method, PUEVD, for semi-supervised vulnerability detection in source code. PUEVD first computes class scores for unlabeled samples with a random forest, filters the key features, and randomly selects feature subsets. For each subset, it calculates the similarity difference between misclassification-prone samples and reliable positive/negative samples. Following an ensemble-learning strategy, PUEVD aggregates these similarity differences across the subsets to adjust and optimize the class scores, reducing the risk of misclassification. Applied to vulnerability detection with limited labeled samples, PUEVD is validated on standard datasets, including CWE399, libtiff, and asterisk, and achieves higher AUC and F1 scores than traditional methods, demonstrating its effectiveness in vulnerability detection.
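To make the score-adjustment idea concrete, the following Python sketch illustrates the workflow described above: a random forest trained on positives versus unlabeled samples yields initial class scores, the most important features are retained, and similarity differences against reliable positive and negative samples are aggregated over random feature subsets to nudge borderline scores. All names and settings here (puevd_adjust_scores, n_subsets, margin, alpha, the Euclidean similarity, the 0.5 threshold) are illustrative assumptions, not the authors' exact formulation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def puevd_adjust_scores(X_pos, X_unl, n_subsets=20, subset_frac=0.5,
                        top_k_features=30, margin=0.1, alpha=0.3, seed=0):
    """Adjust PU class scores for borderline unlabeled samples (illustrative sketch)."""
    rng = np.random.default_rng(seed)

    # Step 1: initial class scores from a random forest trained on the labeled
    # positives vs. the unlabeled set (treated as tentative negatives).
    X = np.vstack([X_pos, X_unl])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unl))])
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    scores = rf.predict_proba(X_unl)[:, 1]

    # Step 2: filter key features by forest importance.
    key = np.argsort(rf.feature_importances_)[::-1][:top_k_features]

    # Reliable positives/negatives vs. misclassification-prone samples whose
    # scores fall near the 0.5 decision threshold.
    rel_pos = X_unl[scores > 0.5 + margin]
    rel_neg = X_unl[scores < 0.5 - margin]
    ambiguous = np.where(np.abs(scores - 0.5) <= margin)[0]
    if len(rel_pos) == 0 or len(rel_neg) == 0 or len(ambiguous) == 0:
        return scores  # not enough reliable anchors or no borderline samples

    # Step 3: over random subsets of the key features, compute each ambiguous
    # sample's similarity difference to reliable positives vs. negatives
    # (similarity = negative mean Euclidean distance, an assumed choice).
    diffs = np.zeros(len(X_unl))
    for _ in range(n_subsets):
        feats = rng.choice(key, size=max(1, int(subset_frac * len(key))),
                           replace=False)
        for i in ambiguous:
            x = X_unl[i, feats]
            sim_pos = -np.linalg.norm(rel_pos[:, feats] - x, axis=1).mean()
            sim_neg = -np.linalg.norm(rel_neg[:, feats] - x, axis=1).mean()
            diffs[i] += sim_pos - sim_neg

    # Step 4: aggregate the similarity differences across subsets and use them
    # to nudge the borderline scores up or down.
    diffs /= n_subsets
    return np.clip(scores + alpha * np.tanh(diffs), 0.0, 1.0)

In this sketch, X_pos would hold feature vectors of known vulnerable code samples and X_unl the unlabeled samples; the returned scores can then be thresholded to produce pseudo-labels for subsequent retraining.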

Key words: PU learning, Vulnerability detection, Ensemble learning, Score adjustment, Semi-supervised learning

CLC Number: TP181