Computer Science ›› 2023, Vol. 50 ›› Issue (12): 24-31. doi: 10.11896/jsjkx.221100171
徐金鹏1, 郭新峰1, 王瑞波2, 李济洪2
XU Jinpeng1, GUO Xinfeng1, WANG Ruibo2, LI Jihong2
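Abstract: In software defect prediction, models are usually built by applying machine learning classification algorithms to datasets of static software features such as the C&K metrics. However, most static-feature datasets contain few defective instances, so class imbalance is severe and the learned software defect prediction (SDP) models predict poorly. This paper uses a generative adversarial network (GAN) to generate positive (defective) samples, screens the generated samples by FID score, and uses the retained samples to enlarge the positive class. Then, within the block-regularized m×2 cross-validation (m×2BCV) framework, the outputs of multiple sub-models are aggregated by majority (mode) voting to form the final SDP model. Experiments use 20 datasets from the PROMISE repository, with random forest as the learning algorithm for the aggregated SDP model. Compared with traditional random oversampling, SMOTE, and random undersampling, the proposed aggregated SDP model raises the average F1 by 10.2%, 5.7%, and 3.4%, respectively, and F1 also becomes more stable. The proposed model achieves the highest F1 on 17 of the 20 datasets. In terms of AUC, the proposed method shows no clear difference from the traditional sampling methods.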
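The aggregation step named in the abstract, mode voting over the sub-models trained under m×2BCV, can be illustrated with a minimal sketch. It assumes that each of the 2m sub-models has already produced a 0/1 label for every test sample (1 = defective). The function name `majority_vote` and the `tie_label` parameter are hypothetical; in particular, how ties among an even number of sub-models are resolved is an assumption made here, not something the abstract specifies.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]], tie_label: int = 1) -> List[int]:
    """Aggregate per-sub-model labels by mode voting.

    predictions[k][i] is the label that sub-model k assigns to sample i.
    tie_label is a hypothetical tie-breaking choice (assumption, not from the paper).
    """
    if not predictions:
        raise ValueError("need at least one sub-model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all sub-models must label the same samples")

    aggregated = []
    for i in range(n_samples):
        # Count how many sub-models voted for each label on sample i.
        counts = Counter(p[i] for p in predictions)
        ranked = counts.most_common()
        best_count = ranked[0][1]
        winners = [label for label, c in ranked if c == best_count]
        # A unique mode wins; otherwise fall back to the assumed tie label.
        aggregated.append(winners[0] if len(winners) == 1 else tie_label)
    return aggregated


# Example with m = 2, i.e. 2m = 4 sub-models and three test samples.
sub_model_labels = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
print(majority_vote(sub_model_labels))
```

Because m×2BCV yields an even number of sub-models, ties can occur; breaking them toward the defective class favors recall over precision, and passing `tie_label=0` gives the opposite trade-off.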