Computer Science ›› 2023, Vol. 50 ›› Issue (12): 24-31. doi: 10.11896/jsjkx.221100171

• Computer Software •

Aggregation Model for Software Defect Prediction Based on Data Enhancement by GAN

XU Jinpeng1, GUO Xinfeng1, WANG Ruibo2, LI Jihong2

  1 School of Automation and Software Engineering, Shanxi University, Taiyuan 030006, China
  2 School of Modern Education Technology, Shanxi University, Taiyuan 030006, China
  • Received: 2022-11-21 Revised: 2023-04-07 Online: 2023-12-15 Published: 2023-12-07
  • Corresponding author: LI Jihong (lijh@sxu.edu.cn)
  • About author: (202123604008@email.sxu.edu.cn)
    XU Jinpeng, born in 1998, postgraduate. His main research interests include software defect prediction and deep learning.
    LI Jihong, born in 1964, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include deep learning, natural language processing and software defect prediction.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (61806115).

Abstract: In software defect prediction, a machine learning classification algorithm is usually used to build a software defect prediction (SDP) model from datasets of static software features such as C&K metrics. However, most such datasets contain few defective instances, so class imbalance is severe and the learned SDP models predict poorly. This paper uses a generative adversarial network (GAN) to generate positive samples, screens the generated samples by FID score, and uses them to enlarge the positive class. Then, within the framework of block-regularized m×2 cross validation (m×2BCV), the results of multiple sub-models are aggregated by majority voting to form the final SDP model. Twenty datasets from the PROMISE database are used as experimental data, and the random forest algorithm is used to build the SDP aggregation model. Experimental results show that, compared with traditional random over-sampling, SMOTE, and random under-sampling, the average F1 value of the proposed SDP aggregation model increases by 10.2%, 5.7%, and 3.4%, respectively, and the stability of F1 improves accordingly. The proposed model attains the highest F1 value on 17 of the 20 datasets. In terms of AUC, there is no significant difference between the proposed method and the traditional sampling methods.

Key words: Generative adversarial network, Data enhancement, Block-regularized m×2 cross validation, Software defect prediction, Aggregation model
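The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one sub-model is trained on each half of every m×2 partition (2m sub-models in total), uses plain random halving instead of the block-regularized partition construction, replaces the random forest with a toy one-feature threshold learner, omits GAN augmentation entirely, and breaks voting ties in favor of the positive class. All function names (`m_by_2_splits`, `majority_vote`, `train_aggregate`, `fit_threshold`, `f1_score`) are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[Sequence[float]], int]
Trainer = Callable[[List[Sequence[float]], List[int]], Predictor]


def m_by_2_splits(n: int, m: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return m random partitions of range(n) into two halves.

    Assumption: plain random halving. The block-regularized construction
    also controls the overlap between partitions; that is not done here.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(m):
        idx = list(range(n))
        rng.shuffle(idx)
        half = n // 2
        splits.append((idx[:half], idx[half:]))
    return splits


def majority_vote(votes: List[List[int]]) -> List[int]:
    """Per-sample mode of 0/1 votes; ties go to 1 (an assumed tie-break)."""
    n_models = len(votes)
    return [1 if 2 * sum(col) >= n_models else 0 for col in zip(*votes)]


def train_aggregate(X, y, fit: Trainer, m: int = 3, seed: int = 0) -> Predictor:
    """Fit one sub-model per half (2*m in total) and aggregate by voting."""
    models = []
    for half_a, half_b in m_by_2_splits(len(X), m, seed):
        for half in (half_a, half_b):
            models.append(fit([X[i] for i in half], [y[i] for i in half]))
    return lambda x: majority_vote([[model(x)] for model in models])[0]


def fit_threshold(X, y) -> Predictor:
    """Toy stand-in for a real classifier: threshold on the first feature.

    Assumes positive samples tend to have larger feature values.
    """
    pos = [x[0] for x, label in zip(X, y) if label == 1]
    neg = [x[0] for x, label in zip(X, y) if label == 0]
    if not pos or not neg:
        constant = 1 if pos else 0  # only one class seen: predict it always
        return lambda x: constant
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[0] > threshold else 0


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 of the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

For example, `train_aggregate(X, y, fit_threshold, m=3)` returns a predictor backed by six sub-models. In a fuller pipeline, the positive class would first be enlarged with screened GAN samples and `fit_threshold` would be replaced by a random forest trainer.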

CLC Number: 

  • TP311