Ensemble Model for Software Defect Prediction

doi:10.11896/jsjkx.180901685

Abstract

Abstract: Software defect prediction aims to identify defective modules effectively. Traditional classifiers predict well on class-balanced data, but when the class distribution is imbalanced they tend to favor the majority class, so minority-class modules are easily misclassified. In practice, software defect data are usually class-imbalanced. To address this problem, this paper proposes an ensemble model based on improved class-weight self-adaptation, soft voting, and threshold moving. Without altering the original data set, the model handles class imbalance in both the training stage and the decision stage. First, in the class-weight learning stage, the optimal weights for the different classes are obtained through adaptive class-weight learning. Then, in the training stage, three base classifiers are trained with the optimal weights from the previous step and combined by a soft ensemble method. Finally, in the decision stage, a threshold-moving model makes the decision that yields the final predicted class. To validate the proposed method, experiments were run on the NASA and Eclipse standard software defect data sets, and the method was compared on the same data sets with several software defect prediction methods proposed in recent years in terms of recall (Pd), false positive rate (Pf), and F1 score (F-measure). The experimental results show that the proposed method improves Pd by 0.09 and F-measure by 0.06 on average. The proposed method therefore outperforms the compared software defect prediction methods overall in handling class imbalance.

Key words: Class-weight self-adaptation, Ensemble learning, Soft ensemble, Soft voting, Software defect prediction, Threshold moving
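The decision-stage idea in the abstract (soft voting over three base classifiers, then threshold moving, scored by Pd, Pf and F-measure) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the equal voting weights, the probability scores, and the moved threshold of 0.3 are all hypothetical values chosen for illustration, and the class-weight learning stage is omitted entirely.

```python
from typing import List, Optional, Sequence, Tuple


def soft_vote(prob_lists: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Fuse per-classifier P(defective) scores by a (weighted) average."""
    if weights is None:
        weights = [1.0] * len(prob_lists)  # assumption: equal weights
    total = sum(weights)
    n_modules = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_modules)
    ]


def threshold_move(probs: Sequence[float], threshold: float) -> List[int]:
    """Label a module defective (1) when its fused score reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def pd_pf_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (Pd, Pf, F-measure), treating label 1 (defective) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0      # Pd
    fall_out = fp / (fp + tn) if fp + tn else 0.0    # Pf
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, fall_out, f1


# Hypothetical toy data: 8 modules, 3 of them defective (the minority class).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
# Made-up P(defective) outputs standing in for three trained base classifiers.
clf_a = [0.70, 0.20, 0.10, 0.40, 0.30, 0.05, 0.35, 0.45]
clf_b = [0.60, 0.10, 0.20, 0.45, 0.25, 0.10, 0.30, 0.50]
clf_c = [0.80, 0.30, 0.15, 0.35, 0.20, 0.15, 0.40, 0.40]

fused = soft_vote([clf_a, clf_b, clf_c])
for threshold in (0.5, 0.3):  # 0.5 is the default cut; 0.3 is an assumed moved threshold
    y_pred = threshold_move(fused, threshold)
    recall, fall_out, f1 = pd_pf_f1(y_true, y_pred)
    print(f"threshold={threshold}: Pd={recall:.2f} Pf={fall_out:.2f} F1={f1:.2f}")
```

On this toy data, lowering the threshold recovers the defective modules that the default cut misses, at the cost of one false alarm. That is the trade-off threshold moving is meant to manage; how the threshold is actually chosen in the paper is not stated in the abstract.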

CLC Number:

TP311

HU Meng-yuan (胡梦园), HUANG Hong-yun (黄鸿云), DING Zuo-hua (丁佐华). Ensemble Model for Software Defect Prediction[J]. Computer Science, 2019, 46(11): 176-180. https://doi.org/10.11896/jsjkx.180901685
