Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 210800160-7. doi: 10.11896/jsjkx.210800160

• Software Engineering •

Multi-source Cross-project Defect Prediction with Data Selection

DENG Jian-hua, WANG Wei

  1. School of Software, Yunnan University, Kunming 650091, China
  • Online: 2022-11-10 Published: 2022-11-21
  • Corresponding author: WANG Wei (wangwei@ynu.edu.cn)
  • About author (1765881146@qq.com): DENG Jian-hua, born in 1997, master candidate, is a member of China Computer Federation. His main research interest is software defect prediction in software engineering.
    WANG Wei, born in 1979, Ph.D., associate professor, is a member of China Computer Federation. His main research interests include software engineering, machine learning and formal methods.
  • Supported by:
    Young and Middle-aged Academic and Technical Leader Candidate Project of Yunnan Province (2019HB104).

Abstract: Multi-source cross-project defect prediction (MCPDP) aims to use historical data from multiple other projects (source projects) to predict the likelihood of defects in the software modules of a target project. It addresses the cold-start problem of defect prediction modeling and offers a way to build defect prediction models for new software or for software systems that lack historical data. Source data selection is considered an effective way to further improve the accuracy of cross-project defect prediction. This paper therefore studies a multi-source cross-project defect prediction method with data selection. The method has two steps: 1) feature alignment of the source data; 2) source data screening based on an improved maximum mean measure. To verify the effectiveness of the proposed method, experiments are carried out on four public datasets: AEEEM, Relink, NASA and SOFTLAB. The results show that the proposed method improves the F-measure by 4% and 5%, respectively, compared with the baseline method, indicating that it performs well.

Key words: Multi-source domain, Cross-project, Defect prediction, Data selection, Feature alignment
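
The source-screening step described in the abstract can be illustrated with a small sketch. It assumes that the "maximum mean measure" refers to the standard maximum mean discrepancy (MMD) with an RBF kernel; the paper's improved variant and its feature-alignment step are not specified here and are not reproduced. The function names (`mmd2`, `select_sources`), the kernel width default and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma=None):
    """Biased estimate of the squared MMD between samples x and y.

    gamma defaults to 1 / n_features (an assumption, not from the paper).
    """
    if gamma is None:
        gamma = 1.0 / x.shape[1]
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return k_xx + k_yy - 2.0 * k_xy


def select_sources(sources, target, k=1, gamma=None):
    """Rank source projects by MMD to the target and keep the k closest.

    sources: dict mapping a project name to an (n_samples, n_features) array
             whose features are already aligned with the target's.
    """
    scores = {name: mmd2(x, target, gamma) for name, x in sources.items()}
    ranked = sorted(scores, key=scores.get)
    return ranked[:k], scores


if __name__ == "__main__":
    # Synthetic stand-ins for aligned metric data of three projects.
    rng = np.random.default_rng(0)
    target = rng.normal(0.0, 1.0, size=(200, 5))
    sources = {
        "near": rng.normal(0.1, 1.0, size=(200, 5)),
        "far": rng.normal(3.0, 1.0, size=(200, 5)),
    }
    chosen, scores = select_sources(sources, target, k=1)
    print(chosen, scores)
```

A smaller MMD means the source distribution is closer to the target's, so training only on the top-ranked sources is one plausible way to filter out dissimilar projects before building the prediction model.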

CLC Number: TP311