Computer Science ›› 2021, Vol. 48 ›› Issue (12): 131-139. doi: 10.11896/jsjkx.201000168

• Computer Software •

Noise Tolerable Feature Selection Method for Software Defect Prediction

TENG Jun-yuan, GAO Meng, ZHENG Xiao-meng, JIANG Yun-song

  1. Beijing Institute of Control Engineering, Beijing 100190, China
  • Received: 2020-10-28 Revised: 2021-03-15 Published: 2021-12-15 Online: 2021-11-26
  • Corresponding author: GAO Meng (gaomeng@sunwiseinfo.com)
  • About the authors: TENG Jun-yuan (tengjunyuan@sunwiseinfo.com), born in 1985, master, senior engineer. His main research interests include embedded software testing and software engineering.
    GAO Meng, born in 1982, master, senior engineer. His main research interests include embedded software testing and software engineering.
  • Supported by:
    National Natural Science Foundation of China (61802017) and Equipment Pre-Research Field Fund Project (61400020407).

Abstract: By mining defect datasets, a software defect prediction model can identify defective modules in the software under test in advance, helping testers test in a more targeted way. However, label noise, which is common in these datasets, degrades the performance of the prediction model. Few existing feature selection methods are designed specifically for noise tolerance, and in the mainstream noise-tolerant feature selection framework the selection strategy has to be chosen manually based on experience, which makes the framework hard to apply in software engineering practice. To address this, this paper proposes NTFES (Noise Tolerable FEature Selection), a noise-tolerant feature selection method for software defect prediction. NTFES first generates multiple bootstrap sample sets by Bootstrap sampling. On these sample sets, it divides the features into groups using the approximate Markov blanket and selects candidate features from each group with two heuristic feature selection strategies. It then uses a genetic algorithm (GA) to search the candidate feature space for the optimal feature subset. To verify the effectiveness of NTFES, the NASA MDP software projects are chosen as experimental subjects, and noise is injected into their labels to obtain datasets with noisy labels. NTFES is then compared with baseline methods such as FULL, FCBF and CFS under controlled label noise ratios. The experimental results show that, when the label noise ratio is acceptable, NTFES achieves not only higher classification performance but also better noise tolerance.

Key words: Software testing, Software defect prediction, Feature selection, Label noise, Noise tolerable
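
The experimental setup described in the abstract (injecting label noise at a controlled ratio, then drawing bootstrap sample sets) can be illustrated with a minimal Python sketch. The abstract does not say how the noise was injected or how many sample sets were drawn, so the random flipping of binary labels, the function names `inject_label_noise` and `bootstrap_samples`, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
import random


def inject_label_noise(labels, ratio, seed=0):
    """Flip the binary label (0/1) of a randomly chosen fraction of instances.

    Assumption: noise is injected by uniform random label flipping.
    """
    rng = random.Random(seed)
    n_flip = int(round(ratio * len(labels)))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def bootstrap_samples(n_instances, n_samples, seed=0):
    """Return n_samples index lists, each drawn with replacement
    and each as large as the original dataset."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_instances) for _ in range(n_instances)]
        for _ in range(n_samples)
    ]


# Toy data (hypothetical): 100 modules, 20 of them labeled defective.
labels = [0] * 80 + [1] * 20
noisy = inject_label_noise(labels, ratio=0.1)
samples = bootstrap_samples(len(noisy), n_samples=5)
```

Each index list in `samples` selects the rows of one bootstrap sample set; in a pipeline like the one the abstract outlines, feature grouping and candidate selection would then run on each set separately.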

CLC Number: TP391