Computer Science, 2021, Vol. 48, Issue (12): 131-139. doi: 10.11896/jsjkx.201000168

• Computer Software •

Noise Tolerable Feature Selection Method for Software Defect Prediction

TENG Jun-yuan, GAO Meng, ZHENG Xiao-meng, JIANG Yun-song

  1. Beijing Institute of Control Engineering, Beijing 100190, China
  • Received: 2020-10-28 Revised: 2021-03-15 Online: 2021-12-15 Published: 2021-11-26
  • Corresponding author: GAO Meng (gaomeng@sunwiseinfo.com)
  • About author: TENG Jun-yuan (tengjunyuan@sunwiseinfo.com), born in 1985, master, senior engineer. His main research interests include embedded software testing and software engineering.
    GAO Meng, born in 1982, master, senior engineer. His main research interests include embedded software testing and software engineering.
  • Supported by:
    National Natural Science Foundation of China (61802017) and Equipment Pre-Research Field Fund Project (61400020407).

Abstract: By mining defect datasets, software defect prediction can identify defective modules in advance and help testers test in a more targeted way. However, label noise is widespread in these datasets and degrades the performance of prediction models. Few existing feature selection methods are designed specifically for noise tolerance. In addition, in the mainstream noise tolerable feature selection framework, strategies must be chosen manually based on experience, which makes the framework difficult to apply in software engineering practice. In view of this, this paper proposes a novel method, NTFES (Noise Tolerable FEature Selection). NTFES first generates multiple Bootstrap samples by Bootstrap sampling. On these samples, it then divides the original features into groups using the approximate Markov blanket and selects candidate features from each group with two heuristic feature selection strategies. Subsequently, it uses a genetic algorithm (GA) to search for the optimal feature subset in the candidate feature space. To verify the effectiveness of NTFES, this paper uses the NASA MDP datasets and injects label noise to obtain datasets with noisy labels. It then compares NTFES with baseline methods such as FULL, FCBF and CFS while controlling the ratio of label noise. The experimental results show that, at acceptable label noise ratios, NTFES achieves both higher classification performance and better noise tolerance.

Key words: Feature selection, Label noise, Noise tolerable, Software defect prediction, Software testing
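
The early stages named in the abstract (label noise injection, Bootstrap sampling, and grouping features by approximate Markov blanket) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes binary labels, already-discretized features, and the FCBF-style definition of an approximate Markov blanket based on symmetric uncertainty. All function names are hypothetical. No relevance threshold is applied, and the two heuristic selection strategies and the GA search are not shown.

```python
import math
import random
from collections import Counter


def inject_label_noise(labels, ratio, rng):
    """Flip a fixed fraction of binary (0/1) labels, chosen uniformly at random."""
    noisy = list(labels)
    n_flip = int(round(ratio * len(noisy)))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one Bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def entropy(values):
    """Shannon entropy (base 2) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) for discrete sequences."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def group_features(columns, labels):
    """Group features by approximate Markov blanket (assumed FCBF-style rule).

    Feature j is treated as an approximate Markov blanket of feature i when
    SU(j, class) >= SU(i, class) and SU(i, j) >= SU(i, class).
    """
    relevance = {name: symmetric_uncertainty(col, labels)
                 for name, col in columns.items()}
    # Visiting features by descending relevance guarantees the first condition
    # for every not-yet-assigned feature compared against the current leader.
    ordered = sorted(columns, key=lambda name: relevance[name], reverse=True)
    groups, assigned = [], set()
    for leader in ordered:
        if leader in assigned:
            continue
        group = [leader]
        assigned.add(leader)
        for other in ordered:
            if other in assigned:
                continue
            if symmetric_uncertainty(columns[other], columns[leader]) >= relevance[other]:
                group.append(other)
                assigned.add(other)
        groups.append(group)
    return groups
```

In a fuller pipeline, `group_features` would presumably be run on the rows selected by each `bootstrap_indices` draw, and the resulting groups would feed the candidate selection step.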

CLC Number:

  • TP391