Computer Science ›› 2021, Vol. 48 ›› Issue (12): 131-139. doi: 10.11896/jsjkx.201000168
TENG Jun-yuan (滕俊元), GAO Meng (高猛), ZHENG Xiao-meng (郑小萌), JIANG Yun-song (江云松)
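Abstract: By mining defect datasets, a defect prediction model can identify defect-prone modules in the software under test ahead of time, helping testers focus their testing effort. However, label noise, which is widespread in such datasets, degrades the performance of prediction models. Few existing feature selection methods are designed specifically for noise tolerance, and in mainstream noise-tolerant feature selection frameworks the selection strategy has to be chosen manually based on experience, which makes them difficult to apply in software engineering practice. To address this, this paper proposes NTFES (Noise Tolerable FEature Selection), a noise-tolerant feature selection method for software defect prediction. NTFES generates multiple bootstrap sample sets through Bootstrap sampling; on each bootstrap sample set it groups features based on the approximate Markov blanket and applies two heuristic feature selection strategies to pick candidate features from each group; it then uses a genetic algorithm to search the candidate feature space for the optimal feature subset. To verify the effectiveness of NTFES, the NASA MDP software projects are selected as experimental subjects, and noise is injected into their labels to obtain datasets with noisy labels. By controlling the label noise ratio, NTFES is compared with baseline methods such as FULL, FCBF, and CFS. The experimental results show that, at acceptable label noise ratios, NTFES achieves not only higher classification performance but also better noise tolerance.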
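The first stages described in the abstract (label noise injection, Bootstrap sampling, and grouping by approximate Markov blanket) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, features are assumed to be discrete, labels are assumed to be binary, and symmetric uncertainty is assumed as the correlation measure, as in the FCBF baseline. The two heuristic selection strategies and the genetic algorithm search are left out because the abstract does not specify them.

```python
import math
import random
from collections import Counter


def inject_label_noise(labels, ratio, rng):
    """Flip a fraction `ratio` of binary labels, chosen uniformly at random."""
    noisy = list(labels)
    n_flip = int(round(ratio * len(noisy)))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) for discrete sequences."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def group_by_approx_markov_blanket(columns, labels):
    """Group features around the most class-relevant ones.

    Assumed criterion (FCBF-style): feature j is an approximate Markov
    blanket of feature i when SU(j, class) >= SU(i, class) and
    SU(i, j) >= SU(i, class).
    """
    relevance = {name: symmetric_uncertainty(col, labels)
                 for name, col in columns.items()}
    # Visiting features by descending relevance guarantees the first condition.
    remaining = sorted(columns, key=lambda name: relevance[name], reverse=True)
    groups = []
    while remaining:
        leader = remaining.pop(0)
        members = [f for f in remaining
                   if symmetric_uncertainty(columns[f], columns[leader])
                   >= relevance[f]]
        remaining = [f for f in remaining if f not in members]
        groups.append([leader] + members)
    return groups
```

In a pipeline of this shape, each list returned by `bootstrap_indices` would select the rows of one bootstrap sample set, `group_by_approx_markov_blanket` would be run on every such set, and the resulting groups would feed the candidate selection step. Note that this sketch places a feature with zero class relevance into the first group it is compared against; a relevance threshold would be needed to discard such features beforehand.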