Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241200059-7. doi: 10.11896/jsjkx.241200059

• Computer Software •

  • Corresponding author: LIU Yong (lyong@mail.buct.edu.cn)
  • About author: GUO Liwei (liwei.glw@outlook.com)

Semantic Variations Based Defect Generation and Prediction Model Testing

GUO Liwei1, WU Yonghao2, LIU Yong1   

  1. College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
  2. School of Information Engineering, Beijing Institute of Petrochemical Technology, Beijing 102627, China
  • Online: 2025-11-15  Published: 2025-11-10
  • Supported by:
    National Natural Science Foundation of China (61902015, 61872026, 61672085).


Abstract: In recent years, machine learning techniques have made significant advances in defect prediction within software development, enabling the automatic detection of errors in large-scale codebases. These advances are expected to enhance the reliability, security, and overall quality of software. Defect prediction models can autonomously identify whether code contains errors. However, existing models, while having certain advantages, also exhibit limitations: they often fail to accurately identify vulnerabilities, or incorrectly label defective code segments as problem-free. Currently, there is a lack of systematic empirical studies on the quality of defect detection models. The existing method, DPTester, assesses the effectiveness of defect models by generating defective code through modifications to if conditions in the code. However, the defective code produced by this method is overly simplistic, and the evaluation scenarios do not cover a wide range of models, including the latest large language models. To address this gap, this paper proposes an improved method called DefectGen, which introduces multiple strategies to generate defective code that more closely reflects real-world issues, and extends the evaluation of defect models to include large language models. Experimental results indicate that DefectGen significantly improves the ability to generate complex defective code compared with previous methods, producing 1.2 times more defective code from a single correct code instance. When testing the CodeT5+, CodeBERT, and GPT-4o models, the proportions of incorrect defect predictions are 62%, 78%, and 30%, respectively. Additionally, DefectGen demonstrates higher efficiency in both the test input generation and defect detection phases, with generation and detection times of 0.003 s and 0.02 s per test input. These results suggest that DefectGen not only effectively exposes the limitations of existing models but also provides new opportunities for improving defect prediction models and enhancing software quality assurance processes.
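The if-condition mutation idea described above can be sketched as follows. This is a minimal Python illustration, not the paper's implementation: it assumes single-operator comparisons in `if` tests, and the names `_FLIP`, `_mutable`, and `generate_defects` are ours, chosen for the example.

```python
import ast

# Hypothetical sketch of semantic-variation defect generation: flip a
# comparison operator inside an `if` test so the program stays
# syntactically valid but its behavior becomes defective.
_FLIP = {ast.Lt: ast.GtE, ast.GtE: ast.Lt,
         ast.Gt: ast.LtE, ast.LtE: ast.Gt,
         ast.Eq: ast.NotEq, ast.NotEq: ast.Eq}

def _mutable(node: ast.AST) -> bool:
    # An `if` whose test is a comparison we know how to flip.
    return (isinstance(node, ast.If)
            and isinstance(node.test, ast.Compare)
            and type(node.test.ops[0]) in _FLIP)

def generate_defects(source: str) -> list[str]:
    """Return one defective variant per flippable if-condition."""
    n_sites = sum(_mutable(n) for n in ast.walk(ast.parse(source)))
    variants = []
    for target in range(n_sites):
        tree = ast.parse(source)      # fresh tree: one mutation per variant
        seen = 0
        for node in ast.walk(tree):
            if _mutable(node):
                if seen == target:    # flip only the target site
                    node.test.ops[0] = _FLIP[type(node.test.ops[0])]()
                seen += 1
        variants.append(ast.unparse(tree))
    return variants
```

Producing one mutation per variant keeps each generated test input minimal, so a wrong "defect-free" prediction can be attributed to a single semantic change.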

Key words: Defect prediction, Machine learning, Large language models

CLC number: TP311.53