Analysis of the Code Quality of Code Automatic Generation Tool Github Copilot

Computer Science, 2025, Vol. 52, Issue 7: 37-49. doi: 10.11896/jsjkx.240600076

WANG Dongyu, MO Ran, ZHAN Wenjing, JIANG Yingjie

School of Computer Science, Central China Normal University, Wuhan 430079, China

Received: 2024-06-11 Revised: 2024-09-21 Published: 2025-07-17
Corresponding author: MO Ran (moran@ccnu.edu.cn)
About the authors: WANG Dongyu (wangdongyu156@mails.ccnu.edu.cn), born in 2000, postgraduate, is a member of CCF (No. Q5469G). His main research interests include code generation models.
MO Ran, born in 1989, Ph.D., professor. His main research interests include software architecture analysis, software data mining, software defect analysis, and intelligent software engineering.
Supported by:
Key Programs of the Interdisciplinary Research Platform at Central China Normal University (CCNU24JCPT015).

Abstract

Abstract: GitHub Copilot is a generative-AI code generation tool launched by GitHub and OpenAI in 2022. One of its core functions is to generate implementation code from natural-language descriptions of the desired functionality. This extension of AI into programming has attracted wide discussion and attention in recent years. So far, attention has focused mainly on comparing AI programming with human programming, for example the programming efficiency of AI and human programmers and the performance of the code each produces. However, there is little research on the characteristics of Copilot-generated code itself, particularly its code quality: what defects the generated code contains, whether these defects lead to program errors, and whether the code is easy to understand. Code quality directly determines the lifespan and durability of a software project, and analyzing the quality of AI-generated code helps developers use and improve such code generation tools. This paper uses tools to extract all open-source problems from LeetCode (2,033 in total) as data samples for testing Copilot, generates code suggestions in three programming languages (Java, JavaScript, and Python), submits them, and records their execution results. By statically analyzing the code suggestion files with SonarQube and combining the analysis with the execution results, the paper evaluates Copilot's code quality along three dimensions: reliability, maintainability, and complexity. The results show that: 1) Copilot-generated code is relatively reliable. For Java, JavaScript, and Python, 7, 5, and 9 types of bugs are identified, respectively. The proportion of code suggestions involving bugs does not exceed 3% in any of the three languages, but more than 50% of the bug-related suggestions fail the test cases. 2) Copilot's code suggestions have poor maintainability. For Java, JavaScript, and Python, 47, 23, and 20 types of code smells are detected, respectively. More than 40% of the code suggestions in each of the three languages contain code smells, and more than 50% of the smell-related suggestions fail the test cases. 3) Copilot-generated code is easy to understand. The complexity of most code suggestions does not exceed the predefined thresholds, and no more than 6% of the suggestions are flagged for abnormal complexity. Finally, based on the experimental findings, the paper proposes practical recommendations for improving Copilot and discusses potential future research directions for such tools.

Key words: Automatic code generation, Code quality, Code reliability, Code maintainability, Code complexity
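
The two kinds of proportion reported in the abstract (the share of code suggestions that contain an issue, and the share of those flagged suggestions that fail the test cases) can be illustrated with a minimal sketch. The record layout, the field names, and the `issue_stats` helper below are hypothetical and are not taken from the paper; the sample data is toy data for illustration only, not the paper's measurements.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """One generated code suggestion (hypothetical record layout)."""
    language: str
    passed_tests: bool  # did the submission pass all test cases?
    bug_count: int      # issues a static analyzer classified as bugs
    smell_count: int    # issues a static analyzer classified as code smells


def issue_stats(suggestions, language, issue_attr):
    """Return (share of suggestions with the issue, share of those that failed tests)."""
    in_lang = [s for s in suggestions if s.language == language]
    flagged = [s for s in in_lang if getattr(s, issue_attr) > 0]
    share_flagged = len(flagged) / len(in_lang) if in_lang else 0.0
    failed = [s for s in flagged if not s.passed_tests]
    share_failed = len(failed) / len(flagged) if flagged else 0.0
    return share_flagged, share_failed


# Toy data for illustration only; these are not the paper's measurements.
sample = [
    Suggestion("Python", True, 0, 0),
    Suggestion("Python", False, 1, 2),
    Suggestion("Python", True, 0, 1),
    Suggestion("Python", False, 0, 0),
    Suggestion("Java", True, 0, 3),
]

bug_share, bug_fail = issue_stats(sample, "Python", "bug_count")
smell_share, smell_fail = issue_stats(sample, "Python", "smell_count")
print(f"bugs: {bug_share:.0%} flagged, {bug_fail:.0%} of those failed")
print(f"smells: {smell_share:.0%} flagged, {smell_fail:.0%} of those failed")
```

Keeping the two ratios separate matters when reading the results: a low share of flagged suggestions and a high failure rate among the flagged ones can both hold at once, which is the pattern the abstract reports for bugs.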

CLC number: TP391

WANG Dongyu, MO Ran, ZHAN Wenjing, JIANG Yingjie. Analysis of the Code Quality of Code Automatic Generation Tool Github Copilot[J]. Computer Science, 2025, 52(7): 37-49. https://doi.org/10.11896/jsjkx.240600076

