Nonsense Variable Names Detection Method Based on Lexical Features and Data Mining

doi:10.11896/jsjkx.231100030

Abstract

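Abstract (translated from Chinese): Identifiers are an important part of code and one of the key elements by which people understand its semantics. Variable names are among the most common identifiers, and their quality matters greatly for the readability and comprehensibility of code. However, for various reasons, programmers often use meaningless variable names such as “a” and “var”. Such nonsense variable names severely reduce the comprehensibility of code and need to be detected and refactored (renamed). To this end, an automated approach based on lexical features and data mining is proposed to detect nonsense variable names in code. First, an empirical analysis of nonsense variable names in open-source code shows that they are usually short and contain no meaningful words, so lexical features can be used to select suspicious variable names that are short and contain no meaningful words. If a suspicious variable name contains an abbreviation, an abbreviation expansion algorithm is applied to obtain the full variable name. Then, a data mining algorithm determines whether the suspicious variable name is a conventional, commonly used variable name. Some common variable names, such as “i” and “e”, carry no explicit literal meaning, but programmers can understand what the variable represents through established convention; they are therefore not counted as nonsense variable names and need no refactoring. If a suspicious variable name is not a conventional, commonly used name, it is judged to be a nonsense variable name, and the programmer is prompted to rename it. Experiments on open-source datasets show that the approach is highly accurate, with an average precision of 85% and an average recall of 91.5%.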

Keywords (translated from Chinese): software refactoring, code quality, data mining, nonsense variable names, lexical features

Abstract: Identifiers are an important part of code and one of the key elements for understanding its semantics. Variables are widely used to represent objects in programs, and a variable's name can serve as a major clue to its responsibility if the name is chosen carefully and properly. However, developers frequently construct unqualified variable names (e.g., “a”, “var”). Such nonsense variable names have a severe negative impact on the readability and maintainability of software applications, and the automated identification of bad smells of this kind is one of the hot topics in the field of software refactoring. To identify nonsense names automatically, we conduct an empirical study to determine the key features that can be exploited to distinguish nonsense names from well-constructed, meaningful ones. The results of the study suggest that nonsense variable names are often short and rarely contain meaningful words. Based on this finding, we propose a heuristics and data mining-based approach to identifying nonsense variable names. It first retrieves suspicious variable names based on lexical analysis. On the resulting suspicious names, it conducts abbreviation expansion-based filtering to exclude variable names that are carefully constructed as abbreviations of meaningful words. Finally, it conducts data mining-based filtering to further exclude well-known symbols (e.g., “i”, “e”). Experimental results on open-source datasets show that the proposed method has high accuracy: its average precision and recall are 85% and 91.5%, respectively.

Key words: Software refactoring, Code quality, Data mining, Nonsense variable names, Lexical features
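
The three filtering stages described in the abstract (lexical filtering, abbreviation expansion, and data mining of conventional names) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny dictionary, the abbreviation table, the length cutoff, and the frequency threshold are all hypothetical placeholders, and a simple frequency count stands in for the data mining algorithm, which the abstract does not specify.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the abstract does not give the paper's dictionary,
# abbreviation expansion algorithm, or mining thresholds.
DICTIONARY = {"count", "index", "user", "name", "total", "buffer", "message"}
ABBREVIATIONS = {"cnt": "count", "idx": "index", "buf": "buffer", "msg": "message"}
MAX_SUSPICIOUS_LENGTH = 3    # assumed cutoff for a "short" name
MIN_CONVENTION_SHARE = 0.05  # assumed corpus share for a "well-known" name


def split_tokens(name):
    """Split a camelCase or snake_case name into lowercase alphabetic tokens."""
    return [part.lower() for part in re.findall(r"[A-Za-z][a-z]*", name)]


def is_suspicious(name):
    """Stage 1 (lexical filter): the name is short and contains no dictionary word."""
    has_word = any(token in DICTIONARY for token in split_tokens(name))
    return len(name) <= MAX_SUSPICIOUS_LENGTH and not has_word


def expands_to_word(name):
    """Stage 2 (abbreviation filter): some token abbreviates a meaningful word."""
    return any(token in ABBREVIATIONS for token in split_tokens(name))


def mine_conventional_names(corpus_names):
    """Stage 3 setup: short names that are frequent in a corpus count as conventions."""
    total = len(corpus_names)
    counts = Counter(n for n in corpus_names if len(n) <= MAX_SUSPICIOUS_LENGTH)
    return {n for n, c in counts.items() if c / total >= MIN_CONVENTION_SHARE}


def find_nonsense_names(names, conventional):
    """Return the names that survive all three filters, in input order."""
    nonsense = []
    for name in names:
        if not is_suspicious(name):
            continue  # long enough or contains a real word
        if expands_to_word(name):
            continue  # a deliberate abbreviation of a meaningful word
        if name in conventional:
            continue  # a well-known conventional symbol such as "i"
        nonsense.append(name)
    return nonsense


# Toy corpus of variable names, invented purely for illustration.
corpus = ["i"] * 30 + ["e"] * 10 + ["userName"] * 50 + ["a"] * 2 + ["var"] + ["cnt"] * 7
conventional = mine_conventional_names(corpus)
flagged = find_nonsense_names(["i", "e", "a", "var", "cnt", "userName", "idx"], conventional)
print(flagged)
```

On this toy corpus, “i” and “e” are frequent enough to be treated as conventions and “cnt” and “idx” are recognized as abbreviations, so only “a” and “var” are flagged, which matches the examples given in the abstract.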

CLC number:

TP311

姜艳杰, 东春浩, 刘辉. 一种基于词法特征和数据挖掘的无意义变量名检测方法[J]. 计算机科学, 2024, 51(6): 23-33. https://doi.org/10.11896/jsjkx.231100030

JIANG Yanjie, DONG Chunhao, LIU Hui. Nonsense Variable Names Detection Method Based on Lexical Features and Data Mining[J]. Computer Science, 2024, 51(6): 23-33. https://doi.org/10.11896/jsjkx.231100030

