Computer Science, 2023, Vol. 50, Issue (12): 75-81. doi: 10.11896/jsjkx.230100115

• Computer Software •

CodeBERT-based Language Model for Design Patterns

CHEN Shifei, LIU Dong, JIANG He

  1. School of Software Technology, Dalian University of Technology, Dalian, Liaoning 116620, China
  • Received: 2023-01-31 Revised: 2023-05-22 Online: 2023-12-15 Published: 2023-12-07
  • Corresponding author: JIANG He (jianghe@dlut.edu.cn)
  • About author: CHEN Shifei (chenshifei@mail.dlut.edu.cn), born in 1998, master. His main research interest is natural language processing.
    JIANG He, born in 1980, Ph.D., professor, is a distinguished member of China Computer Federation. His main research interests include system software and software engineering.
  • Supported by:
    National Natural Science Foundation of China (61722202).

Abstract: Design patterns summarize experience from practical software design and are an effective aid to software design. Most existing research on design pattern mining aims to recognize design pattern instances in source code; modelling design patterns with natural-language corpora remains largely unexplored. To improve the recommendation performance of design pattern language classification models while taking code, class diagrams, and object collaborations into account, a CodeBERT-based design pattern classification mining model named dpCodeBERT is proposed, which enables a side-by-side understanding of design patterns in natural language and programming language. First, multi-class classification data and code search data are synthesized by random combination and used as model inputs, and dpCodeBERT is used to obtain the attention weights that the transformer layers assign to each token. Second, the attention weights of tokens and statements are analyzed to identify the more effective categories of input, and the training inputs are refined accordingly. Finally, dpCodeBERT is applied to specific software engineering downstream tasks, such as design pattern selection and design pattern code search, by mapping distributed features to the sample space through fully connected layers and outputting multiple values. Experiments on a design pattern selection dataset of 80 software design problems show that, compared with baseline models, dpCodeBERT improves the ratio of correct detection of design pattern (RCDDP) and the mean reciprocal rank (MRR) by 10%~20% on average, making design pattern selection more accurate. Through an in-depth study of the model's data requirements, dpCodeBERT draws on CodeBERT's understanding of class-level code and explores the application of CodeBERT to design pattern mining; it offers accurate prediction and good extensibility.

Key words: Design pattern mining, Natural language processing, Pre-trained language models, CodeBERT, Model fine-tuning, Vectorization

CLC Number: 

  • TP311
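
The abstract reports mean reciprocal rank (MRR) as one of its two evaluation metrics. As a reading aid, the following is a minimal sketch of the standard MRR definition: the average, over all queries, of 1/rank of the correct item, with a miss counted as 0. It is not the authors' evaluation code; the function name `mean_reciprocal_rank`, the example design problems, and the ranked pattern lists are all hypothetical.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the correct item; a miss contributes 0."""
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not rankings:
        raise ValueError("at least one query is required")
    total = 0.0
    for ranked, target in zip(rankings, targets):
        # Ranks are 1-based: the top recommendation has rank 1.
        for position, candidate in enumerate(ranked, start=1):
            if candidate == target:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical design problems with ranked pattern recommendations.
rankings = [
    ["Observer", "Mediator", "Strategy"],   # correct pattern at rank 1
    ["Adapter", "Facade", "Proxy"],         # correct pattern at rank 2
    ["Builder", "Prototype", "Singleton"],  # correct pattern not ranked
]
targets = ["Observer", "Facade", "Factory Method"]
score = mean_reciprocal_rank(rankings, targets)
print(f"MRR = {score:.3f}")  # (1 + 1/2 + 0) / 3 = 0.500
```

A higher MRR means the correct design pattern tends to appear nearer the top of the recommendation list, which is why it suits a selection task where only the first few suggestions are likely to be read.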