Computer Science ›› 2019, Vol. 46 ›› Issue (11): 168-175. doi: 10.11896/jsjkx.191100504C

• Software & Database Technology •

Modified Neural Language Model and Its Application in Code Suggestion

ZHANG Xian, BEN Ke-rong

  1. (School of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China)
  • Received: 2018-10-16 Online: 2019-11-15 Published: 2019-11-14
  • Corresponding author: BEN Ke-rong (born 1963), male, Ph.D., professor, CCF senior member. His main research interests include software quality assurance and artificial intelligence. E-mail: benkerong08@163.com
  • About the author: ZHANG Xian (born 1990), male, Ph.D. candidate, CCF student member. His main research interests include software defect mining and machine learning. E-mail: tomtomzx@foxmail.com
  • Funding:
    This work was supported by the National Security Major Basic Research Program of China (613315).

Abstract: Language models characterize the occurrence probabilities of text segments. As an important class of models in natural language processing, they have been widely applied to software analysis tasks in recent years, for example code suggestion. To improve the model's ability to learn code features, this paper proposes a modified recurrent neural network language model called CodeNLM. By analyzing source code sequences represented as word embeddings, the model captures regularities in code and estimates the joint probability distribution of the sequences. Because existing models learn only from code data and therefore underuse the available information, an additional-information guidance strategy is proposed, which uses non-code information to improve the characterization of code regularities. To suit the characteristics of the language modeling task, a layer-by-layer node increment strategy is proposed, which optimizes the network structure to make information transmission more effective. In experiments on 9 Java projects totaling 2.03 million lines of code, CodeNLM achieves markedly better perplexity than the compared n-gram models and conventional neural language models. In the code suggestion task, its average accuracy (measured by MRR) is 3.4%~24.4% higher than that of the compared methods. The results show that CodeNLM models programming languages and performs code suggestion effectively, and that it has a strong ability to learn long-distance information.

Key words: Code suggestion, Language model, Natural language processing, Recurrent neural network, Software analysis
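
The abstract reports results in terms of perplexity and MRR but does not define either metric on this page. The following is a minimal sketch under the standard textbook definitions, not the authors' evaluation code. It assumes the language model has already assigned a conditional probability to each token, and that each suggestion query records the 1-based rank of the correct token. The function names `perplexity` and `mean_reciprocal_rank` are hypothetical and chosen only for illustration.

```python
import math
from typing import Optional, Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity of one token sequence.

    token_probs[t] is assumed to be P(w_t | w_1..w_{t-1}) as assigned by
    some language model; every value must lie in (0, 1].
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    # Chain rule: log P(w_1..w_T) = sum over t of log P(w_t | w_<t)
    log_joint = sum(math.log(p) for p in token_probs)
    # Perplexity is exp of the average negative log-likelihood per token
    return math.exp(-log_joint / len(token_probs))


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean reciprocal rank over suggestion queries.

    ranks[i] is the 1-based rank of the correct token in the i-th
    suggestion list, or None if the correct token was not suggested.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    # A missed query contributes a reciprocal rank of 0
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)
```

Lower perplexity means the model finds the code less surprising, and an MRR closer to 1 means the correct token tends to appear nearer the top of the suggestion list. For instance, a model that spreads probability uniformly over four candidates at every step has a perplexity of about 4.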

CLC Number:

  • TP311.5