Computer Science ›› 2019, Vol. 46 ›› Issue (11): 168-175. doi: 10.11896/jsjkx.191100504C

• Software and Database Technology •

Modified Neural Language Model and Its Application in Code Suggestion

ZHANG Xian, BEN Ke-rong

  1. (School of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China)
  • Received: 2018-10-16; Online: 2019-11-15; Published: 2019-11-14
  • Corresponding author: BEN Ke-rong (born 1963), male, Ph.D., professor, senior member of CCF; his main research interests include software quality assurance and artificial intelligence. E-mail: benkerong08@163.com
  • About the first author: ZHANG Xian (born 1990), male, Ph.D. candidate, student member of CCF; his main research interests include software defect mining and machine learning. E-mail: tomtomzx@foxmail.com
  • Funding: Supported by the National Security Major Basic Research Program (613315).



Abstract: Language models characterize the occurrence probabilities of text segments. As an important class of models in natural language processing, they have been widely applied to software analysis tasks, such as code suggestion, in recent years. To enhance the ability to learn code features, this paper proposes a modified recurrent neural network language model, called CodeNLM. By analyzing source code sequences represented as word embeddings, the model captures regularities in code and estimates the joint probability distribution of token sequences. Considering that existing models learn only from code data and thus do not fully exploit the available information, an additional-information guidance strategy is proposed, which improves the characterization of code regularities with the assistance of non-code information. Targeting the characteristics of the language modeling task, a layer-by-layer incremental node-setting strategy is proposed, which optimizes the network structure and improves the effectiveness of information transmission. In experiments on 9 Java projects totaling 2.03 million lines of code, CodeNLM achieves clearly better perplexity than the compared n-gram models and neural language models. In the code suggestion task, the average accuracy (MRR) of the proposed model is 3.4% to 24.4% higher than that of the compared methods. The results show that CodeNLM effectively models programming language, performs code suggestion well, and possesses a strong ability to learn long-distance information.

Key words: Software analysis, Code suggestion, Natural language processing, Language model, Recurrent neural network
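The abstract describes language models as estimators of the joint probability of token sequences, with n-gram models as the baseline family. As a minimal illustration (a toy sketch, not the paper's CodeNLM or its training data), an add-one-smoothed bigram model over tokenized code can be built as follows; the loop-header "corpus" is invented for the example:

```python
import math
from collections import Counter

def train_bigram_lm(sequences):
    """Count unigrams and bigrams over token sequences, with start/end markers."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        toks = ["<s>"] + seq + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def log_prob(seq, unigrams, bigrams):
    """Add-one smoothed natural-log joint probability of a token sequence."""
    vocab_size = len(unigrams)
    toks = ["<s>"] + seq + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(toks, toks[1:])
    )

# Toy "training set" of tokenized Java-like loop headers (hypothetical data).
corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int j = 0 ; j < m ; j ++ )".split(),
]
uni, bi = train_bigram_lm(corpus)

natural = log_prob("for ( int i = 0 ; i < n ; i ++ )".split(), uni, bi)
scrambled = log_prob(") ++ i ; n < i ; 0 = i int ( for".split(), uni, bi)
# A same-length sequence that follows the learned regularities scores higher:
# natural > scrambled
```

A neural language model such as CodeNLM replaces the counted conditional probabilities with ones computed by a recurrent network over embedded tokens, which is what gives it the long-distance learning ability the abstract emphasizes.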

CLC Number: TP311.5
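The evaluation reported in the abstract uses perplexity (for language modeling quality) and MRR (for code suggestion accuracy). Both are standard metrics and can be computed as below; the candidate lists and vocabulary size are illustrative assumptions, not the paper's actual experimental setup:

```python
import math

def perplexity(token_log_probs):
    """Perplexity: exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def mrr(ranked_lists, targets):
    """Mean Reciprocal Rank: average of 1/rank of the correct token in each
    ranked suggestion list (an absent token contributes 0)."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(ranked_lists)

# A uniform model over a 16-token vocabulary has perplexity 16.
pp = perplexity([math.log(1 / 16)] * 10)

# Three hypothetical suggestion events: correct token ranked 1st, 2nd, missing.
score = mrr([["i", "j"], ["(", "{", ";"], ["n"]], ["i", "{", "x"])
# → (1 + 1/2 + 0) / 3 = 0.5
```

Lower perplexity and higher MRR are better, which is the direction of the improvements the abstract reports for CodeNLM.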