计算机科学 (Computer Science) ›› 2019, Vol. 46 ›› Issue (11): 168-175. doi: 10.11896/jsjkx.191100504C
张献, 贲可荣
ZHANG Xian, BEN Ke-rong
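Abstract: A language model characterizes the probability with which a segment of text occurs. As an important class of models in natural language processing, language models have in recent years been widely applied to software analysis tasks such as code suggestion. To improve the model's ability to learn code features, this paper proposes an improved recurrent neural network language model, CodeNLM. By analyzing source code sequences represented as word vectors, the model captures regularities in code and estimates the joint probability distribution of a sequence. Because existing models learn from code data alone and therefore underuse the available information, an additional-information guidance strategy is proposed, in which non-code information helps the model characterize code regularities. To suit the characteristics of the language modeling task, a layer-by-layer node increase strategy is proposed, which optimizes the network structure to make information propagation more effective. In experiments on 9 Java projects totaling 2.03 million lines of code, CodeNLM achieves markedly better perplexity than n-gram models and traditional neural language models, and in a code suggestion application its average accuracy (measured by MRR) is 3.4% to 24.4% higher than that of the compared methods. The results show that CodeNLM is effective for programming language modeling and code suggestion, and that it has a strong ability to learn long-distance information.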
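The two evaluation metrics named in the abstract, perplexity and MRR (mean reciprocal rank), can be illustrated with a minimal sketch. This uses the standard textbook definitions and is not the authors' evaluation code; the function names `perplexity` and `mean_reciprocal_rank` and the toy inputs are assumptions made for illustration only.

```python
import math


def perplexity(token_probs):
    """Perplexity of a token sequence.

    token_probs[i] is the probability the language model assigned to the
    i-th token given its preceding context. Lower perplexity is better.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (cross-entropy in nats).
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


def mean_reciprocal_rank(ranked_suggestions, targets):
    """MRR over a set of code suggestion queries.

    ranked_suggestions[i] is the ranked list of tokens suggested for query i;
    targets[i] is the token that actually came next. A query whose target is
    missing from its suggestion list contributes 0.
    """
    if not targets:
        raise ValueError("need at least one query")
    total = 0.0
    for suggestions, target in zip(ranked_suggestions, targets):
        if target in suggestions:
            total += 1.0 / (suggestions.index(target) + 1)
    return total / len(targets)


# Hypothetical toy data, not taken from the paper.
toy_probs = [0.25, 0.25, 0.25, 0.25]
toy_suggestions = [["a", "b"], ["x", "y", "z"], ["q"]]
toy_targets = ["a", "z", "m"]

toy_ppl = perplexity(toy_probs)
toy_mrr = mean_reciprocal_rank(toy_suggestions, toy_targets)
```

In this toy case the model assigns probability 0.25 to every token, so the perplexity equals 4: the model is as uncertain as a uniform choice among four tokens. The three queries rank the true token first, third, and not at all, giving reciprocal ranks of 1, 1/3, and 0.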