Computer Science (计算机科学), 2019, Vol. 46, Issue 11: 168-175. doi: 10.11896/jsjkx.191100504C
张献, 贲可荣
ZHANG Xian, BEN Ke-rong
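Abstract: A language model characterizes the probability with which a segment of text occurs. As an important class of models in natural language processing, language models have in recent years been widely applied to software analysis tasks such as code suggestion. To improve the model's ability to learn code features, this paper proposes an improved recurrent neural network language model, CodeNLM. By analyzing source code sequences represented as word vectors, the model captures regularities in code and estimates the joint probability distribution of a sequence. Because existing models learn only from code data and therefore make insufficient use of the available information, an additional-information guidance strategy is proposed, in which non-code information assists in characterizing code regularities. In view of the characteristics of the language modeling task, a layer-by-layer node increment strategy is proposed, which improves the effectiveness of information propagation by optimizing the network structure. In experiments on 9 Java projects totaling 2.03 million lines of code, CodeNLM achieves perplexity clearly better than that of n-gram models and traditional neural language models, and in the code suggestion application its average accuracy (MRR metric) is 3.4% to 24.4% higher than that of the compared methods. The experimental results show that CodeNLM can effectively perform programming language modeling and code suggestion, and that it has a strong ability to learn long-distance information.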
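The two evaluation metrics named in the abstract, perplexity and MRR, can be illustrated with a minimal Python sketch. This follows the standard textbook definitions and is not taken from the paper's implementation; the function names `perplexity` and `mean_reciprocal_rank` and the input conventions (a list of per-token conditional probabilities, and a list of 1-based ranks with `None` for a miss) are assumptions made for illustration only.

```python
import math
from typing import Optional, Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity of one token sequence.

    token_probs[i] is assumed to be the model's probability of token i
    given the tokens before it, so their product is the joint probability
    of the sequence. Perplexity is exp of the negative mean log-probability.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean reciprocal rank over a set of suggestion queries.

    Each entry is the 1-based position of the correct token in the
    suggestion list, or None if it was not suggested (contributes 0).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token is as uncertain
    # as a uniform choice among four tokens.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
    # Correct token ranked 1st, 2nd, missing, and 4th in four queries.
    print(mean_reciprocal_rank([1, 2, None, 4]))
```

Lower perplexity means the model assigns higher probability to the held-out code, while higher MRR means the correct token tends to appear nearer the top of the suggestion list.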