Computer Science (计算机科学) ›› 2020, Vol. 47 ›› Issue (9): 1-9. doi: 10.11896/jsjkx.191200170

• Computer Software

CodeSearcher: Code Query Using Functional Descriptions in Natural Languages

LU Long-long, CHEN Tong, PAN Min-xue, ZHANG Tian

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2019-12-30 Published: 2020-09-10
  • Corresponding author: PAN Min-xue (mxp@nju.edu.cn)
  • Author contact: maomao75979@gmail.com
  • About author: LU Long-long, born in 1993, master. His main research interests include software verification and model checking.
    PAN Min-xue, born in 1983, Ph.D., associate professor, is a member of China Computer Federation. His main research interests include software modelling & verification, software analysis & testing, mobile computing and intelligent software engineering.
  • Supported by:
    National Natural Science Foundation of China (61972193) and Fundamental Research Funds for the Central Universities (14380022, 14380020).

Abstract: During project development, developers write code to implement specific functions. When they are unsure how to implement a function in a particular programming language, they usually search documentation or online resources for code, so the effectiveness of code query directly affects development efficiency. Many tools have been proposed to assist developers with code query, but most of them require complex inputs or have low matching precision. We propose CodeSearcher, a code query approach based on natural-language functional descriptions. CodeSearcher converts question-and-answer records from Stack Overflow, a Q&A website for software development, into 〈natural language description, code snippet〉 data pairs, and uses a neural network model, together with a corresponding training method, to map "natural language descriptions" and "code snippets" into the same vector space and match them. Developers can therefore query code with a natural-language description of the function they want to implement. CodeSearcher differs from conventional code query systems in two ways. First, it relies only on the source code itself, not on comments or descriptions of the code, so it accepts user-provided code bases of all kinds and supports more code query scenarios. Second, it no longer limits code query to a one-shot flow of "entering the natural language description and feeding back the code snippets": it adds a code Q&A step that uses the elements that differ among the returned code snippets (their characteristic keywords) to help developers pick the target code, so that they do not have to read every returned snippet in detail. The experimental results show that CodeSearcher performs better than the baseline, with higher precision.
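The matching step described in the abstract, in which descriptions and code snippets are placed in one vector space and compared, can be sketched roughly as follows. This is only an illustration, not the authors' model: a bag-of-words count vector stands in for the learned neural encoders, and the names `tokenize`, `embed`, `cosine`, and `search`, as well as the sample snippets, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text or code into lowercase word tokens, breaking camelCase identifiers."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z0-9]+", spaced.lower())


def embed(text):
    """Stand-in encoder (assumption): a sparse token-count vector over a shared vocabulary.

    A trained neural encoder would replace this function; counts are used here
    only so that descriptions and code live in the same vector space.
    """
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, snippets, top_k=2):
    """Rank code snippets by similarity to a natural-language query."""
    q = embed(query)
    scored = [(cosine(q, embed(code)), code) for code in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Hypothetical code base: note that neither snippet carries comments.
snippets = [
    "Document doc = builder.parse(new InputSource(new StringReader(xmlString)));",
    "int total = list.stream().mapToInt(Integer::intValue).sum();",
]
for score, code in search("convert string to xml document", snippets):
    print(round(score, 3), code)
```

Because the ranking uses only tokens from the code itself, the sketch mirrors the abstract's point that no comments or descriptions of the code are needed. A learned model would additionally capture matches between words that do not literally overlap.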

Keywords: Code query, Natural language processing, Stack Overflow
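The code Q&A step mentioned in the abstract, which uses characteristic keywords to help the developer choose among returned snippets, might look roughly like the sketch below. It assumes that "differing elements" can be approximated by tokens that occur in exactly one candidate; the helper names `tokens`, `distinguishing_keywords`, and `narrow`, and the sample candidates, are hypothetical and not taken from the paper.

```python
import re


def tokens(code):
    """Return the set of lowercase identifier-like tokens in a code snippet."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", code.lower()))


def distinguishing_keywords(snippets):
    """Map each snippet index to the tokens that occur in that snippet only."""
    token_sets = [tokens(s) for s in snippets]
    result = {}
    for i, own in enumerate(token_sets):
        # Union of the tokens of every other candidate.
        others = set().union(*(t for j, t in enumerate(token_sets) if j != i))
        result[i] = own - others
    return result


def narrow(snippets, keyword):
    """Keep only the snippets that contain the keyword the developer picked."""
    return [s for s in snippets if keyword.lower() in tokens(s)]


# Hypothetical candidates returned by a query such as "read a file".
candidates = [
    "reader = csv.reader(open(path))",
    "data = json.load(open(path))",
]
print(distinguishing_keywords(candidates))
print(narrow(candidates, "json"))
```

The system can present the distinguishing keywords as a question (for example, whether the target code involves `csv` or `json`), and the answer filters the candidates without the developer reading each snippet in full.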

CLC Number:

  • TP391