Computer Science (计算机科学) ›› 2017, Vol. 44 ›› Issue (4): 35-38. doi: 10.11896/j.issn.1002-137X.2017.04.008

• NASAC 2015 •

An LDA-Based Method for Automatically Generating Topic Summaries of Software Code

李文鹏,赵俊峰,谢冰   

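  1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; Key Laboratory of High Confidence Software Technologies (Ministry of Education), Beijing 100871
     School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; Key Laboratory of High Confidence Software Technologies (Ministry of Education), Beijing 100871; Peking University Information Technology Institute (Tianjin Binhai), Tianjin 300450
     School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; Key Laboratory of High Confidence Software Technologies (Ministry of Education), Beijing 100871; Peking University Information Technology Institute (Tianjin Binhai), Tianjin 300450
  • Publication date: 2018-11-13 Release date: 2018-11-13
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61472007), the Special Scientific Research Fund for the Quality Inspection Public Welfare Industry (201510209), and the National Key Special Project (2016YFB1000801).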

Summary Extraction Method for Code Topic Based on LDA

LI Wen-peng, ZHAO Jun-feng and XIE Bing   

  • Online:2018-11-13 Published:2018-11-13

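Abstract: Understanding what software code does is an important step in software reuse. Code comprehension methods based on topic modeling can mine the latent topics in software code, and these topics represent, to some extent, the functions the code implements. However, code topics mined with topic modeling suffer from vague semantics and are hard to understand. Latent Dirichlet Allocation (LDA) is a commonly used topic modeling technique that has achieved good results in mining topics from software code, but it has the same problem. Explanatory text descriptions therefore need to be generated for the topics. Besides using topic modeling to generate topics from source code, the LDA-based method for automatically generating topic summaries of software code also uses software resources that describe the functions of a software system, such as documentation and question-and-answer information, to mine descriptive text for the code topics and extract summaries, which helps developers better understand what the software does.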

Keywords: software code, LDA, code function mining, software documentation, summarization

Abstract: Understanding the function of software code is an important part of software reuse. Topic modeling techniques can mine latent topics from software code, and these topics represent the functions the software implements. However, the mined topics lack unambiguous explanations, which makes them hard for developers to understand. Latent Dirichlet allocation (LDA) is a popular topic modeling technique. Studies that apply LDA to software code have obtained good results, but they share the same problem of topic description. In this paper, in addition to using topic modeling to generate topics from source code, explanatory text descriptions are generated for code topics from software resources such as documents, question-and-answer pairs, and mailing lists. These descriptions can help users understand the function of software code. Experiments show that the proposed approach is effective.

Key words: Software code, LDA, Code function mining, Software document, Summarization
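
The last step described in the abstract, choosing descriptive text for a mined topic, can be illustrated with a minimal sketch. It ranks candidate sentences by cosine similarity with a topic's word weights. This is an assumption made for illustration only, not the paper's actual method, which the abstract does not specify. The topic words, their weights, the candidate sentences, and the function names `tokenize`, `cosine`, and `summarize_topic` are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize_topic(topic_words, sentences, k=1):
    """Return up to k sentences most similar to the topic's word weights."""
    scored = [(cosine(topic_words, Counter(tokenize(s))), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Sentences that share no word with the topic are never returned.
    return [s for score, s in scored[:k] if score > 0]


# Hypothetical topic (word -> probability), of the kind an LDA model outputs.
topic = {"index": 0.30, "document": 0.25, "writer": 0.20, "field": 0.10}

# Hypothetical candidate sentences taken from documentation or Q&A posts.
candidates = [
    "The writer adds each document to the index.",
    "Queries are parsed into a tree of clauses.",
    "Release notes are published on the website.",
]

summary = summarize_topic(topic, candidates, k=1)
```

Here `summary` holds the single candidate that shares words with the topic. A fuller pipeline would first train an LDA model on identifiers and comments from the source code, then gather candidate sentences from the project's documents, question-and-answer pairs, and mailing lists.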

