Computer Science ›› 2014, Vol. 41 ›› Issue (9): 52-59. doi: 10.11896/j.issn.1002-137X.2014.09.008

• 2013 Service-Oriented Software

一种基于主题建模的代码功能挖掘工具

华哲邦,李萌,赵俊峰,邹艳珍,谢冰,李扬   

  1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; Key Laboratory of High Confidence Software Technologies (Ministry of Education), Beijing 100871; Digital China Information Systems Co., Ltd., Beijing 100085
  • Publication date: 2018-11-14  Release date: 2018-11-14
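  • Supported by:
    This work was supported by the National High Technology Research and Development Program of China (863 Program) (2012AA01A403), the National Natural Science Foundation of China (61121063), and the National Basic Research Program of China (973 Program) (2009CB320703).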

Code Function Mining Tool Based on Topic Modeling Technology

HUA Zhe-bang, LI Meng, ZHAO Jun-feng, ZOU Yan-zhen, XIE Bing and LI Yang

  • Online: 2018-11-14  Published: 2018-11-14

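Abstract (translated from Chinese): Code reuse is one of the important forms of software reuse, and reusers must understand the functions that a piece of code implements before they can reuse it effectively. Program comprehension methods based on topic modeling have been attracting growing attention from researchers because they help software developers and users better understand what software does. At present, such methods generally lack semantic analysis of the mined topics. The Code Function Mining (CFM) method proposed here, which is based on static code analysis and LDA, can serve as a supplement to them. CFM is a set of methods that take code as their object and mine, filter, organize, and describe topics. It generates a hierarchy of functional topics with descriptions, so that users can browse and learn the functions of the software more clearly and conveniently. The descriptions of the functional topics help reusers understand code functions, and the hierarchy lets them understand those functions at different levels of abstraction. CFM consists of four parts: mining topics, filtering topics, organizing topics, and describing topics. Based on the CFM method, a CFM tool was designed and implemented. The tool analyzes code submitted by users and presents the hierarchy of described functional topics through Web pages. Finally, an experimental analysis of several key algorithms in CFM verifies the effectiveness of the method.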

Keywords (translated from Chinese): software code, static code analysis, LDA, code function mining

Abstract: Code reuse is an important form of software reuse, and software engineers need to understand what a piece of code does before they can reuse it. This paper proposes a Code Function Mining (CFM) method based on static code analysis and LDA. CFM is a code-oriented method for mining, filtering, organizing, and describing topics. Its output is a hierarchy of functional topics with descriptions: the descriptions help reusers learn what the code does, and the hierarchy lets them understand the code's functions at different levels of abstraction. CFM can supplement existing methods based on topic modeling by making up for their lack of semantic analysis of topics. The method consists of four parts: topic mining, topic filtering, topic organizing, and topic describing. A CFM tool built on the method automatically analyzes code and presents the functional topic hierarchy to users through a Web page. To verify the validity of the method, an experimental analysis of several of its key algorithms is also presented.

Key words: Software code, Static code analysis, LDA, Code function mining
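
The abstract gives no implementation details, so the following is only an illustrative sketch of what the first CFM step (topic mining) might look like, not the authors' tool. It assumes that each source file is treated as one document whose words are the terms obtained by splitting identifiers, and it fits LDA with a small collapsed Gibbs sampler. The function names, the stop list, the hyperparameters, and the sample files are all hypothetical. Filtering, organizing, and describing topics are not covered.

```python
import random
import re
from collections import Counter

# Hypothetical stop list: language keywords that carry no domain meaning.
STOP_WORDS = {"class", "void", "return", "new", "string", "int", "public"}

def split_identifiers(source):
    """Extract identifiers and split camelCase / snake_case into lowercase terms."""
    terms = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", ident):
            terms.append(part.lower())
    return terms

def to_document(source):
    """Turn one source file into a bag of terms (assumed preprocessing)."""
    return [t for t in split_identifiers(source) if t not in STOP_WORDS]

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document and per-topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

def top_terms(topic_word, k, n):
    """Return up to n most frequent terms currently assigned to topic k."""
    return [w for w, c in topic_word[k].most_common(n) if c > 0]

# Hypothetical input: file name -> source text.
files = {
    "HttpClient.java": "class HttpClient { Response sendRequest(Request httpRequest) "
                       "{ return connection.send(httpRequest); } }",
    "HttpServer.java": "class HttpServer { void handleRequest(Request request) "
                       "{ Response response = new Response(); response.send(); } }",
    "FileCache.java": "class FileCache { void writeFile(File cacheFile) "
                      "{ cacheFile.write(buffer); } }",
    "FileReader.java": "class FileReader { File readFile(String filePath) "
                       "{ return new File(filePath); } }",
}
NUM_TOPICS = 2
docs = [to_document(src) for src in files.values()]
doc_topic, topic_word = lda_gibbs(docs, NUM_TOPICS)
for k in range(NUM_TOPICS):
    print("topic", k, top_terms(topic_word, k, 5))
```

On an input this small the resulting topics are not meaningful; the sketch only shows the data flow from identifiers to per-topic term counts, which a later step could filter and label.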

