Computer Science (计算机科学) ›› 2024, Vol. 51 ›› Issue (6): 44-51. doi: 10.11896/jsjkx.230300091

• Computer Software •

Software Data Clustering Method Combining Code Snippets and Hybrid Topic Models

WEI Linlin1,2, SHEN Guohua1,2,3, HUANG Zhiqiu1,2,3, CAI Mengnan1,2, GUO Feifei1,2

    1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
    2 Key Laboratory of Safety-Critical Software, Ministry of Industry and Information Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
    3 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093, China
  • Received: 2023-03-11 Revised: 2023-07-13 Online: 2024-06-15 Published: 2024-06-05
  • Corresponding author: SHEN Guohua (ghshen@nuaa.edu.cn)
  • About author: WEI Linlin (linlinwei@nuaa.edu.cn), born in 1998, postgraduate. His main research interests include software engineering and text mining.
    SHEN Guohua, born in 1976, Ph.D., is a senior member of CCF (No. E200018171S). His main research interests include software engineering, software metrics, semantic web and description logic.
  • Supported by:
    National Natural Science Foundation of China (61772270, U2241216) and Open Foundation of Civil Aviation Emergency Science and Technology Key Laboratory (NJ2022022).

Abstract: Clustering documents with topic models is common practice in text mining. Many studies apply topic models to data from software Q&A websites to analyze how different domains develop within the community. However, such software-related data often contains code snippets, and its text lengths are unevenly distributed, so modeling it with a single traditional topic model tends to yield unstable clustering results. This paper proposes a clustering method that combines code snippets with hybrid topic models. Using Stack Overflow as the data source, it constructs a dataset of the 60 Python third-party libraries with the most questions on the platform. After modeling, the dataset is divided into six domains: network security, data analysis, artificial intelligence, text processing, software development, and system terminal. Experimental results show that, on both automatic and manual evaluation metrics, topic modeling on code snippets combined with text yields good-quality cluster partitions, and that combining multiple models improves the stability and accuracy of the clustering results to some extent.

Key words: Code snippets, Topic model, Stack Overflow, Python, Clustering
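
The two ideas in the abstract, feeding code snippets into the model alongside prose and combining several models to stabilize the clustering, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how snippets are tokenized or how the models are combined. The helper names `post_tokens` and `co_association` are hypothetical, and the co-association matrix used here is a generic ensemble-clustering device swapped in purely for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: identifiers in code (library, function, class names) are useful topic words.
CODE_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
WORD = re.compile(r"[a-z]+")


def post_tokens(text, code_snippets):
    """Build one bag of words for a post from its prose and its code snippets."""
    tokens = WORD.findall(text.lower())
    for snippet in code_snippets:
        # Add every identifier-like token from the snippet, lowercased.
        tokens.extend(tok.lower() for tok in CODE_TOKEN.findall(snippet))
    return Counter(tokens)


def co_association(labelings):
    """For each pair of documents, return the fraction of models that
    assign both documents to the same cluster."""
    n_docs = len(labelings[0])
    agreement = {}
    for i, j in combinations(range(n_docs), 2):
        same = sum(1 for labels in labelings if labels[i] == labels[j])
        agreement[(i, j)] = same / len(labelings)
    return agreement


# Hypothetical post: one sentence of prose plus one code snippet.
bag = post_tokens(
    "How do I send a POST request?",
    ["import requests\nrequests.post(url, data=payload)"],
)

# Hypothetical cluster labels for three documents from three different models.
pairs = co_association([[0, 0, 1], [1, 1, 0], [0, 1, 1]])
```

Because the co-association score only asks whether two documents share a cluster, it does not depend on how each model happens to number its clusters, which makes outputs from different topic models directly comparable.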

CLC Number: TP311