Computer Science, 2024, Vol. 51, Issue (6): 44-51. doi: 10.11896/jsjkx.230300091
魏林林1,2, 沈国华1,2,3, 黄志球1,2,3, 蔡梦男1,2, 郭菲菲1,2
WEI Linlin1,2, SHEN Guohua1,2,3, HUANG Zhiqiu1,2,3, CAI Mengnan1,2, GUO Feifei1,2
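Abstract: Document clustering with topic models is a common practice in many text mining tasks. Many studies apply topic-model clustering to data from software Q&A sites to analyze how different domains develop within the community. However, such software-related data often contains code snippets and has an uneven distribution of text lengths, so modeling the text with a single traditional topic model tends to yield unstable clustering results. This paper proposes a clustering method that combines code snippets with a hybrid topic model. Using Stack Overflow as the data source, it constructs a dataset of the 60 Python third-party libraries with the most questions on the platform. After modeling, the dataset is divided into six domains: network security, data analysis, artificial intelligence, text processing, software development, and system terminal. Experimental results show that, on both automatic and human evaluation metrics, topic modeling on code snippets combined with text performs well in the quality of the resulting cluster partition, and that combining multiple models improves the stability and accuracy of the clustering results to some extent.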
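The abstract does not say how code snippets are merged with the prose or how the outputs of several models are combined, so the following is only a minimal sketch under stated assumptions. It assumes posts arrive as HTML with snippets inside `<code>` tags, builds one joint bag of words per post, and measures agreement between models with a co-association score (a standard consensus-clustering device that is swapped in here for illustration, not taken from the paper). The names `joint_bag`, `co_association`, and `code_weight` are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def joint_bag(post_html, code_weight=1):
    """Return one bag of words built from both prose and code tokens."""
    code = " ".join(CODE_RE.findall(post_html))
    # Remove code spans first, then strip the remaining HTML tags.
    prose = TAG_RE.sub(" ", CODE_RE.sub(" ", post_html))
    bag = Counter(w.lower() for w in WORD_RE.findall(prose))
    # Prefix code tokens so "read" in prose and "read" in code stay distinct.
    for tok in WORD_RE.findall(code):
        bag["code:" + tok.lower()] += code_weight
    return bag


def co_association(labelings):
    """Fraction of models that put each document pair in the same cluster.

    Only pairwise agreement is used, so cluster ids need not be aligned
    across models.
    """
    n_docs = len(labelings[0])
    agree = {}
    for i, j in combinations(range(n_docs), 2):
        same = sum(1 for labels in labelings if labels[i] == labels[j])
        agree[(i, j)] = same / len(labelings)
    return agree


# Made-up post, for illustration only.
post = (
    "<p>How do I read a CSV file?</p>"
    "<pre><code>import pandas as pd\npd.read_csv('a.csv')</code></pre>"
)
bag = joint_bag(post)

# Hypothetical cluster labels from three different topic models for four posts.
labelings = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
agreement = co_association(labelings)
```

In this sketch the joint bags would feed whatever topic models are chosen, and pairs with high agreement across models could be treated as reliably co-clustered. Raising `code_weight` is one simple way to let library identifiers in snippets count more than prose words.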