A Software Data Clustering Method Combining Code Snippets and Hybrid Topic Models

doi:10.11896/jsjkx.230300091

Computer Science, 2024, Vol. 51, Issue (6): 44-51. doi: 10.11896/jsjkx.230300091

• Computer Software •

Software Data Clustering Method Combining Code Snippets and Hybrid Topic Models

WEI Linlin^1,2, SHEN Guohua^1,2,3, HUANG Zhiqiu^1,2,3, CAI Mengnan^1,2, GUO Feifei^1,2

1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2 Key Laboratory of Safety-Critical Software, Ministry of Industry and Information Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
3 Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093, China

Received: 2023-03-11 Revised: 2023-07-13 Online: 2024-06-15 Published: 2024-06-05
About author: WEI Linlin, born in 1998, postgraduate. His main research interests include software engineering and text mining.
SHEN Guohua, born in 1976, Ph.D, is a senior member of CCF (No. E200018171S). His main research interests include software engineering, software metrics, semantic web and description logic.
Supported by: National Natural Science Foundation of China (61772270, U2241216) and Open Foundation of Civil Aviation Emergency Science and Technology Key Laboratory (NJ2022022).

Abstract

Abstract: Clustering documents with topic models is common practice in many text mining tasks. Many studies apply topic models to data from software websites in order to analyze how communities in different fields develop. However, because such software-related data often contain code snippets and the text lengths are unevenly distributed, a traditional single topic model tends to produce unstable clustering results on this kind of text. This paper proposes a clustering method that combines code snippets with hybrid topic models, and uses Stack Overflow as the data source to construct a Python third-party libraries dataset with the top 60 questions on the platform. After analysis, the dataset is finally divided into six areas: network security, data analysis, artificial intelligence, text processing, software development, and system terminal. Experimental results show that, under both automatic and manual evaluation metrics, topic modeling on code snippets combined with text yields a high-quality partition of clusters, while combining multiple models can improve the stability and accuracy of the clustering results to a certain extent.
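As a rough illustration of what "code snippets combined with text" could mean at the preprocessing stage, the sketch below splits a Stack Overflow-style HTML post body into prose and code, then builds one token list from both. It is not the authors' pipeline: the names `PostSplitter` and `post_to_tokens`, the `code_` prefix, and the choice to keep only imported library names from the code are all hypothetical choices made for this example. It relies only on the fact that Stack Overflow post bodies mark code with `<code>` elements.

```python
import re
from html.parser import HTMLParser


class PostSplitter(HTMLParser):
    """Collect text inside <code> elements separately from the prose."""

    def __init__(self):
        super().__init__()
        self.in_code = 0   # nesting depth of <code> elements
        self.prose = []
        self.code = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.in_code += 1

    def handle_endtag(self, tag):
        if tag == "code" and self.in_code:
            self.in_code -= 1

    def handle_data(self, data):
        (self.code if self.in_code else self.prose).append(data)


# Hypothetical heuristics: library names come from import statements,
# and prose tokens are words of at least two characters.
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE)
TOKEN_RE = re.compile(r"[A-Za-z_]\w+")


def post_to_tokens(html_body):
    """Return prose tokens followed by prefixed library tokens from code."""
    splitter = PostSplitter()
    splitter.feed(html_body)
    splitter.close()
    prose = " ".join(splitter.prose)
    code = "\n".join(splitter.code)
    text_tokens = [t.lower() for t in TOKEN_RE.findall(prose)]
    # The "code_" prefix (an assumption) keeps code-derived tokens distinct
    # from ordinary words when both feed the same topic model vocabulary.
    code_tokens = ["code_" + name.lower() for name in IMPORT_RE.findall(code)]
    return text_tokens + code_tokens


if __name__ == "__main__":
    body = (
        "<p>How do I read a CSV file?</p>"
        "<pre><code>import pandas as pd\n"
        "df = pd.read_csv('a.csv')\n</code></pre>"
    )
    print(post_to_tokens(body))
```

The resulting token lists could then be passed to any bag-of-words topic model; which models the paper combines, and how, is not stated in the abstract, so that step is left out here.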

Key words: Code snippets, Topic model, Stack Overflow, Python, Cluster

CLC Number: TP311

WEI Linlin, SHEN Guohua, HUANG Zhiqiu, CAI Mengnan, GUO Feifei. Software Data Clustering Method Combining Code Snippets and Hybrid Topic Models [J]. Computer Science, 2024, 51(6): 44-51.

