Computer Science, 2018, Vol. 45, Issue (4): 208-214. doi: 10.11896/j.issn.1002-137X.2018.04.035


Hot Topic Discovery Research of Stack Overflow Programming Website Based on CBOW-LDA Topic Model

ZHANG Jing and ZHU Guo-bin   

  • Online: 2018-04-15 Published: 2018-05-11

Abstract: Stack Overflow is a popular programming question-and-answer (Q&A) website, and semantic mining of its question text reveals the programming knowledge that developers currently focus on. Detecting topics in a large number of short social-network texts is difficult for two reasons: high dimensionality hinders processing efficiency, and the topic distribution makes topics unclear. To overcome these problems, this paper proposes CBOW-LDA, a topic detection method based on the LDA (Latent Dirichlet Allocation) model. The method applies the CBOW language model to the target text and clusters similar words by vector similarity before topic detection, which reduces the dimensionality of the LDA output and makes the topics clearer. Topic perplexity was analyzed on experimental datasets of different sizes drawn from Stack Overflow posts from 2010 to 2015. Topics detected by CBOW-LDA have lower perplexity than those detected by TF-LDA, a baseline that uses vectors weighted by word frequency. With the same number of topic words drawn from the corpus, perplexity is reduced by about 4.87%, which indicates that the CBOW-LDA model performs better. For hot topic discovery on Stack Overflow, CBOW-LDA was again compared with TF-LDA: a manually annotated standard evaluation set was built, and Recall, Precision and F1 were used to compare the results. CBOW-LDA scores better than TF-LDA on every measure, which confirms that it is more effective for hot topic mining. The experiment identified the hot topics and hot words of nearly six years: “Java” is the hottest topic on the website, and “JavaScript” and “C” are the words users mention most often in their questions.
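The preprocessing idea in the abstract, merging similar words by vector similarity before topic detection, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors, the greedy clustering rule, the 0.9 cosine threshold, and the names `cluster_words` and `reduce_vocabulary` are all assumptions made for illustration. In practice the vectors would come from a trained CBOW model, and the reduced documents would then be passed to an LDA implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_words(vectors, threshold=0.9):
    """Greedy clustering (an assumed rule, not taken from the paper):
    each word joins the first existing representative whose cosine
    similarity reaches the threshold, otherwise it starts a new cluster."""
    representatives = []
    mapping = {}
    for word in sorted(vectors):
        for rep in representatives:
            if cosine(vectors[word], vectors[rep]) >= threshold:
                mapping[word] = rep
                break
        else:
            representatives.append(word)
            mapping[word] = word
    return mapping

def reduce_vocabulary(docs, mapping):
    """Replace every token with its cluster representative."""
    return [[mapping.get(tok, tok) for tok in doc] for doc in docs]

# Hypothetical 3-dimensional "word vectors"; real CBOW vectors are learned
# from the corpus and have far more dimensions.
toy_vectors = {
    "java": [0.9, 0.1, 0.0],
    "jvm": [0.85, 0.15, 0.05],
    "javascript": [0.1, 0.9, 0.1],
    "js": [0.12, 0.88, 0.08],
    "pointer": [0.0, 0.1, 0.95],
}
toy_docs = [["java", "jvm", "pointer"], ["js", "javascript", "java"]]

mapping = cluster_words(toy_vectors)
reduced_docs = reduce_vocabulary(toy_docs, mapping)
vocab_before = {tok for doc in toy_docs for tok in doc}
vocab_after = {tok for doc in reduced_docs for tok in doc}
print(len(vocab_before), "->", len(vocab_after))
```

With these toy inputs, "jvm" is merged into "java" and "js" into "javascript", so the vocabulary that a downstream topic model has to handle shrinks. This is the dimensionality reduction the abstract refers to, shown here only at toy scale.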

Key words: Stack Overflow, LDA-CBOW language model, Topic detection, Hot topic, Perplexity

