Computer Science, 2017, Vol. 44, Issue (2): 257-261, 274. doi: 10.11896/j.issn.1002-137X.2017.02.042

Micro-blog Topic Detection Method Integrating BTM Topic Model and K-means Clustering

LI Wei-jiang, WANG Zhen-zhen and YU Zheng-tao   

  • Online:2018-11-13 Published:2018-11-13

Abstract: The rapid growth of micro-blogging has made communication more convenient. Because each micro-blog post is limited to 140 characters, it has also produced short texts on a large scale, and discovering topics in such texts has become a difficult problem. Traditional topic models, such as probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA), model short texts poorly because they suffer from severe data sparsity. The K-means clustering algorithm, in turn, yields discriminative topics when the data are dense and the documents of different topics are clearly distinct. To alleviate data sparsity, this paper uses the Biterm Topic Model (BTM) to model micro-blog short texts, and then integrates K-means clustering with BTM to refine topic discovery. Experiments on Sina micro-blog short-text collections show that the method discovers topics effectively.

Key words: Short text, Topic model, Topic discovery, K-means clustering
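
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a biterm is an unordered pair of words co-occurring in one short text, and that K-means is run on per-document topic vectors. The functions `extract_biterms` and `kmeans` and the toy `doc_topic` vectors are hypothetical names and data introduced only for this illustration. BTM parameter inference itself is not shown.

```python
from itertools import combinations
import math


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short document."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical document-topic vectors, standing in for the per-document
# topic proportions that a trained BTM would provide (assumed, not real data).
doc_topic = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
]
labels, _ = kmeans(doc_topic, k=2)
```

Seeding the centroids with the first k vectors keeps the sketch deterministic; a practical run would more likely use random or k-means++ initialization and real topic proportions inferred by BTM.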

