Computer Science (计算机科学) ›› 2017, Vol. 44 ›› Issue (2): 257-261. doi: 10.11896/j.issn.1002-137X.2017.02.042
李卫疆,王真真,余正涛
LI Wei-jiang, WANG Zhen-zhen and YU Zheng-tao
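Abstract: In recent years, the growth of social networks such as microblogs has made communication more convenient. Because each microblog post is limited to 140 characters, a large volume of short text is produced. Discovering topics from short text has become an increasingly important task. Traditional topic models, such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), suffer from severe data sparsity when applied to short text. In addition, when the dataset is relatively concentrated and the topic documents differ clearly from one another, the K-means clustering algorithm can produce well-separated topics. This paper introduces the biterm topic model (BTM) to handle short texts such as microblog data and thereby alleviate the data sparsity problem. It also integrates the K-means clustering algorithm to cluster the topics discovered by BTM. Experiments on a Sina Weibo short-text collection demonstrate the effectiveness of the method for topic discovery.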
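The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy tokens, and the toy vectors are all hypothetical. The first function shows what a biterm is (an unordered pair of words co-occurring in one short text), which is the unit BTM models instead of whole documents. The second is a plain K-means loop applied to vectors that stand in for topic distributions a trained BTM might produce; BTM inference itself is not shown.

```python
from itertools import combinations
import random


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-means: returns (labels, centroids).

    Initial centroids are k vectors sampled from the input (an assumption
    of this sketch; the paper's initialization is not stated in the abstract).
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


# Hypothetical tokens from one microblog post.
biterms = extract_biterms(["apple", "phone", "release"])

# Hypothetical 2-topic distributions standing in for BTM output.
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centroids = kmeans(topic_vectors, k=2)
```

Because a three-word post already yields three biterms, counting word pairs over the whole corpus gives the model more co-occurrence evidence than per-document word counts, which is the intuition behind using BTM against sparsity.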