Computer Science, 2017, Vol. 44, Issue (2): 257-261. doi: 10.11896/j.issn.1002-137X.2017.02.042

• Artificial Intelligence •

Micro-blog Topic Detection Based on BTM and K-means

LI Wei-jiang (李卫疆), WANG Zhen-zhen (王真真), YU Zheng-tao (余正涛)

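  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500 (all three authors)
  • Online: 2018-11-13  Published: 2018-11-13
  • Funding:
    This work was supported by the Regional Science Fund project "Research on Query Expansion Based on Statistical Machine Translation and Automatic Summarization" (61363045), the Key Project of the Natural Science Foundation of Yunnan Province (2013FA130), and the Ministry of Science and Technology Program for Leading Young and Middle-aged Talents in Science and Technology Innovation (2014HE001).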

Micro-blog Topic Detection Method Integrating BTM Topic Model and K-means Clustering

LI Wei-jiang, WANG Zhen-zhen and YU Zheng-tao   


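Abstract (translated from Chinese): In recent years, the growth of micro-blogs and other social networks has made communication more convenient. Because each micro-blog post is limited to 140 characters, a large volume of short texts is produced, and discovering topics from short texts is becoming an increasingly important problem. Traditional topic models, such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), face severe data sparsity when applied to short texts. In addition, when the dataset is fairly concentrated and the differences between topic documents are clear, the K-means clustering algorithm can produce well-separated topics. This paper introduces the BTM topic model to handle short texts such as micro-blog data and thereby alleviate data sparsity. It also integrates the K-means clustering algorithm to cluster the topics discovered by BTM. Experiments on a Sina micro-blog short-text collection demonstrate that the method discovers topics effectively.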

Key words (translated from Chinese): short text, topic model, topic discovery, K-means clustering
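To make the data-sparsity point concrete, the following is a minimal sketch of the biterm extraction step that BTM builds on: instead of modeling word occurrences per document, every unordered pair of words co-occurring in one short text is pooled across the whole corpus. The function names (`extract_biterms`, `corpus_biterm_counts`) and the toy tokens are hypothetical and chosen for illustration; this is not the authors' implementation, and it omits the inference step of BTM entirely.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text.

    Pairs are sorted so that (a, b) and (b, a) count as the same biterm.
    Assumption: the whole post is treated as a single context window.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Pool biterms over all documents into one corpus-level counter."""
    counts = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts


# Toy, already-tokenized posts (hypothetical data).
docs = [
    ["weibo", "topic", "detection"],
    ["topic", "model", "weibo"],
]
counts = corpus_biterm_counts(docs)
print(counts[("topic", "weibo")])  # the pair co-occurs in both posts
```

Pooling pairs at the corpus level is what lets co-occurrence statistics accumulate even though each individual post contains only a handful of words.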

Abstract: In recent years, the development of micro-blogs has made communication more convenient. Because each micro-blog post is limited to 140 characters, short texts appear on a large scale, and discovering topics from them has become an increasingly important problem. Traditional topic models, such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), suffer from severe data sparsity when modeling short texts. Moreover, the K-means clustering algorithm can yield discriminative topics when the dataset is concentrated and the differences between topic documents are distinct. To alleviate data sparsity, this paper employs the biterm topic model (BTM) to process short micro-blog texts. K-means clustering is then integrated with BTM to cluster the discovered topics. Experimental results on Sina micro-blog short-text collections demonstrate that the method discovers topics effectively.

Key words: Short text, Topic model, Topic discovery, K-means clustering
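The clustering stage mentioned in the abstract can be sketched with a plain K-means loop. The example below assumes that each item to be clustered is a topic-probability vector produced by a topic model; that input representation, the function names, the seed, and the toy vectors are all assumptions for illustration, since the abstract does not specify these details.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iterations=100, seed=0):
    """Basic K-means: returns (labels, centroids).

    Initial centroids are k distinct input vectors drawn with a fixed
    seed, so repeated runs give the same result.
    """
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return labels, centroids


# Hypothetical topic-probability vectors for four posts.
topic_vectors = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.05, 0.90, 0.05],
]
labels, centroids = kmeans(topic_vectors, k=2)
print(labels)
```

On this toy input the first two vectors end up in one cluster and the last two in the other, which matches the abstract's remark that K-means separates topics well when the groups are clearly distinct.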

