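Sliding-window Based Topic Modeling

doi:10.11896/j.issn.1002-137X.2016.12.018

Computer Science, 2016, Vol. 43, Issue (12): 101-107.

CHANG Dong-ya, YAN Jian-feng, YANG Lu and LIU Xiao-sheng

School of Computer Science and Technology, Soochow University, Suzhou 215006
School of Computer Science and Technology, Soochow University, Suzhou 215006; School of Creative Media, City University of Hong Kong, Hong Kong 999077
School of Computer Science and Technology, Soochow University, Suzhou 215006
School of Computer Science and Technology, Soochow University, Suzhou 215006

Online: 2018-12-01 Published: 2018-12-01
Funding:
This work was supported by the National Natural Science Foundation of China (61373092, 61572339, 61272449) and the Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005).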

Abstract

Abstract: Latent Dirichlet Allocation (LDA) is a hierarchical probabilistic topic model that is widely used in text mining. The model considers neither the order of documents in a collection nor the order of words within a document. This simplifies the problem, but it also leaves room for improvement. To address this, a sliding-window based topic model is proposed. Its basic idea is that the more closely the topic of a word is related to the topics of the words near it, the more strongly it is influenced by those neighboring topics. According to the window size and the sliding step, each document is cut into finer-grained segments. In addition, an online sliding-window topic model is proposed for large datasets and data streams. Experiments on four datasets show that models trained with the sliding-window based topic model achieve better generalization performance and accuracy.

Key words: Latent Dirichlet allocation, Topic model, Sliding window
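
The abstract states only that documents are cut into smaller pieces according to the window size and the sliding step; it does not give the exact procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation. The function name `sliding_window_segments`, the parameter names `window_size` and `step`, and the handling of the final, possibly shorter, segment are all assumptions made for illustration.

```python
from typing import List, Sequence


def sliding_window_segments(
    tokens: Sequence[str], window_size: int, step: int
) -> List[List[str]]:
    """Cut a tokenized document into overlapping segments.

    Hypothetical helper: a window of `window_size` tokens is moved
    forward by `step` tokens at a time. The last segment may be
    shorter than `window_size` (an assumption, not from the paper).
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")

    segments: List[List[str]] = []
    start = 0
    while start < len(tokens):
        # Take up to window_size tokens starting at the current offset.
        segments.append(list(tokens[start:start + window_size]))
        # Stop once the window has reached the end of the document.
        if start + window_size >= len(tokens):
            break
        start += step
    return segments


# Example: 7 tokens, window of 4, step of 2.
doc = "a b c d e f g".split()
for segment in sliding_window_segments(doc, window_size=4, step=2):
    print(segment)
```

With `step` smaller than `window_size`, neighboring segments share words, so each word appears together with its nearby context in more than one segment. A `step` larger than `window_size` would skip tokens, so in this sketch it is up to the caller to avoid that setting.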

CHANG Dong-ya, YAN Jian-feng, YANG Lu and LIU Xiao-sheng. Sliding-window Based Topic Modeling[J]. Computer Science, 2016, 43(12): 101-107. https://doi.org/10.11896/j.issn.1002-137X.2016.12.018

