Computer Science, 2016, Vol. 43, Issue (12): 101-107. doi: 10.11896/j.issn.1002-137X.2016.12.018

• Machine Learning •

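Sliding-window Based Topic Modeling

CHANG Dong-ya, YAN Jian-feng, YANG Lu and LIU Xiao-sheng

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006, School of Computer Science and Technology, Soochow University, Suzhou 215006; School of Creative Media, City University of Hong Kong, Hong Kong 999077, School of Computer Science and Technology, Soochow University, Suzhou 215006, School of Computer Science and Technology, Soochow University, Suzhou 215006
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61373092, 61572339, 61272449) and the Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005).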

Abstract: Latent Dirichlet allocation (LDA) is a hierarchical probabilistic topic model that is widely used in text mining. The model considers neither the order of documents in a corpus nor the order of words within a document. This simplifies the problem, but it also leaves room for improvement. To address this, a sliding-window based topic model is proposed. Its basic idea is that the topic of a word in a document is closely related to the topics of the words near it: the closer the relationship, the more the word's topic is influenced by the topics of its neighbors. Each document is cut into finer-grained segments according to the window size and the sliding step. In addition, an online sliding-window topic model is proposed for large datasets and data streams. Experiments on four common datasets show that models trained with the sliding-window based topic model achieve better generalization performance and accuracy.

Key words: Latent Dirichlet allocation, Topic model, Sliding window
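
The segmentation step described in the abstract, cutting a document into smaller pieces according to a window size and a sliding step, can be illustrated with a minimal sketch. The abstract gives no details of the procedure, so the function name `sliding_segments`, its parameters `window` and `step`, and the handling of trailing tokens are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def sliding_segments(tokens: Sequence[str], window: int, step: int) -> List[List[str]]:
    """Cut a tokenized document into windows of `window` tokens, advancing `step` tokens each time.

    Hypothetical helper: the abstract does not specify boundary handling,
    so the last segment is simply allowed to be shorter than `window`.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    segments: List[List[str]] = []
    start = 0
    while start < len(tokens):
        segments.append(list(tokens[start:start + window]))
        # Stop once a window has reached the end of the document.
        if start + window >= len(tokens):
            break
        start += step
    return segments


doc = "topic models ignore word order inside a document".split()
for seg in sliding_segments(doc, window=4, step=2):
    print(seg)
```

With `step` smaller than `window`, neighboring segments overlap, so each word shares a segment with its nearby words. With `step` equal to `window`, the segments are disjoint.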

