Computer Science ›› 2016, Vol. 43 ›› Issue (12): 120-124, 134. doi: 10.11896/j.issn.1002-137X.2016.12.021

• Machine Learning •

Online LDA on Dynamic Vocabulary

ZHANG Jian-wei, YAN Jian-feng, LIU Xiao-sheng and YANG Lu

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61373092, 61572339, 61272449) and the Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005).

Abstract: Most existing online latent Dirichlet allocation (LDA) algorithms rely on a fixed vocabulary. In practice the vocabulary often fails to match the corpus being processed, which limits the practical usefulness of the model. To address this problem, we let the topic-word distribution follow a Dirichlet process (DP) and re-derive the model within the framework of the belief propagation (BP) algorithm, so that the vocabulary is empty before the algorithm runs and newly discovered words are added to it continually during processing. Experiments show that the new dynamic-vocabulary algorithm yields a vocabulary that matches the corpus more closely, and that it outperforms fixed-vocabulary online LDA algorithms on both perplexity and pointwise mutual information (PMI).

Key words: Latent Dirichlet allocation, Dynamic vocabulary, Dirichlet process, Stream processing
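
The idea of a vocabulary that starts empty and grows while documents stream in can be illustrated with a minimal Python sketch. It covers only the bookkeeping side (assigning an index to each word the first time it is seen) and does not reproduce the paper's BP updates or its Dirichlet process derivation. The names `DynamicVocabulary`, `get_or_add`, and `encode_document` are hypothetical and chosen for illustration only.

```python
class DynamicVocabulary:
    """Illustrative word-to-index map that starts empty and grows on demand.

    Hypothetical helper, not the data structure used in the paper.
    """

    def __init__(self):
        self.word_to_id = {}   # word -> integer index
        self.id_to_word = []   # integer index -> word

    def get_or_add(self, word):
        """Return the index of `word`, registering it first if it is new."""
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.id_to_word)
            self.id_to_word.append(word)
        return self.word_to_id[word]

    def __len__(self):
        return len(self.id_to_word)


def encode_document(tokens, vocab):
    """Map a tokenized document to word indices, growing `vocab` as needed."""
    return [vocab.get_or_add(token) for token in tokens]


if __name__ == "__main__":
    # A toy stream of two tokenized documents (made-up data).
    stream = [["topic", "model", "topic"], ["model", "stream"]]
    vocab = DynamicVocabulary()
    for doc in stream:
        print(encode_document(doc, vocab), "vocabulary size:", len(vocab))
```

In a fixed-vocabulary setting, a word such as "stream" that is missing from the predefined table would have to be dropped or mapped to a placeholder. Here it simply receives the next free index, which is the behavior the abstract describes at a high level.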

