Computer Science ›› 2016, Vol. 43 ›› Issue (12): 101-107. doi: 10.11896/j.issn.1002-137X.2016.12.018
CHANG Dong-ya, YAN Jian-feng, YANG Lu and LIU Xiao-sheng
Abstract: LDA (Latent Dirichlet Allocation) is a hierarchical probabilistic topic model that is now widely used in text mining. The model assumes no ordering among documents and no ordering among the words within a document; this assumption simplifies the problem, but it also leaves room for improvement. To address it, a sliding-window topic model is proposed, based on the idea that the more closely a word's topic is related to the topics of its neighboring words, the more strongly it is influenced by them. According to the window size and the sliding stride, each document is cut into finer-grained segments. In addition, an online sliding-window topic model is proposed for large data sets and data streams. Experiments on four data sets show that models trained with the sliding-window topic model achieve better generalization performance and accuracy.
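The segmentation step described above (cutting a document into finer-grained pieces according to the window size and the sliding stride) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and parameter defaults are assumptions:

```python
def sliding_window_segments(tokens, window=10, stride=5):
    """Split a token sequence into (possibly overlapping) segments.

    tokens: list of word tokens from one document
    window: number of tokens per segment
    stride: how far the window slides between segments
    """
    # Ensure at least one segment even for documents shorter than the window.
    last_start = max(len(tokens) - window + 1, 1)
    return [tokens[start:start + window] for start in range(0, last_start, stride)]


# Example: a 12-token "document" with window 5 and stride 3
doc = ["w%d" % i for i in range(12)]
segments = sliding_window_segments(doc, window=5, stride=3)
# Each segment shares window - stride tokens with its neighbor,
# so a word's topic can be influenced by nearby words in both segments.
```

With stride smaller than the window, adjacent segments overlap, which is what lets a word's topic assignment be tied to the topics of its neighbors on both sides.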