计算机科学 ›› 2015, Vol. 42 ›› Issue (8): 279-282.

• Artificial Intelligence •

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006, China
  • Funding:
    Supported by the National Natural Science Foundation of China (61373092, 61033013, 61272449, 61202029), the Major Project of the Jiangsu Provincial Department of Education (12KJA520004), the Key Project of the Jiangsu Science and Technology Support Program (BE2014005), and the Open Project of a Guangdong Provincial Key Laboratory (SZU-GDPHPCL-2012-09)

Study of Semantic Understanding by LDA

GAO Yang, YANG Lu, LIU Xiao-sheng and YAN Jian-feng   

  • Online:2018-11-14 Published:2018-11-14


Abstract: Latent Dirichlet allocation (LDA) is a popular model for text clustering, and effective interpretation of queries and documents has been shown to improve the performance of information retrieval. Gibbs sampling and belief propagation are the two main approximate inference algorithms for the LDA model. This paper compares the effect of these two inference algorithms on information retrieval at different topic scales, and compares two ways of interpreting queries and documents with LDA: representing them by their document-topic distributions, and representing them by their word reconstructions. Experimental results show that the document-topic interpretation, combined with Gibbs sampling inference, can effectively improve the performance of information retrieval.
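The document-topic interpretation described above can be sketched as follows: queries and documents are replaced by their LDA topic distributions, and retrieval ranks documents by similarity in topic space. This is a minimal illustration, not the paper's experimental setup; it uses scikit-learn, whose LDA is fit by variational inference rather than the Gibbs sampling or belief propagation compared in the paper, and the corpus and query are toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus and query (illustrative only).
docs = [
    "topic models cluster text documents",
    "gibbs sampling infers latent topics",
    "stock markets and trading prices",
    "prices rise in financial markets",
]
query = ["latent topic models for text"]

# Bag-of-words counts over a shared vocabulary.
vec = CountVectorizer()
X = vec.fit_transform(docs)
Q = vec.transform(query)

# Fit LDA and map documents and query into the K-dimensional topic simplex.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # shape (n_docs, K), rows sum to 1
query_topics = lda.transform(Q)     # shape (1, K)

# Rank documents by cosine similarity between topic distributions,
# in place of term matching on the raw query and documents.
sims = (doc_topics @ query_topics.T).ravel() / (
    np.linalg.norm(doc_topics, axis=1) * np.linalg.norm(query_topics)
)
ranking = np.argsort(-sims)
print(ranking)
```

The word-reconstruction interpretation would instead map the topic distributions back through the topic-word matrix (`lda.components_`) to obtain smoothed word distributions before matching.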

Key words: Latent Dirichlet allocation, Information retrieval, Approximate inference, Textual interpretation

