Text Similarity Computing Based on Topic Model LDA

Computer Science, 2013, Vol. 40, Issue (12): 229-232.

WANG Zhen-zhen, HE Ming, DU Yong-ping

College of Computer Science, Beijing University of Technology, Beijing 100124

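Online: 2018-11-16 Published: 2018-11-16
Funding:
This work was supported by the National Natural Science Foundation of China (60803086), the Beijing Natural Science Foundation (4123091), and the Scientific Research Program of the Beijing Municipal Education Commission (KM20110005013, KM200910005009).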

Abstract

Latent Dirichlet Allocation (LDA) is an unsupervised learning model, proposed in recent years, that is capable of representing text. This paper proposes a text similarity calculation method based on the LDA topic model. The method models the corpus with LDA and performs inference with Gibbs sampling, a Markov chain Monte Carlo (MCMC) technique, to estimate the model parameters indirectly. It mines the hidden relationships between topics and words in the texts, obtains the topic distribution of each text, and uses these distributions to compute the similarity between texts. Finally, clustering experiments are carried out on the text similarity matrix to assess the clustering quality. Experimental results show that the method significantly improves both the accuracy of text similarity calculation and the quality of text clustering.

Key words: topic model, Latent Dirichlet Allocation (LDA), text similarity, Gibbs sampling
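
The pipeline summarized in the abstract (fit LDA by Gibbs sampling, read off each text's topic distribution, compare distributions) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state which similarity measure is used, so the Jensen-Shannon divergence below is an assumption, as are the function names, the hyperparameters `alpha` and `beta`, the number of topics, the iteration count, and the toy corpus of integer word ids.

```python
import math
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic distributions.

    docs is a list of documents, each a list of integer word ids in [0, vocab_size).
    alpha, beta, iters and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens in doc d with topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word w assigned to topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k
    assign = []                                                 # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised full conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's topic distribution from the final counts.
    return [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms (assumed measure)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity_matrix(thetas):
    """Pairwise similarity, defined here as 1 - JSD between topic distributions."""
    return [[1.0 - js_divergence(p, q) for q in thetas] for p in thetas]


# Hypothetical toy corpus: two documents over words 0-2, two over words 3-5.
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 3, 5]]
thetas = lda_gibbs(docs, num_topics=2, vocab_size=6)
sim = similarity_matrix(thetas)
```

The resulting matrix `sim` is symmetric with ones on the diagonal, so it could be passed to a clustering step of the kind the abstract mentions. With base-2 logarithms the divergence is bounded by 1, which keeps every similarity value between 0 and 1.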

WANG Zhen-zhen, HE Ming, DU Yong-ping. Text Similarity Computing Based on Topic Model LDA[J]. Computer Science, 2013, 40(12): 229-232. https://doi.org/
