Automatic Text Summarization Research Based on Topic Model and Information Entropy

Computer Science (计算机科学), 2014, Vol. 41, Issue (Z11): 298-300.

LI Ran, ZHANG Hua-ping, ZHAO Yan-ping, SHANG Jian-yun

School of Computer Science, Beijing Institute of Technology, Beijing 100081; School of Management and Economics, Beijing Institute of Technology, Beijing 100081; School of Software, Beijing Institute of Technology, Beijing 100081

Online: 2018-11-14 Published: 2018-11-14

Abstract

Abstract: This paper presents an automatic summarization method for Chinese documents based on the LDA model and information entropy. The LDA model is used to perform shallow semantic analysis of a document, yielding the document's topic distribution and the word distribution under each topic. Analyzing the topics identifies the one that best represents the document's central idea, together with the word distribution under that topic. The paper also proposes a new information-entropy-based measure of sentence importance and applies it to key-sentence extraction: the occurrence of a sentence in the document is treated as a random variable, and the variable is modeled and its information entropy measured in order to select the document's key sentences. Experimental results show that summaries extracted with the topic model and information entropy effectively pick out the central sentences of a document.

Key words: Summarization, LDA, Topic, Information entropy
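
The two steps described in the abstract (choosing the topic that best represents the document, then ranking sentences with an entropy-based measure) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the authors' method: the abstract does not give the actual formula, so the scoring rule here (summing -p(w) log2 p(w) over the distinct words of a sentence, with p(w) taken from the dominant topic's word distribution) is an assumed stand-in. The function names `dominant_topic`, `sentence_entropy`, and `extract_key_sentences` are hypothetical, the topic and word distributions are toy values rather than the output of a trained LDA model, and sentences are assumed to be already segmented into words.

```python
import math


def dominant_topic(doc_topic):
    """Return the index of the most probable topic in a document-topic distribution."""
    return max(range(len(doc_topic)), key=lambda k: doc_topic[k])


def sentence_entropy(tokens, word_probs):
    """Entropy-style sentence score (assumed rule, not taken from the paper).

    Sums -p(w) * log2 p(w) over the distinct words of the sentence, where
    p(w) is the word's probability under the chosen topic. Words that the
    topic does not cover contribute nothing.
    """
    score = 0.0
    for word in set(tokens):
        p = word_probs.get(word, 0.0)
        if p > 0.0:
            score -= p * math.log2(p)
    return score


def extract_key_sentences(sentences, doc_topic, topic_word, n=1):
    """Return the indices of the n highest-scoring sentences, in document order."""
    topic = dominant_topic(doc_topic)
    scores = [sentence_entropy(s, topic_word[topic]) for s in sentences]
    # sorted() is stable, so tied sentences keep their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return sorted(ranked[:n])


# Toy inputs standing in for the output of a trained LDA model (hypothetical values).
doc_topic = [0.2, 0.8]
topic_word = [
    {"cat": 0.5, "dog": 0.5},
    {"entropy": 0.25, "topic": 0.25, "model": 0.25, "summary": 0.25},
]
sentences = [
    ["the", "weather", "is", "nice"],
    ["topic", "model", "entropy"],
    ["model"],
]

key_sentences = extract_key_sentences(sentences, doc_topic, topic_word, n=1)
print(key_sentences)
```

Scoring distinct words rather than every token keeps a sentence from being ranked higher merely because it repeats one topical word; whether the paper makes the same choice is not stated in the abstract.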

LI Ran, ZHANG Hua-ping, ZHAO Yan-ping, SHANG Jian-yun. Automatic Text Summarization Research Based on Topic Model and Information Entropy[J]. Computer Science, 2014, 41(Z11): 298-300. https://doi.org/

