Research on Automatic Summarization of Chinese Documents Based on Topic Model and Information Entropy

Abstract

This paper presents a method for automatic summarization of Chinese documents based on the LDA model and information entropy. The LDA model is used to perform shallow semantic analysis of the documents and to obtain the topic distribution of each document. By analyzing a document's topics, the topic that best expresses its central idea is identified. The paper also proposes a new method for computing sentence weights and extracting the most important sentence by measuring the information entropy of each sentence: each sentence is treated as a random variable, and its information entropy is calculated. Experimental results show that the method can pick out the most important sentence in a document.
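The abstract does not state the scoring formula, so the following is only an illustrative sketch of entropy-based sentence ranking, not the authors' method. It assumes sentences are already word-segmented, and it substitutes document-level relative word frequencies for the word distribution of the dominant LDA topic. The function names `word_probabilities`, `sentence_entropy`, and `top_sentence` are hypothetical.

```python
import math
from collections import Counter


def word_probabilities(sentences):
    """Estimate p(w) as relative word frequency over all sentences.

    Assumption: this stands in for the word distribution of the
    document's dominant LDA topic, which the abstract does not specify.
    """
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def sentence_entropy(sentence, probs):
    """Sum -p(w) * log2 p(w) over the distinct words of one sentence."""
    return -sum(
        probs[w] * math.log2(probs[w]) for w in set(sentence) if w in probs
    )


def top_sentence(sentences):
    """Return the index of the highest-scoring sentence and all scores."""
    probs = word_probabilities(sentences)
    scores = [sentence_entropy(s, probs) for s in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Toy input: each sentence is a list of already-segmented words.
    doc = [["a", "b"], ["a"], ["c", "d", "a"]]
    best, scores = top_sentence(doc)
    print(best, scores)
```

Under this reading, a sentence scores higher when it covers more distinct words that carry probability mass, which is one plausible way to treat a sentence as a random variable over words.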

Key words: Summarization, LDA, Topic, Information entropy

LI Ran, ZHANG Hua-ping, ZHAO Yan-ping and SHANG Jian-yun. Automatic Text Summarization Research Based on Topic Model and Information Entropy[J]. Computer Science, 2014, 41(Z11): 298-300.

