Computer Science, 2017, Vol. 44, Issue Z11: 432-436. doi: 10.11896/j.issn.1002-137X.2017.11A.092

• Big Data & Data Mining •

Technology of Extracting Topical Keyphrases from Chinese Corpora

YANG Yue, ZHANG De-sheng

School of Science, Xi'an University of Technology, Xi'an 710054, China

Online: 2018-12-01  Published: 2018-12-01


Abstract: In the era of big data, the volume of information is exploding, and text is the kind of information people encounter most: countless text documents are uploaded to or downloaded from the Internet every day. Keyword extraction is one of the most important techniques for quickly grasping the content of these texts. Traditional keyword-extraction algorithms, however, usually ignore two important aspects: the length of the extracted terms and the topics of the text. To address both issues, this paper proposes a technique for extracting topical keyphrases from Chinese text. It combines the LDA topic model with a frequent-phrase discovery algorithm to generate frequent candidate phrases of varying lengths; the candidates are then filtered and ranked with the proposed completeness filter and ranking function; finally, the topical keyphrases are selected according to the ranking.

Keywords: keyword extraction, LDA topic model, frequent phrases, completeness filter, ranking function


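The extraction pipeline described in the abstract can be sketched roughly as follows. This is only an illustrative sketch: the support threshold, the subsumption ratio in the completeness filter, and the frequency-times-length score are assumptions of this sketch, not the paper's actual parameters, and the LDA topic-weighting step is omitted entirely (a real implementation would weight candidates by their topic probabilities from a fitted LDA model).

```python
from collections import Counter

def frequent_ngrams(docs, max_len=4, min_support=2):
    """Count all n-grams up to max_len over tokenized docs and keep
    those that occur at least min_support times (the frequent
    candidate phrases of varying lengths)."""
    counts = Counter()
    for tokens in docs:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {g: c for g, c in counts.items() if c >= min_support}

def completeness_filter(phrases, ratio=0.8):
    """Drop a candidate when a longer frequent phrase contains it and
    accounts for most of its occurrences -- the shorter form is then
    treated as an incomplete fragment.  The ratio is an assumption."""
    kept = {}
    for g, c in phrases.items():
        subsumed = any(
            len(h) > len(g)
            and any(h[i:i + len(g)] == g for i in range(len(h) - len(g) + 1))
            and phrases[h] / c >= ratio
            for h in phrases
        )
        if not subsumed:
            kept[g] = c
    return kept

def rank(phrases):
    """Score = frequency x phrase length, a crude stand-in for the
    paper's topic-aware ranking function."""
    return sorted(phrases.items(),
                  key=lambda kv: kv[1] * len(kv[0]), reverse=True)

# Tiny usage example on pre-tokenized text.
docs = [
    ["latent", "dirichlet", "allocation"],
    ["latent", "dirichlet", "allocation", "model"],
    ["topic", "model"],
]
candidates = frequent_ngrams(docs)
ranked = rank(completeness_filter(candidates))
# The complete 3-gram outranks the fragments it subsumes.
```

Note that the filter keeps ("latent", "dirichlet", "allocation") while discarding its sub-phrases, which all occur only inside it; this is the kind of length-aware behavior the paper's completeness filter targets.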

