Computer Science, 2017, Vol. 44, Issue (Z11): 432-436. doi: 10.11896/j.issn.1002-137X.2017.11A.092


Technology of Extracting Topical Keyphrases from Chinese Corpora

YANG Yue and ZHANG De-sheng   

Online: 2018-12-01   Published: 2018-12-01

Abstract: In the era of big data, the volume of information is growing explosively. Text is the most common medium through which people communicate, and vast amounts of text are uploaded to and downloaded from the Internet every day. Keyword extraction is an important way to grasp the content of such large text collections quickly. However, traditional approaches to extracting keywords from text corpora neglect two issues: the length of the keywords and the topics of the corpus. This paper proposes a new algorithm that takes both into account. It combines the LDA topic model with a frequent phrase discovery algorithm to generate frequent candidate phrases of varying length, and it introduces a completeness filter and a rank function to filter and rank the candidates. Finally, the keyphrases are selected according to the ranked list.

Key words: Extracting keywords, LDA topic model, Frequent phrases, Completeness filter, Rank function
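
The abstract names three stages applied to the text of a topic: frequent phrase discovery, a completeness filter, and a rank function. The following is only a minimal sketch of how such a pipeline could look, not the authors' method. Everything in it is an assumption made for illustration: the function names, the `min_support`, `max_len` and `ratio` parameters, the filter criterion (a phrase is dropped when a longer frequent phrase containing it occurs almost as often), and the score (frequency times length). The input `docs` is assumed to be the token sequences that an LDA model has already assigned to a single topic; the LDA step itself is not shown.

```python
from collections import Counter


def frequent_phrases(docs, min_support=2, max_len=4):
    """Count contiguous n-grams of length 1..max_len and keep the frequent ones."""
    counts = Counter()
    for tokens in docs:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def is_subphrase(short, long):
    """True if `short` occurs contiguously inside the strictly longer `long`."""
    n = len(short)
    if n >= len(long):
        return False
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def completeness_filter(phrases, ratio=0.9):
    """Assumed criterion: drop a phrase that is nearly always part of a longer one."""
    kept = {}
    for p, c in phrases.items():
        absorbed = any(
            is_subphrase(p, q) and cq >= ratio * c for q, cq in phrases.items()
        )
        if not absorbed:
            kept[p] = c
    return kept


def rank(phrases):
    """Assumed score: frequency times phrase length, ties broken alphabetically."""
    return sorted(phrases, key=lambda p: (-phrases[p] * len(p), p))


# Hypothetical token sequences for one topic (toy data, not from the paper).
docs = [
    ["topic", "model", "extraction"],
    ["topic", "model", "ranking"],
    ["topic", "model"],
    ["data", "mining"],
    ["data", "mining"],
    ["data"],
    ["data"],
    ["data"],
]

candidates = frequent_phrases(docs)
complete = completeness_filter(candidates)
ranked = rank(complete)
print(ranked)
```

In this toy input, "topic" and "model" never occur apart, so the filter keeps only the two-word phrase, while "data" survives because it occurs far more often than "data mining". The filter threshold and the scoring function are the parts most likely to differ from the paper's definitions.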

