Keyword Extraction Algorithm Based on Length and Frequency of Words or Phrases for Short Chinese Texts

doi:10.11896/j.issn.1002-137X.2016.12.009

Abstract

Abstract: Keyword extraction from Chinese text is a difficult problem in natural language processing research. Most keyword extraction studies target English text, and their algorithms do not carry over to Chinese text. Existing keyword extraction algorithms for Chinese text are mostly suited to long texts. The focus of this paper is how to accurately extract, from a short passage of Chinese text, words or phrases that are meaningful and closely related to the topic of the passage. The paper proposes a keyword extraction algorithm for Chinese text based on the length and frequency of words or phrases. The algorithm first extracts the words or phrases that occur with high frequency in the text, then computes a weight for each of them from its length and its frequency in the text, and finally selects the keywords or key phrases according to these weights. The algorithm can accurately extract the relatively important words or phrases from Chinese text, which helps identify the topic of the passage quickly and accurately. Experimental results show that, compared with other existing keyword extraction algorithms, the proposed algorithm can process Chinese text and achieves higher accuracy.

Key words: Keyword extraction, Chinese text processing, Transliterated words, Internet new words
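The abstract describes the pipeline only in outline (frequent candidates, then a weight from length and frequency, then selection), so the following Python sketch is an illustration under stated assumptions rather than the authors' method. Candidates are generated as character n-grams, the weight is taken to be frequency multiplied by length, and the function name `extract_keywords` and the parameters `min_freq`, `max_len`, and `top_k` are hypothetical. The abstract specifies neither the candidate generation step nor the weight formula.

```python
import re
from collections import Counter


def extract_keywords(text, min_freq=2, max_len=6, top_k=5):
    """Return up to top_k (phrase, weight) pairs for a short Chinese text.

    Assumptions (not taken from the paper): candidates are character
    n-grams of length 2..max_len, and weight = frequency * length.
    """
    # Keep only runs of CJK characters so candidates never span punctuation.
    segments = re.findall(r"[\u4e00-\u9fff]+", text)

    # Count every character n-gram inside each segment.
    counts = Counter()
    for seg in segments:
        for n in range(2, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1

    # Step 1: keep candidates that occur often enough.
    frequent = {w: c for w, c in counts.items() if c >= min_freq}

    # Drop a candidate when a longer frequent candidate contains it and
    # occurs equally often; the shorter one is then just a fragment.
    maximal = {
        w: c
        for w, c in frequent.items()
        if not any(w != v and w in v and c == frequent[v] for v in frequent)
    }

    # Step 2: weight each candidate by frequency * length (assumed formula).
    scored = [(w, c * len(w)) for w, c in maximal.items()]

    # Step 3: select the highest-weighted candidates (ties broken by text).
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


sample = "关键词提取很重要。关键词提取算法很多。本文研究关键词提取。"
print(extract_keywords(sample))
```

Multiplying frequency by length is only one way to favor longer phrases over their fragments. A dictionary-free n-gram count is used here because it needs no word segmenter, which is convenient for transliterated words and new Internet words that a segmenter may not know.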

CHEN Wei-he (陈伟鹤), LIU Yun (刘云). Keyword Extraction Algorithm Based on Length and Frequency of Words or Phrases for Short Chinese Texts[J]. Computer Science (计算机科学), 2016, 43(12): 50-57. https://doi.org/10.11896/j.issn.1002-137X.2016.12.009

