Computer Science ›› 2016, Vol. 43 ›› Issue (12): 50-57. doi: 10.11896/j.issn.1002-137X.2016.12.009
陈伟鹤,刘云
CHEN Wei-he and LIU Yun
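Abstract: Keyword extraction from Chinese text is a difficult problem in natural language processing research. Most keyword extraction research, in China and abroad, is based on English text, and those methods do not carry over to Chinese text. Most existing keyword extraction algorithms for Chinese text are suited to long texts; the focus of this work is how to accurately extract, from a short Chinese text, words or phrases that are meaningful and closely related to the topic of that text. This paper proposes a keyword extraction algorithm for Chinese text based on the length and frequency of words or phrases. The algorithm first extracts the words or phrases that occur with high frequency in the text, then computes a weight for each from its length and its frequency in the text, and uses these weights to select the keywords or key phrases. The algorithm can accurately extract the relatively important words or phrases from a Chinese text, so that the topic of the text can be identified quickly and accurately. Experimental results show that, compared with other existing algorithms, the proposed algorithm is applicable to Chinese text and achieves higher accuracy.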
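The two stages described in the abstract (collect frequent words or phrases, then weight them by length and frequency) can be illustrated with a minimal sketch. The abstract does not give the weighting formula or the way candidates are generated, so everything specific here is an assumption: candidates are taken to be character n-grams within punctuation-delimited segments, the weight is assumed to be length × frequency, and the function name `extract_keywords` and its parameters `max_len`, `min_freq`, and `top_k` are hypothetical. This is not the authors' algorithm, only one plausible reading of it.

```python
from collections import Counter


def extract_keywords(text, max_len=5, min_freq=2, top_k=5):
    """Rank repeated character n-grams by an ASSUMED weight: length * frequency."""
    # Split on punctuation and whitespace so that no candidate spans a
    # clause boundary. str.isalnum() is True for CJK characters.
    segments, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))

    # Stage 1: count every substring of 2..max_len characters and keep
    # those that occur at least min_freq times.
    counts = Counter()
    for seg in segments:
        for n in range(2, max_len + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    frequent = {w: c for w, c in counts.items() if c >= min_freq}

    # Stage 2: weight each candidate. The formula is an assumption; the
    # abstract only says that length and frequency are both used.
    weighted = {w: len(w) * c for w, c in frequent.items()}

    # Stage 3: take candidates in descending weight order, skipping any
    # candidate that is a substring of an already selected one.
    ranked = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))
    selected = []
    for word, score in ranked:
        if any(word in kept for kept, _ in selected):
            continue
        selected.append((word, score))
        if len(selected) == top_k:
            break
    return selected


if __name__ == "__main__":
    sample = "关键词提取很重要,关键词提取算法用于文本,文本主题由关键词提取得到。"
    for word, score in extract_keywords(sample):
        print(word, score)
```

Multiplying length by frequency favours longer repeated phrases over their fragments, which is why the substring filter in the last stage is enough to suppress partial candidates in this sketch; a real system would likely also need a stop-word list and a dictionary or segmentation step.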