Keyphrase-based Chinese Tags Generation Hybrid Algorithm

Computer Science, 2014, Vol. 41, Issue Z6: 87-90.

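LIU Dong, ZHANG Cai-huan

School of Information Technology, Luoyang Normal University, Luoyang 471022; School of Mathematical Sciences, Luoyang Normal University, Luoyang 471022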

Online: 2018-11-14 Published: 2018-11-14
Funding:
This work was supported by the 2010 Henan Province Basic and Frontier Technology Research Program.

Abstract

Abstract: This paper studies tag generation for Chinese documents and proposes HTGA (Hybrid Tags Generation Algorithm), a hybrid algorithm for generating tags for Chinese documents. Given the advantages of phrases in expressing a document's topic, the algorithm first applies phrase pattern matching to extract phrase chunks as candidate keywords. It then uses the statistical properties of the candidates, computing a weight from three features (TF-IDF, word span, and position), and selects the highest-weighted words or phrases as tags. Experiments show that the algorithm improves keyword extraction accuracy, performing well on precision, and is stable across various texts. In a manual comparison of sampled results against the test set's standard answers, more than 60% of the generated tags expressed the document's topic as well as or better than the standard answers.

Keywords: keyword extraction, tag generation, keyphrase, Chinese tags, algorithm

CLC number: TP301.4 Document code: A
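
The two stages named in the abstract (phrase pattern matching, then feature-based weighting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract names the three features but gives no formulas, so the POS patterns, the feature definitions, the linear combination, and the weights below are all assumptions chosen for illustration, and `candidates` and `score_tags` are hypothetical names. The input is assumed to be already segmented and POS-tagged.

```python
import math

# Hypothetical POS patterns for candidate phrases (an assumption, not the paper's list):
# a single noun, noun + noun, or adjective + noun.
PATTERNS = {("n",), ("n", "n"), ("a", "n")}


def candidates(tagged):
    """Yield (phrase, start_index) for each token window whose POS tags match a pattern."""
    for size in (1, 2):
        for i in range(len(tagged) - size + 1):
            window = tagged[i:i + size]
            if tuple(pos for _, pos in window) in PATTERNS:
                yield "".join(word for word, _ in window), i


def score_tags(tagged, doc_freq, n_docs, weights=(0.6, 0.2, 0.2), top_k=3):
    """Rank candidates by an assumed linear mix of TF-IDF, word span, and position."""
    total = len(tagged)
    if total == 0:
        return []

    # Collect the start positions of every candidate phrase.
    starts = {}
    for phrase, i in candidates(tagged):
        starts.setdefault(phrase, []).append(i)

    # TF-IDF with a simple smoothed IDF; doc_freq maps phrase -> document count.
    tfidf = {
        p: (len(idx) / total) * math.log(n_docs / (1 + doc_freq.get(p, 0)))
        for p, idx in starts.items()
    }
    max_tfidf = max(tfidf.values(), default=0.0)
    if max_tfidf <= 0:
        max_tfidf = 1.0  # avoid dividing by zero or flipping signs

    w_tfidf, w_span, w_pos = weights
    scored = []
    for p, idx in starts.items():
        # Word span: distance between first and last occurrence, relative to length.
        span = (max(idx) - min(idx) + 1) / total
        # Position: earlier first occurrence scores higher (assumed definition).
        position = 1 - min(idx) / total
        score = w_tfidf * tfidf[p] / max_tfidf + w_span * span + w_pos * position
        scored.append((p, score))

    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up example: a tiny tagged document and invented document frequencies.
tagged = [("中文", "n"), ("标签", "n"), ("生成", "v"), ("算法", "n"),
          ("的", "u"), ("研究", "v"), ("中文", "n"), ("标签", "n")]
doc_freq = {"中文": 50, "标签": 20, "算法": 30, "中文标签": 2}
print(score_tags(tagged, doc_freq, n_docs=100))
```

With these invented numbers the two-word phrase 中文标签 ranks first, because it is rare in the assumed corpus, appears early, and spans most of the text. Normalizing TF-IDF by its maximum keeps the three features on a comparable scale before they are mixed.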

LIU Dong and ZHANG Cai-huan. Keyphrase-based Chinese Tags Generation Hybrid Algorithm[J]. Computer Science, 2014, 41(Z6): 87-90. https://doi.org/

