Computer Science (计算机科学), 2014, Vol. 41, Issue (6): 204-207. doi: 10.11896/j.issn.1002-137X.2014.06.040

• Artificial Intelligence •

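Research and Improvement of TFIDF Text Feature Weighting Method
(Chinese title: 关键词自动提取方法的研究与改进, "Research and Improvement of Automatic Keyword Extraction Methods")

HUANG Lei, WU Yan-peng and ZHU Qun-feng

  1. College of Information Science and Engineering, Hunan University, Changsha 410082; Department of Information Engineering, Shaoyang University, Shaoyang 422000
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    General Project of the Education Department of Hunan Province (09C887): Research on a Semantic-Web-Based Retrieval System for Online Teaching Resources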

Abstract: Keyword extraction is one of the fundamental and key techniques in information retrieval and text classification. This paper first analyzes a shortcoming of the TFIDF algorithm: the IDF (Inverse Document Frequency) weight does not take into account how a feature term is distributed within a category or across categories. As a result, under the original TFIDF method some low-frequency terms that do not represent a document's topic receive high IDF values, while some high-frequency terms that do represent the topic receive low IDF values, which makes keyword extraction inaccurate. To address this, a new weight, the intra-class dispersion DI (Distribution Information), is added to raise the weight of key feature terms, yielding a new algorithm, DI-TFIDF. The experiments use the Sogou corpus, from which 1000 documents in each of three categories (sports, education and military) were selected as the experimental corpus; keywords were then extracted with both the traditional TFIDF method and the DI-TFIDF method. The experimental results show that the proposed DI-TFIDF method extracts keywords more accurately than the traditional TFIDF algorithm.

Key words: Keyword extraction, Term weighting, TFIDF, DI-TFIDF

CLC number: TP391.1 Document code: A
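
The shortcoming described in the abstract can be illustrated with a minimal sketch. It assumes the common textbook form IDF = log(N / df), which may differ from the exact variant used in the paper, and a hypothetical six-document toy corpus; the helper names `idf` and `category_counts` are illustrative only. The paper's DI weight is not reproduced here because its formula is not given in this abstract.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (category, tokens) pairs, two documents per category.
corpus = [
    ("sports",    ["goal", "team", "match"]),
    ("sports",    ["goal", "coach", "team"]),
    ("education", ["school", "exam", "today"]),
    ("education", ["school", "teacher", "student"]),
    ("military",  ["army", "drill", "today"]),
    ("military",  ["army", "navy", "base"]),
]
docs = [tokens for _, tokens in corpus]


def idf(term, docs):
    """Textbook IDF: log(N / df), where df counts documents containing the term."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def category_counts(term, corpus):
    """Number of documents containing the term, broken down by category."""
    return Counter(cat for cat, tokens in corpus if term in tokens)


# "goal" occurs only in sports documents; "today" is spread over two categories.
for term in ("goal", "today"):
    print(term, round(idf(term, docs), 4), dict(category_counts(term, corpus)))
```

Both terms receive the same IDF because each occurs in two of the six documents, even though "goal" is concentrated in a single category and "today" is scattered across two. This per-category distribution is the information that plain IDF ignores and that a class-aware weight is meant to capture.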

