Computer Science (计算机科学) ›› 2014, Vol. 41 ›› Issue (6): 204-207. doi: 10.11896/j.issn.1002-137X.2014.06.040
黄磊,伍雁鹏,朱群峰
HUANG Lei, WU Yan-peng and ZHU Qun-feng
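Abstract: Keyword extraction is one of the fundamental and key techniques in information retrieval and text classification. This paper first analyzes a shortcoming of the TFIDF algorithm: the IDF (Inverse Document Frequency) weight does not take into account how a feature term is distributed within a class and across classes. As a result, the original TFIDF method can assign a high IDF value to low-frequency terms that do not represent a document's topic, and a low IDF value to high-frequency terms that do represent it, which makes keyword extraction inaccurate. A new algorithm, DI-TFIDF, is proposed; it adds a new weight, the within-class dispersion DI (Distribution Information), to increase the weight of key feature terms. The experiments use the Sogou corpus, selecting 1,000 documents from each of three categories (sports, education, and military) as the experimental corpus, and extract keywords with both the traditional TFIDF method and the DI-TFIDF method. The experimental results show that the proposed DI-TFIDF method extracts keywords more accurately than the traditional TFIDF algorithm.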
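The IDF shortcoming described in the abstract can be illustrated with a small sketch. The abstract does not give the DI formula, so the factor used here (the fraction of documents in the document's own class that contain the term) is only an assumed stand-in, not the authors' definition. The toy corpus and the names `class_doc_fraction` and `di_tfidf` are likewise hypothetical.

```python
import math

# Toy labelled corpus (hypothetical data): (class label, tokenized document).
CORPUS = [
    ("sports", ["football", "match", "win"]),
    ("sports", ["football", "team", "typo"]),
    ("sports", ["football", "coach", "win"]),
    ("education", ["school", "exam", "teacher"]),
    ("education", ["school", "student", "exam"]),
    ("military", ["army", "drill", "officer"]),
]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Classic IDF: log(N / df). Assumes the term occurs in at least one document."""
    df = sum(1 for _, doc in corpus if term in doc)
    return math.log(len(corpus) / df)


def class_doc_fraction(term, label, corpus):
    """ASSUMED stand-in for DI: fraction of documents of class `label`
    that contain `term`. The paper's actual DI formula may differ."""
    in_class = [doc for lab, doc in corpus if lab == label]
    return sum(1 for doc in in_class if term in doc) / len(in_class)


def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


def di_tfidf(term, label, doc, corpus):
    """TF-IDF scaled by the assumed class-aware factor."""
    return tfidf(term, doc, corpus) * class_doc_fraction(term, label, corpus)


label, doc = CORPUS[1]  # the sports document containing the stray token "typo"
for term in doc:
    print(term, round(tfidf(term, doc, CORPUS), 3),
          round(di_tfidf(term, label, doc, CORPUS), 3))
```

In this toy corpus the stray token "typo" occurs in only one document, so plain TF-IDF scores it above "football", which appears in every sports document. With the assumed class-aware factor the order is reversed, which is the kind of correction the abstract attributes to the DI weight.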