Keyword Extraction Algorithm Combining Semantic Expansion and Lexical Chains

Computer Science ›› 2013, Vol. 40 ›› Issue (12): 264-269.

LIU Duan-yang, WANG Liang-fang

College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023

Published: 2018-11-16 Online: 2018-11-16
Funding: Supported by the National Natural Science Foundation of China (61202204)

English title: Extraction Algorithm Based on Semantic Expansion Integrated with Lexical Chain

Abstract

Keyword extraction quality suffers from polysemy, from synonymy, and from the difficulty of expressing the topics of a text accurately and comprehensively. To address these problems, a semantics-based keyword extraction algorithm named KESELC is proposed. Semantic similarity and semantic relevancy are computed from the Tongyici Cilin thesaurus and from statistical information, and from these the notion of semantic expansion and a method for calculating it are derived. Semantic expansion is then combined with the lexical chain method: the text goes through preprocessing, sense disambiguation of polysemous words, synonym merging, lexical chain construction, selection of effective features, and a combined weight computation. The extracted keywords avoid redundant expression of synonyms and cover the topics of the text fairly accurately and comprehensively. Comparative experiments show that KESELC extracts keywords better than a TFIDF-based method and a lexical-chain-based method, and that it has some practical value.

Key words: Tongyici Cilin, Semantic expansion, Lexical chain, Keyword extraction, Semantic analysis
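
The following is a minimal sketch of three of the stages named in the abstract (synonym merging, lexical chain construction, and chain-based keyword selection). It is not the KESELC algorithm: the abstract does not give the semantic expansion formula or the weighting scheme, so the sketch substitutes a simple shared-prefix similarity and frequency-based chain scores. The thesaurus codes in `TOY_CODES`, the five-level code layout, the `threshold` value, and all function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical thesaurus codes with five hierarchy levels, packed as
# letter, letter, two digits, letter, two digits (e.g. "Bo01A01").
# These codes are invented for this example, not taken from Tongyici Cilin.
TOY_CODES = {
    "computer": "Bo01A01",
    "PC": "Bo01A01",        # same code, so treated as a synonym of "computer"
    "laptop": "Bo01A02",
    "network": "Bo02B01",
    "apple": "Bh07A01",
}

LEVEL_ENDS = (1, 2, 4, 5, 7)  # prefix length at which each level ends


def similarity(a, b):
    """Fraction of hierarchy levels shared by two words (0.0 to 1.0)."""
    ca, cb = TOY_CODES.get(a), TOY_CODES.get(b)
    if ca is None or cb is None:
        return 0.0
    shared = 0
    for end in LEVEL_ENDS:
        if ca[:end] != cb[:end]:
            break
        shared += 1
    return shared / len(LEVEL_ENDS)


def merge_synonyms(tokens):
    """Map every word to the first-seen word that has the same code."""
    representative = {}
    merged = []
    for tok in tokens:
        code = TOY_CODES.get(tok, tok)  # unknown words stand for themselves
        merged.append(representative.setdefault(code, tok))
    return merged


def build_chains(tokens, threshold=0.6):
    """Greedily place each distinct word in the first chain it fits."""
    freq = Counter(tokens)
    chains = []
    for word in freq:
        for chain in chains:
            if any(similarity(word, other) >= threshold for other in chain):
                chain.append(word)
                break
        else:
            chains.append([word])
    return chains, freq


def extract_keywords(tokens, top_k=2):
    """Rank chains by total frequency; take the most frequent word of each."""
    chains, freq = build_chains(merge_synonyms(tokens))
    ranked = sorted(chains, key=lambda c: sum(freq[w] for w in c), reverse=True)
    return [max(chain, key=lambda w: freq[w]) for chain in ranked[:top_k]]


tokens = ["computer", "PC", "laptop", "network", "computer", "apple", "network"]
print(extract_keywords(tokens))  # ['computer', 'network']
```

In this toy run, "PC" is folded into "computer" before counting, "laptop" joins the same chain because it shares four of five code levels, and one keyword is taken per chain, so synonyms are not reported twice and each topic contributes at most one term.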

LIU Duan-yang, WANG Liang-fang. Extraction Algorithm Based on Semantic Expansion Integrated with Lexical Chain[J]. Computer Science, 2013, 40(12): 264-269. https://doi.org/

