Computer Science, 2020, Vol. 47, Issue (3): 110-115. doi: 10.11896/jsjkx.190700041

• Database & Big Data & Data Science •

Keywords Extraction Method Based on Semantic Feature Fusion

GAO Nan, LI Li-juan, Wei-william LEE, ZHU Jian-ming

  1. (School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China)
  • Received: 2019-06-04 Online: 2020-03-15 Published: 2020-03-30
  • Corresponding author: GAO Nan (gaonan@zjut.edu.cn)
  • About author: GAO Nan, born in 1983, Ph.D, is a member of China Computer Federation. Her main research interests include data mining, machine learning and intelligent transportation systems.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61702456) and the Zhejiang Public Welfare Technology Research Program (2017C33108).

Abstract: Keyword extraction is widely used in text mining and underpins automatic text summarization, classification and clustering, so extracting high-quality keywords is of considerable research value. Most existing keyword extraction methods consider only some statistical features of the text and ignore the implicit semantic features of words, which leads to low extraction accuracy and to keywords that lack semantic information. To address this problem, this paper designs an algorithm that quantifies the relationship between words and the topics of a text. First, word vectors are used to mine the contextual semantic relations of the words in the text. Then, the main semantic features of the text are extracted by clustering. Finally, the distance between each word and the text topics is computed with a similarity distance and taken as the semantic feature of that word. In addition, by combining this semantic feature with word frequency, length, position and linguistic features, the paper proposes a keyword extraction method for short texts that fuses semantic features, called SFKE. The method analyzes the importance of words at both the statistical and the semantic level, and can therefore extract the most relevant keyword set by integrating multiple factors. Experimental results show that the method fusing multiple features performs clearly better than TFIDF, TextRank, Yake, KEA and AE. Compared with the supervised AE method, its F-Score is 9.3% higher. Finally, information gain is used to evaluate the importance of the features, and the results show that adding the semantic feature raises the F-Score of the model by 7.2%.

Key words: Classification model, Semantic features, Statistical features, Support vector machine, Text mining
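
The semantic feature described in the abstract (word vectors, then clustering, then the distance from each word to the text topics) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the distance measure, so this sketch assumes k-means and Euclidean distance, and takes the distance to the nearest cluster center as the feature value. The function names (`kmeans`, `semantic_feature`) and the hand-made two-dimensional vectors are hypothetical; a real pipeline would use trained word embeddings.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors: List[Vector], k: int, iterations: int = 20) -> List[List[float]]:
    """Plain k-means; returns the cluster centers (the assumed 'topics')."""
    if k > len(vectors):
        raise ValueError("k must not exceed the number of vectors")
    # Assumption: deterministic initialisation from the first k vectors.
    centers = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: euclidean(v, centers[i]))
            groups[nearest].append(v)
        # Keep the old center when a cluster ends up empty.
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def semantic_feature(word_vectors: Dict[str, Vector], k: int) -> Dict[str, float]:
    """Distance from each word to its nearest topic center (smaller = more on-topic)."""
    centers = kmeans(list(word_vectors.values()), k)
    return {
        word: min(euclidean(vec, center) for center in centers)
        for word, vec in word_vectors.items()
    }


# Hypothetical toy embeddings: two tight topic groups plus one in-between word.
toy_vectors = {
    "keyword": [1.0, 0.0],
    "extraction": [0.9, 0.1],
    "semantic": [0.0, 1.0],
    "feature": [0.1, 0.9],
    "weather": [0.5, 0.5],
}
features = semantic_feature(toy_vectors, k=2)
for word, distance in sorted(features.items(), key=lambda item: item[1]):
    print(f"{word}: {distance:.3f}")
```

In this toy setup the word that sits between the two groups receives the largest distance, which is the behaviour the semantic feature is meant to capture. In SFKE this value would be one input among several (word frequency, length, position, linguistic features) to a classifier rather than a ranking score on its own.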

CLC Number:

  • TP391