Computer Science, 2020, Vol. 47, Issue (3): 110-115. doi: 10.11896/jsjkx.190700041
GAO Nan (高楠), LI Li-juan (李利娟), Wei-william LEE (李伟), ZHU Jian-ming (祝建明)
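Abstract: Keyword extraction is widely used in text mining and underlies research on automatic text summarization, classification, and clustering, so extracting high-quality keywords is of considerable research importance. Most existing keyword extraction methods consider only some statistical features of the text and ignore the implicit semantic features of words, which leads to low extraction accuracy and to keywords that lack semantic information. To address this problem, this paper designs an algorithm that quantifies the features relating words to the text topic. The algorithm first uses word vectors to mine the contextual semantic relations among the words in a text, then uses clustering to extract the text's main semantic features, and finally computes a similarity distance between each word and the text topic and takes that distance as the word's semantic feature. In addition, by combining this semantic feature with features describing word frequency, length, position, and language, the paper proposes a short-text keyword extraction method that fuses semantic features, abbreviated SFKE. The method analyzes word importance at both the statistical and the semantic level, so it can weigh multiple factors when extracting the most relevant set of keywords. Experimental results show that, compared with TFIDF, TextRank, Yake, KEA, and AE, the multi-feature keyword extraction method performs markedly better; compared with the supervised AE method, its F-Score improves by 9.3%. Finally, information gain is used to evaluate feature importance, and the results show that adding the semantic feature raises the model's F-Score by 7.2%.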
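The three-step semantic feature described in the abstract (word vectors, clustering, distance to the topic) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the paper's implementation: the toy 2-d vectors stand in for trained word embeddings, plain k-means stands in for the unspecified clustering method, the centroid of the largest cluster stands in for the "text topic", and cosine distance stands in for the unspecified similarity distance. The names `cosine_distance`, `kmeans`, and `semantic_feature` are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iterations=20):
    """Plain k-means under cosine distance, seeded with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid; keep the old one if a cluster is empty.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignment

def semantic_feature(word_vectors, k=2):
    """Map each word to its cosine distance from the largest cluster's centroid.

    Assumption: the largest cluster's centroid approximates the text topic.
    """
    words = list(word_vectors)
    vectors = [word_vectors[w] for w in words]
    centroids, assignment = kmeans(vectors, k)
    dominant = max(range(k), key=assignment.count)
    topic = centroids[dominant]
    return {w: cosine_distance(word_vectors[w], topic) for w in words}

# Made-up 2-d "embeddings"; real word vectors would come from a trained model.
toy = {
    "keyword": [0.9, 0.1],
    "extraction": [0.8, 0.2],
    "text": [0.85, 0.15],
    "weather": [0.1, 0.9],
}
features = semantic_feature(toy, k=2)
```

In this toy input, the off-topic word "weather" ends up farthest from the topic centroid, so a smaller value indicates a word closer to the text's main theme. In a fused method such as the one the abstract describes, this value would be one column alongside frequency, length, position, and linguistic features.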