Computer Science, 2020, Vol. 47, Issue (11A): 73-77. doi: 10.11896/jsjkx.200300121

• Artificial Intelligence •

Keyword Extraction Based on Multi-feature Fusion

DUAN Jian-yong, YOU Shi-xin, ZHANG Mei, WANG Hao

  1. School of Information, North China University of Technology, Beijing 100144, China
  • Online: 2020-11-15 Published: 2020-11-17
  • Corresponding author: DUAN Jian-yong (duanjy@ncut.edu.cn)
  • About author: DUAN Jian-yong, born in 1978, Ph.D., professor, is a member of the China Computer Federation. His main research interests include natural language processing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61972003, 61672040).

Abstract: With the development of the Internet, the volume of webpage data, new media text, and other data keeps growing, and full-text information retrieval is no longer efficient enough to support retrieval over massive data. Keyword extraction technology is therefore widely used in search engines (such as Baidu search) and new media services (such as news retrieval). The fusion model is a model that uses the BiLSTM-CRF structure and fuses multiple hand-crafted features, which allows it to complete the keyword extraction task more effectively. On top of word embedding features, the fusion model incorporates part-of-speech, word frequency, word length, and word position features; this multidimensional feature information helps the model extract deep keyword feature information more comprehensively. The fusion model combines the wide coverage and high learning ability of deep learning with the precise expressive ability of hand-crafted features, to further improve feature mining ability and shorten the required training time. In addition, the model uses a new labeling method called "LMRSN" to extract key phrases more effectively. Experimental results show that the fusion model achieves an F1 score of 62.08 in comparison with traditional models, and its performance is much higher than that of the traditional models.

Key words: Deep learning, Feature fusion, Information retrieval, Keyword extraction, Long short-term memory network
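
The abstract names the "LMRSN" labeling method and the hand-crafted features but does not define them, so the following is only a minimal sketch under stated assumptions. It assumes that L, M, and R mark the first, inner, and last word of a multi-word key phrase, S marks a single-word keyword, and N marks a non-keyword token. The function names `lmrsn_tags`, `decode_lmrsn`, and `handcrafted_features` are hypothetical, and the feature normalization is an illustrative choice, not the authors' implementation. The part-of-speech feature is omitted because it would require an external tagger.

```python
from collections import Counter


def lmrsn_tags(tokens, keyphrases):
    """Label each token with L/M/R/S/N, given key phrases as token lists.

    Assumed meaning (not defined in the abstract): L = first word of a
    multi-word key phrase, M = inner word, R = last word,
    S = single-word keyword, N = not part of any keyword.
    """
    tags = ["N"] * len(tokens)
    for phrase in keyphrases:
        n = len(phrase)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] != phrase:
                continue
            # Skip matches that overlap an already labeled span.
            if any(t != "N" for t in tags[start:start + n]):
                continue
            if n == 1:
                tags[start] = "S"
            else:
                tags[start] = "L"
                tags[start + n - 1] = "R"
                for i in range(start + 1, start + n - 1):
                    tags[i] = "M"
    return tags


def decode_lmrsn(tokens, tags):
    """Recover key phrases from a tag sequence; malformed spans are dropped."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "S":
            phrases.append([token])
            current = []
        elif tag == "L":
            current = [token]
        elif tag == "M" and current:
            current.append(token)
        elif tag == "R" and current:
            phrases.append(current + [token])
            current = []
        else:
            current = []
    return phrases


def handcrafted_features(tokens):
    """Per-token (relative frequency, word length, relative position)."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(counts[t] / total, len(t), i / total) for i, t in enumerate(tokens)]


tokens = ["keyword", "extraction", "helps", "search", "engines"]
tags = lmrsn_tags(tokens, [["keyword", "extraction"], ["search"]])
print(tags)
print(decode_lmrsn(tokens, tags))
print(handcrafted_features(tokens)[0])
```

In a fusion model of the kind the abstract describes, per-token feature tuples like these would presumably be concatenated with the word embedding before the BiLSTM layer, and the CRF layer would predict one of the five tags for each token.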

CLC Number: TP391.1