Computer Science ›› 2019, Vol. 46 ›› Issue (6A): 142-145.

• Intelligent Computing •

Text Keyword Extraction Method Based on Weighted TextRank

XU Li

  1. School of Software, Shangqiu Polytechnic, Shangqiu, Henan 476100, China;
    Suzhou Research Institute, University of Science and Technology of China, Suzhou, Jiangsu 215000, China
  • Online: 2019-06-14 Published: 2019-07-02
  • About the author: XU Li (born 1982), male, master's degree, lecturer. His main research interest is data mining. E-mail: xuli1999@163.com.

Abstract: To improve the accuracy of text keyword extraction, this paper proposes a keyword extraction method. The method combines factors that influence keyword extraction, namely word frequency, word length, word position, and part of speech, into a weight formula for candidate keywords. The relatively optimal weight coefficients of the formula are obtained by experiment, and the weight formula is then applied to the candidate keyword scoring formula of the TextRank algorithm. Experiments compare the precision, recall, and F-value of the resulting OPW-TextRank algorithm with those of the TextRank algorithm on single-document keyword extraction. The results show that, with a window size of 6, OPW-TextRank extracts keywords with higher precision than TextRank. The proposed algorithm is of practical use in natural language processing systems built on text keyword extraction.

Key words: Keyword extraction, TextRank, Weighting, Word frequency
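
The abstract does not state the weight formula, its coefficients, or exactly how the weight enters the TextRank score, so the following is only a minimal sketch of the general idea, not the paper's OPW-TextRank. It assumes a linear combination of the four factors (the values in `COEF` and `POS_WEIGHT` are made up for illustration), assumes a word's share of a neighbor's score is proportional to its weight, and assumes the text has already been segmented and part-of-speech tagged. The helper `precision_recall_f` shows the usual definitions of the three evaluation measures.

```python
from collections import Counter, defaultdict

# Illustrative assumptions only: these are NOT the coefficients from the paper.
COEF = {"freq": 0.4, "length": 0.1, "position": 0.3, "pos_tag": 0.2}
POS_WEIGHT = {"n": 1.0, "v": 0.6, "a": 0.4}  # noun, verb, adjective


def node_weights(tokens, title_len):
    """Combine frequency, length, position and part of speech for each word.

    tokens: list of (word, pos_tag); the first `title_len` tokens form the title.
    """
    freq = Counter(w for w, _ in tokens)
    max_freq = max(freq.values())
    max_len = max(len(w) for w in freq)
    in_title = {w for w, _ in tokens[:title_len]}
    tag_of = {}
    for w, tag in tokens:
        tag_of.setdefault(w, tag)  # keep the first tag seen for each word
    return {
        w: COEF["freq"] * freq[w] / max_freq
        + COEF["length"] * len(w) / max_len
        + COEF["position"] * (1.0 if w in in_title else 0.0)
        + COEF["pos_tag"] * POS_WEIGHT.get(tag_of[w], 0.1)
        for w in freq
    }


def cooccurrence_graph(words, window=6):
    """Link distinct words that appear together within `window` consecutive tokens."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1 : i + window]:
            if u != w:
                graph[w].add(u)
                graph[u].add(w)
    return graph


def weighted_textrank(tokens, title_len=0, window=6, d=0.85, iters=50):
    """Return words sorted by score, highest first."""
    weight = node_weights(tokens, title_len)
    graph = cooccurrence_graph([w for w, _ in tokens], window)
    # Total weight of each node's neighbors, used to normalize its outgoing score.
    out_sum = {j: sum(weight[k] for k in graph[j]) for j in graph}
    score = dict.fromkeys(weight, 1.0)
    for _ in range(iters):
        score = {
            i: (1 - d)
            + d * sum(weight[i] / out_sum[j] * score[j] for j in graph.get(i, ()))
            for i in weight
        }
    return sorted(score, key=score.get, reverse=True)


def precision_recall_f(extracted, reference):
    """Precision, recall and F-value of extracted keywords against a reference set."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


if __name__ == "__main__":
    # Hypothetical pre-tagged input; the first two tokens are treated as the title.
    sample = [("keyword", "n"), ("extraction", "n"), ("text", "n"), ("keyword", "n"),
              ("graph", "n"), ("rank", "v"), ("keyword", "n"), ("text", "n")]
    top = weighted_textrank(sample, title_len=2)[:3]
    print(top, precision_recall_f(top, ["keyword", "text"]))
```

Setting every coefficient so that all words receive the same weight reduces the update to unweighted TextRank on the same co-occurrence graph, which gives a simple baseline for comparing the two rankings at a given window size.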

CLC Number:

  • TP391.1