Computer Science, 2021, Vol. 48, Issue (10): 44-50. doi: 10.11896/jsjkx.200900082

• Artificial Intelligence

Graph Based Collaborative Extraction Method for Keywords and Summary from Documents

MAO Xiang-ke (毛湘科)1,2,3, HUANG Shao-bin (黄少滨)1, YU Qin-yong (余秦勇)2,3

  1 College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
  2 CETC Big Data Research Institute Co., Ltd., Guiyang 550022, China
  3 Big Data Application on Improving Governance Capabilities National Engineering Laboratory, Guiyang 550022, China
  • Received: 2020-09-10 Revised: 2021-03-10 Online: 2021-10-15 Published: 2021-10-18
  • Corresponding author: HUANG Shao-bin (huangshaobin@hrbeu.edu.cn)
  • Author e-mail: maotiamo@hrbeu.edu.cn
  • About author: MAO Xiang-ke, born in 1992, Ph.D. His main research interests include natural language processing and machine learning.
    HUANG Shao-bin, born in 1965, professor. His main research interests include data mining, natural language processing and machine learning.
  • Supported by:
    Big Data Application on Improving Governance Capabilities National Engineering Laboratory Open Fund Project.

Abstract: Keyword extraction and summary extraction both aim to select key content from a document that conveys its main meaning. The quality of extracted keywords and summaries is judged mainly by how well they cover the topics of the document. Existing graph-based methods rarely perform keyword extraction and summary extraction collaboratively. This paper proposes a graph-based method that extracts keywords and a summary jointly. The method first builds a graph from six relationships among the words, topics, and sentences of a document: word-word, topic-topic, sentence-sentence, word-topic, topic-sentence, and word-sentence. It then uses statistical features of the words and sentences to estimate the prior importance of each vertex in the graph. Next, it scores words and sentences iteratively. Finally, the keywords and the summary are selected according to the word and sentence scores. To verify the effectiveness of the proposed method, keyword extraction and summary extraction experiments are carried out on Chinese and English datasets. The proposed method achieves good results on both tasks.

Key words: Extractive summarization, Graph model, Keywords extraction, Topic coverage

CLC Number:

  • TP311.131
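
The pipeline outlined in the abstract (build a graph over words, topics, and sentences, assign each vertex a prior importance, then score iteratively) can be illustrated with a minimal sketch. The abstract does not give the update rule, so the code below swaps in a generic biased PageRank-style random walk as a stand-in; it is not the authors' formulation. All vertex names, edge weights, prior values, and the helper names `build_graph`, `rank_vertices`, and `top_k` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    graph = defaultdict(dict)
    for u, v, weight in edges:
        graph[u][v] = weight
        graph[v][u] = weight
    return graph


def rank_vertices(graph, prior, damping=0.85, tol=1e-8, max_iter=200):
    """Biased PageRank-style scoring (an assumed stand-in for the paper's rule).

    The normalised prior acts as the restart distribution of the random walk.
    """
    nodes = list(graph)
    total = sum(prior.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("prior must give positive mass to at least one vertex")
    restart = {n: prior.get(n, 0.0) / total for n in nodes}
    # Total edge weight attached to each vertex, used to normalise transitions.
    strength = {n: sum(graph[n].values()) for n in nodes}
    score = dict(restart)
    for _ in range(max_iter):
        updated = {
            n: (1 - damping) * restart[n]
            + damping * sum(score[m] * w / strength[m] for m, w in graph[n].items())
            for n in nodes
        }
        change = sum(abs(updated[n] - score[n]) for n in nodes)
        score = updated
        if change < tol:
            break
    return score


def top_k(score, prefix, k):
    """Return the k highest-scoring vertices whose name starts with prefix."""
    ranked = sorted((n for n in score if n.startswith(prefix)), key=score.get, reverse=True)
    return ranked[:k]


# Toy graph with one or more edges for each of the six relationship types.
# Prefixes: "w:" word, "t:" topic, "s:" sentence. All weights are made up.
edges = [
    ("w:graph", "w:keyword", 1.0),    # word-word
    ("w:keyword", "w:summary", 1.0),  # word-word
    ("t:0", "t:1", 0.5),              # topic-topic
    ("s:0", "s:1", 0.4),              # sentence-sentence
    ("w:graph", "t:0", 1.0),          # word-topic
    ("w:keyword", "t:0", 1.0),        # word-topic
    ("w:summary", "t:1", 1.0),        # word-topic
    ("t:0", "s:0", 1.0),              # topic-sentence
    ("t:1", "s:1", 1.0),              # topic-sentence
    ("w:graph", "s:0", 1.0),          # word-sentence
    ("w:keyword", "s:0", 1.0),        # word-sentence
    ("w:summary", "s:1", 1.0),        # word-sentence
]

# Made-up priors standing in for statistical features of words and sentences.
prior = {
    "w:graph": 2.0, "w:keyword": 3.0, "w:summary": 1.0,
    "t:0": 1.0, "t:1": 1.0,
    "s:0": 2.0, "s:1": 1.0,
}

scores = rank_vertices(build_graph(edges), prior)
keywords = top_k(scores, "w:", 2)
summary = top_k(scores, "s:", 1)
print(keywords, summary)
```

Because words, topics, and sentences share one graph, a single iteration scheme scores both words and sentences, which is the sense in which the two extraction tasks are carried out together in this sketch.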