Computer Science, 2022, Vol. 49, Issue (10): 243-251. doi: 10.11896/jsjkx.210800176

• Artificial Intelligence •

Chinese Keyword Extraction Method Combining Knowledge Graph and Pre-training Model

YAO Yi, YANG Fan

  1. College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
  • Received: 2021-08-19 Revised: 2021-12-06 Online: 2022-10-15 Published: 2022-10-13
  • Corresponding author: YANG Fan (ivan10240@163.com)
  • Author e-mail: yaoyi226@aliyun.com
  • About author: YAO Yi, born in 1981, Ph.D., associate professor, is a member of China Computer Federation. His main research interests include software engineering, natural language processing and knowledge graph.
    YANG Fan, born in 1997, postgraduate. His main research interests include knowledge graph and natural language processing.
  • Supported by:
    Military Postgraduate Research Project (JY2019C078).

Abstract: Keywords represent the theme of a text and condense its concepts and content. Through keywords, readers can quickly grasp the gist of a document, which improves the efficiency of information retrieval. Keyword extraction can also support automatic text summarization and text classification. In recent years, research on automatic keyword extraction has attracted wide attention, but extracting keywords from documents accurately remains a challenge. On the one hand, keywords reflect people's subjective understanding, so judging whether a word is a keyword is itself subjective. On the other hand, Chinese words are often rich in semantic information, and it is difficult to accurately extract the main idea of a text by relying solely on traditional statistical and topic features. To address the problems of low accuracy, information redundancy and missing information in Chinese keyword extraction, this paper proposes an unsupervised keyword extraction method that combines a knowledge graph with a pre-trained model. First, topic clustering is performed with the pre-trained model, using a sentence-level clustering method to ensure that the finally selected keywords cover the content of the whole text. Meanwhile, the knowledge graph is used for entity linking to achieve accurate word segmentation and disambiguation. Then, a semantic word graph is constructed from the topic information and used to calculate the semantic weights between words. Finally, keywords are ranked by a weighted PageRank algorithm. Comparative experiments are conducted on two public datasets, DUC 2001 and CSL, and a separately annotated CLTS dataset, with precision, recall and F1 score as the metrics. The results show that the proposed method improves precision over several baseline methods; on the CLTS dataset, its F1 score is 9.14% higher than that of the traditional statistical method TF-IDF and 4.82% higher than that of the traditional graph-based method TextRank.

Key words: Keyword extraction, Knowledge graph, Sentence embedding, Clustering, Graph-based algorithms, Pre-trained model
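
The final ranking step named in the abstract, weighted PageRank over a word graph, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the semantic weight formula, so the edge weights, the toy graph, the damping factor of 0.85 and the function names `weighted_pagerank` and `top_keywords` are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Score the nodes of an undirected weighted graph.

    edges maps (word_a, word_b) to a positive weight. Here the weight is a
    hypothetical stand-in for the semantic weight between two words.
    """
    # Build a symmetric adjacency map: neighbors[u][v] = weight.
    neighbors = defaultdict(dict)
    for (u, v), w in edges.items():
        neighbors[u][v] = w
        neighbors[v][u] = w

    nodes = sorted(neighbors)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    # Total edge weight of each node, used to normalise the score it passes on.
    out_weight = {u: sum(neighbors[u].values()) for u in nodes}

    for _ in range(iterations):
        new_scores = {}
        for v in nodes:
            # Each neighbor u passes on a share of its score proportional
            # to the weight of the edge (u, v).
            incoming = sum(
                scores[u] * neighbors[u][v] / out_weight[u]
                for u in neighbors[v]
            )
            new_scores[v] = (1.0 - damping) / n + damping * incoming
        scores = new_scores
    return scores


def top_keywords(edges, k=3):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = weighted_pagerank(edges)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy word graph with made-up weights (assumption, not data from the paper).
toy_edges = {
    ("knowledge graph", "entity linking"): 0.9,
    ("knowledge graph", "keyword extraction"): 0.8,
    ("keyword extraction", "PageRank"): 0.6,
    ("keyword extraction", "clustering"): 0.5,
    ("clustering", "sentence embedding"): 0.7,
}

print(top_keywords(toy_edges, k=3))
```

In this sketch a word scores highly when it is strongly connected to other high-scoring words, which is why the choice of edge weights matters more than the iteration itself.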

CLC Number:

  • TP391