Computer Science, 2022, Vol. 49, Issue (10): 243-251. doi: 10.11896/jsjkx.210800176

• Artificial Intelligence •

Chinese Keyword Extraction Method Combining Knowledge Graph and Pre-training Model

YAO Yi, YANG Fan   

  1. College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
  • Received: 2021-08-19 Revised: 2021-12-06 Online: 2022-10-15 Published: 2022-10-13
  • About author: YAO Yi, born in 1981, Ph.D., associate professor, is a member of China Computer Federation. His main research interests include software engineering, natural language processing and knowledge graph.
    YANG Fan, born in 1997, postgraduate. His main research interests include knowledge graph and natural language processing.
  • Supported by:
    Military Postgraduate Research Project (JY2019C078).

Abstract: Keywords represent the theme of a text and condense its concepts and content. Through keywords, readers can quickly grasp the gist of a text, which improves the efficiency of information retrieval. Keyword extraction also supports automatic text summarization and text classification. In recent years, research on automatic keyword extraction has attracted wide attention, but extracting keywords from documents accurately remains a challenge. On the one hand, keywords reflect people's subjective understanding, so judging whether a word is a keyword is itself subjective. On the other hand, Chinese words are often rich in semantic information, and it is difficult to accurately extract the main idea of a text by relying solely on traditional statistical and topic features. To address the problems of low accuracy, information redundancy, and missing information in Chinese keyword extraction, this paper proposes an unsupervised keyword extraction method that combines a knowledge graph with a pre-trained model. First, topic clustering is performed with the pre-trained model, and a sentence-based clustering method is proposed to ensure the coverage of the finally selected keywords. Then, the knowledge graph is used for entity linking to achieve accurate word segmentation and semantic disambiguation. Next, a semantic word graph is constructed based on the topic information to calculate the semantic weights between words. Finally, keywords are ranked by the weighted PageRank algorithm. Experiments are conducted on two public datasets, DUC 2001 and CSL, and on a separately annotated CLTS dataset, with precision, recall, and F1 score as the evaluation metrics. The results show that the proposed method is more accurate than the baseline methods: its F1 score is 9.14% higher than that of the traditional statistical method TF-IDF and 4.82% higher than that of the traditional graph-based method TextRank on the CLTS dataset.
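
The final ranking step named in the abstract, weighted PageRank over a word graph, can be illustrated with a minimal sketch. It assumes the semantic weights between words have already been computed. The function names (`weighted_pagerank`, `top_keywords`), the toy edge weights, the damping factor of 0.85, and the iteration count are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def weighted_pagerank(
    edges: Dict[Edge, float],
    damping: float = 0.85,   # assumed value, the common PageRank default
    iterations: int = 100,   # assumed fixed iteration budget
) -> Dict[str, float]:
    """Score the nodes of an undirected weighted graph with weighted PageRank."""
    # Build a symmetric adjacency map: node -> {neighbour: weight}.
    graph: Dict[str, Dict[str, float]] = {}
    for (u, v), w in edges.items():
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w

    nodes = list(graph)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    # Total edge weight per node, used to normalise what it passes on.
    out_weight = {node: sum(graph[node].values()) for node in nodes}

    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Each neighbour passes on a share of its score proportional
            # to the weight of the edge that links it to this node.
            incoming = sum(
                scores[nbr] * w / out_weight[nbr]
                for nbr, w in graph[node].items()
                if out_weight[nbr] > 0
            )
            new_scores[node] = (1.0 - damping) / n + damping * incoming
        scores = new_scores
    return scores


def top_keywords(edges: Dict[Edge, float], k: int = 3) -> List[str]:
    """Return the k highest-scoring words."""
    scores = weighted_pagerank(edges)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Toy semantic word graph; the weights are made up for illustration.
toy_edges = {
    ("keyword", "extraction"): 0.9,
    ("keyword", "graph"): 0.8,
    ("graph", "knowledge"): 0.7,
    ("keyword", "clustering"): 0.4,
}
print(top_keywords(toy_edges, k=2))
```

In this sketch a word ranks highly when it is strongly connected to other highly ranked words, which is why the choice of edge weights matters for the final keyword list.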

Key words: Keyword extraction, Knowledge graph, Sentence embedding, Clustering, Graph-based algorithms, Pre-trained model

CLC Number: 

  • TP391