Computer Science ›› 2023, Vol. 50 ›› Issue (10): 126-134. doi: 10.11896/jsjkx.230300079

• Artificial Intelligence •

  • Corresponding author: ZHAO Junfeng (zhaojf@pku.edu.cn)
  • About author: (dinghx@pku.edu.cn)

Active Learning-based Text Entity and Relation Joint Extraction Method

DING Hongxin1,2, ZOU Peinie1,3, ZHAO Junfeng1,2, WANG Yasha1,2   

  1 School of Computer Science, Peking University, Beijing 100871, China
  2 Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing 100871, China
  3 School of Software & Microelectronics, Peking University, Beijing 102600, China
  • Received: 2023-03-09 Revised: 2023-06-23 Online: 2023-10-10 Published: 2023-10-10
  • About author: DING Hongxin, born in 2000, postgraduate. Her main research interests include knowledge graphs and natural language processing. ZHAO Junfeng, born in 1974, Ph.D., research professor, is a member of the China Computer Federation. Her main research interests include big data analysis, knowledge graphs, and urban computing.
  • Supported by:
    National Natural Science Foundation of China (62172011) and the Fundamental Research Funds for the Central Universities of the Ministry of Education of China.


Abstract: Unstructured text contains a large amount of valuable knowledge. Extracting entities and relations from it yields structured knowledge that helps build knowledge graphs and supports downstream tasks, so entity and relation extraction has broad application prospects. Most current approaches to entity and relation extraction use deep learning, but training these models consumes large amounts of annotated data and incurs high labor costs; reducing the manual annotation workload is therefore one focus of current research. Active learning is a subfield of machine learning that selects the most valuable samples to be labeled and handed to the model for training, aiming to maximize the model's performance gain while reducing the amount of training data required. Its potential to reduce training data complements the data-hungry nature of deep learning, and deep active learning, which applies active learning to deep learning, has become a research hotspot. Against this background, this paper uses deep active learning for joint entity and relation extraction: active learning is applied to the training process of the deep learning extraction model to minimize the manually labeled data required for training while maintaining extraction performance. A deep learning model that performs joint entity and relation extraction through matrix annotation over a unified label space is adopted, and a variety of active learning sampling (query) strategies are designed and implemented on top of it. The effectiveness of the proposed method is verified on medical-domain text datasets and on commonly used joint entity and relation extraction datasets. The problem of deciding when to stop active learning is also studied: methods are proposed that choose the stopping point according to the model's training loss curve, the model's performance on the training set, and the stability of the model's predictions on held-out data, and experiments examine how to choose the stopping point in practical application scenarios. Finally, an intelligent text annotation tool based on active learning for joint entity and relation extraction is designed and implemented. It lets users annotate entities and relations in text, and it integrates the deep learning extraction model and the active learning methods to minimize users' annotation workload.
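
The abstract names two mechanisms without giving their details: a query strategy that picks the most valuable unlabeled samples, and a stopping rule based on prediction stability on held-out data. The sketch below illustrates one plausible form of each and is not the authors' implementation. It assumes least-confidence uncertainty sampling, averaged over the cells of a sample's label matrix, as a stand-in for the paper's unspecified strategies, and a simple round-to-round agreement test for stopping. All function names, the `threshold` of 0.99, and the `patience` of 2 are hypothetical.

```python
from typing import Dict, List, Sequence


def least_confidence(cell_probs: Sequence[Sequence[float]]) -> float:
    """Uncertainty of one sample: mean of (1 - max label probability) over its cells.

    Assumption: each inner sequence is the predicted label distribution for one
    cell of the sample's label matrix.
    """
    return sum(1.0 - max(cell) for cell in cell_probs) / len(cell_probs)


def select_batch(pool: Dict[str, Sequence[Sequence[float]]], k: int) -> List[str]:
    """Return the ids of the k most uncertain samples in the unlabeled pool."""
    ranked = sorted(pool, key=lambda sid: least_confidence(pool[sid]), reverse=True)
    return ranked[:k]


def prediction_agreement(prev: Sequence[int], curr: Sequence[int]) -> float:
    """Fraction of held-out predictions that did not change between two rounds."""
    if len(prev) != len(curr):
        raise ValueError("prediction lists must have the same length")
    if not prev:
        return 1.0
    return sum(p == c for p, c in zip(prev, curr)) / len(prev)


def should_stop(history: List[Sequence[int]],
                threshold: float = 0.99,
                patience: int = 2) -> bool:
    """Stop once the last `patience` round-to-round agreements all reach `threshold`.

    `history` holds the held-out predictions of each active learning round,
    oldest first. Both default values are illustrative, not from the paper.
    """
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    return all(prediction_agreement(a, b) >= threshold
               for a, b in zip(recent, recent[1:]))


if __name__ == "__main__":
    pool = {
        "s1": [[0.9, 0.1], [0.8, 0.2]],
        "s2": [[0.5, 0.5], [0.6, 0.4]],
        "s3": [[0.99, 0.01], [0.97, 0.03]],
    }
    print(select_batch(pool, 2))                         # ids to send for annotation
    print(should_stop([[0, 1, 1], [0, 1, 1], [0, 1, 1]]))  # stable predictions
```

In a full loop, the selected ids would be labeled by the annotator, added to the training set, and the model retrained before the next round. The loss-curve and training-set-performance criteria mentioned in the abstract could be checked alongside `should_stop` in the same way.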

Key words: Active learning, Knowledge extraction, Named entity recognition, Relation extraction, Human-machine interaction

CLC Number: TP311