Computer Science, 2023, Vol. 50, Issue (10): 126-134. doi: 10.11896/jsjkx.230300079
丁泓馨1,2, 邹佩聂1,3, 赵俊峰1,2, 王亚沙1,2
DING Hongxin1,2, ZOU Peinie1,3, ZHAO Junfeng1,2, WANG Yasha1,2
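Abstract: Unstructured text contains a large amount of valuable knowledge. Extracting entities and relations from it to form structured knowledge supports knowledge graph construction and downstream tasks, and has broad application prospects. Entity and relation extraction is currently tackled mostly with deep learning, but training such models consumes large amounts of labeled data and incurs high labor costs, so reducing the manual annotation workload is one of the key topics of current research. Active learning is a subfield of machine learning that selects the most valuable samples for model training, aiming to maximize the performance gain while reducing the amount of training data required; its potential to cut data requirements complements the data-hungry nature of deep learning. Deep active learning, which applies active learning to deep learning, is therefore also an active research topic. Against this background, this paper uses deep active learning for the joint extraction of entities and relations, applying active learning to the training of a deep learning extraction model so as to reduce the manually labeled data needed for training as far as possible while maintaining extraction performance. A deep learning model that performs joint entity and relation extraction through matrix (table) tagging over a unified label space is adopted, and several active learning sampling strategies are designed and implemented on top of it; the effectiveness of the proposed method is verified on a medical-domain text dataset and on commonly used joint entity and relation extraction datasets. The problem of determining when to stop active learning is also studied: methods are proposed for choosing the stopping point based on the model's training loss curve, the model's performance on the training set, and the stability of the model's predictions on held-out data, and experiments examine how to choose the stopping point for practical application scenarios. Finally, an intelligent text annotation tool for joint entity and relation extraction based on active learning is designed and implemented. The tool lets users annotate entities and relations in text, and it integrates the deep learning extraction model and the active learning methods, reducing the user's annotation workload as far as possible.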
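The abstract names two mechanisms without giving details: a sampling strategy that picks the most valuable unlabeled samples, and a stopping rule based on prediction stability on held-out data. The following is a minimal sketch of both, not the paper's implementation. It assumes entropy-based uncertainty averaged over the cells of the tagging matrix as the sampling score, and agreement between consecutive rounds of held-out predictions as the stability measure. All function names, the `threshold` and `patience` parameters, and the aggregation by mean are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of one predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_uncertainty(cell_probs: Sequence[Sequence[float]]) -> float:
    """Mean entropy over the tagging-matrix cells of one sentence.

    Averaging over cells is an assumption made for this sketch.
    """
    if not cell_probs:
        return 0.0
    return sum(entropy(p) for p in cell_probs) / len(cell_probs)


def select_batch(pool: Dict[str, Sequence[Sequence[float]]], k: int) -> List[str]:
    """Return the ids of the k most uncertain unlabeled samples."""
    ranked = sorted(pool, key=lambda sid: sample_uncertainty(pool[sid]), reverse=True)
    return ranked[:k]


def agreement(prev: Sequence[int], curr: Sequence[int]) -> float:
    """Fraction of held-out predictions unchanged between two rounds."""
    if len(prev) != len(curr):
        raise ValueError("prediction lists must have the same length")
    if not prev:
        return 1.0
    return sum(a == b for a, b in zip(prev, curr)) / len(prev)


def should_stop(history: Sequence[Sequence[int]],
                threshold: float = 0.99,
                patience: int = 2) -> bool:
    """Stop once the last `patience` consecutive round pairs agree at >= threshold."""
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    return all(agreement(a, b) >= threshold for a, b in zip(recent, recent[1:]))


if __name__ == "__main__":
    # Hypothetical per-cell label distributions for three unlabeled sentences.
    pool = {
        "s1": [[0.5, 0.5], [0.5, 0.5]],
        "s2": [[0.99, 0.01], [0.9, 0.1]],
        "s3": [[0.6, 0.4], [1.0, 0.0]],
    }
    print(select_batch(pool, 2))
    print(should_stop([[0, 1, 1], [0, 1, 1], [0, 1, 1]]))
```

In an active learning loop under these assumptions, `select_batch` would be called after each training round to choose which sentences go to the annotator, and `should_stop` would be checked with the accumulated held-out predictions before requesting further labels. The loss-curve and training-set-performance criteria mentioned in the abstract are not covered here.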