Computer Science ›› 2024, Vol. 51 ›› Issue (9): 214-222. doi: 10.11896/jsjkx.230800102

• Artificial Intelligence •

  • Corresponding author: DING Huahua (dinghuahua@nudt.edu.cn)
  • Author e-mail: cfdai@nudt.edu.cn

Domain-adaptive Entity Resolution Algorithm Based on Semi-supervised Learning

DAI Chaofan, DING Huahua   

  1. National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073, China
  • Received: 2023-08-16 Revised: 2023-11-27 Online: 2024-09-15 Published: 2024-09-10
  • About author: DAI Chaofan, born in 1973, Ph.D, professor. His main research interests include big data analytics and data quality.
    DING Huahua, born in 1999, postgraduate. His main research interests include entity resolution and data integration.



Abstract: Entity resolution, which aims to determine whether two data records refer to the same real-world entity, is a fundamental task in many natural language processing applications. Existing deep learning-based entity resolution solutions typically require large amounts of annotated data; even when pre-trained language models are used, thousands of labels are still needed to reach satisfactory accuracy, and such annotated data is hard to obtain in real-world scenarios. To address this issue, a domain-adaptive entity resolution model based on semi-supervised learning is proposed. First, a classifier is trained on the source domain; domain adaptation is then used to reduce the distributional difference between the source and target domains, while soft pseudo-labels produced on the augmented target domain are added to the source domain for iterative training, enabling knowledge transfer from the source to the target domain. Comparison and ablation experiments are conducted on 13 datasets from the same or different domains. The results show that, compared with unsupervised baseline models, the proposed model improves the average F1 score by 2.84%, 9.16%, and 7.1% across multiple datasets; compared with supervised baseline models, it achieves comparable performance with only 20%-40% of the labels. Ablation experiments further demonstrate the effectiveness of the proposed model, which obtains better entity resolution results overall (the relevant code is open-source1)).
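The training loop the abstract describes (train on the source domain, align the target distribution, keep confident soft pseudo-labels, retrain on the union) can be sketched with a toy stand-in. This is only an illustration, not the paper's method: a NumPy logistic regression replaces the pre-trained language model, a simple mean/variance alignment stands in for the domain-adaptation step, and all data and thresholds here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.5):
    """Plain logistic regression by gradient descent (toy stand-in classifier).
    Accepts soft labels y in [0, 1], not just hard 0/1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Toy labeled source data: 2-d features (think pairwise similarity scores).
Xs = rng.normal(0.0, 1.0, (200, 2))
ys = (Xs.sum(axis=1) > 0).astype(float)

# Toy unlabeled target data: same decision rule, shifted distribution.
Xt = rng.normal(0.8, 1.0, (200, 2))

# Step 1: crude feature-level adaptation -- match target mean/std to source.
Xt_aligned = (Xt - Xt.mean(axis=0)) / Xt.std(axis=0) * Xs.std(axis=0) + Xs.mean(axis=0)

# Step 2: train on the source domain, then produce soft pseudo-labels
# (predicted probabilities in [0, 1]) for the aligned target data.
w = train_logreg(Xs, ys)
soft = sigmoid(Xt_aligned @ w)

# Step 3: keep only confident pseudo-labels and retrain on source + target.
conf = (soft > 0.9) | (soft < 0.1)
X_all = np.vstack([Xs, Xt_aligned[conf]])
y_all = np.concatenate([ys, soft[conf]])   # soft labels, not hard 0/1
w2 = train_logreg(X_all, y_all)

# Sanity check: retraining with pseudo-labels should keep source accuracy high.
acc = ((sigmoid(Xs @ w2) > 0.5) == ys).mean()
```

In the actual model the loop would repeat: each round refreshes the pseudo-labels with the current classifier and the confidence filter controls how much target noise enters training.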

Key words: Entity resolution, Domain adaptation, Pseudo-labels, Pre-trained language model, Data augmentation

CLC number: TP391