Computer Science, 2024, Vol. 51, Issue (9): 214-222. doi: 10.11896/jsjkx.230800102

• Artificial Intelligence •

Domain-adaptive Entity Resolution Algorithm Based on Semi-supervised Learning

DAI Chaofan, DING Huahua   

  1. National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073, China
  • Received: 2023-08-16 Revised: 2023-11-27 Online: 2024-09-15 Published: 2024-09-10
  • About author: DAI Chaofan, born in 1973, Ph.D., professor. His main research interests include big data analytics and data quality.
    DING Huahua, born in 1999, postgraduate. His main research interests include entity resolution and data integration.

Abstract: Entity resolution, a fundamental step in many natural language processing tasks, aims to determine whether two data records refer to the same real-world entity. Existing deep learning-based solutions typically require a large amount of annotated data, even when pre-trained language models are used, and such annotated data is difficult to obtain in real-world scenarios. To address this issue, a domain-adaptive entity resolution model based on semi-supervised learning is proposed. First, a classifier is trained on the source domain, and domain adaptation is used to reduce the distributional difference between the source and target domains. Soft pseudo-labels from the augmented target domain are then added to the source domain for iterative training, enabling knowledge transfer from the source to the target domain. Comparison and ablation experiments are performed on 13 datasets from various domains. The results show that, compared to unsupervised baseline models, the proposed model improves the average F1 score by 2.84%, 9.16%, and 7.1% across multiple datasets. Compared to supervised baseline models, it achieves comparable performance with only 20% to 40% of the labels. Ablation experiments further demonstrate the effectiveness of the proposed model, which generally yields better entity resolution results (the relevant code is available).
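The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the confidence threshold of 0.9, and the selection rule are assumptions made for illustration only. It shows how confident target-domain predictions might be kept as soft labels (probabilities rather than hard 0/1 decisions) and how a binary cross-entropy loss can be evaluated against such soft labels.

```python
import math
from typing import List, Tuple


def select_soft_pseudo_labels(
    match_probs: List[float], threshold: float = 0.9
) -> List[Tuple[int, float]]:
    """Pick confident target-domain pairs and keep their soft labels.

    match_probs[i] is a (hypothetical) classifier's predicted probability
    that record pair i is a match. A pair is kept when max(p, 1 - p) is at
    least `threshold`; the label stored is p itself, not a hard 0/1 value.
    The threshold value is an assumption, not taken from the paper.
    """
    selected = []
    for i, p in enumerate(match_probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range at index {i}: {p}")
        if max(p, 1.0 - p) >= threshold:
            selected.append((i, p))
    return selected


def soft_cross_entropy(pred: float, soft_label: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy between a prediction and a soft label in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(soft_label * math.log(pred) + (1.0 - soft_label) * math.log(1.0 - pred))


if __name__ == "__main__":
    probs = [0.95, 0.5, 0.03, 0.8]
    # Only the first and third pairs clear the 0.9 confidence threshold.
    print(select_soft_pseudo_labels(probs))
    print(round(soft_cross_entropy(0.5, 1.0), 4))
```

In an iterative scheme of the kind the abstract outlines, the selected pairs would be appended to the labeled training set and the classifier retrained, with selection repeated in each round.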

Key words: Entity resolution, Domain adaptation, Pseudo-labels, Pre-trained language model, Data augmentation

CLC Number: 

  • TP391