计算机科学 ›› 2024, Vol. 51 ›› Issue (6): 198-205.doi: 10.11896/jsjkx.230400052

• 计算机图形学&多媒体 •

基于对比学习的视觉增强多模态命名实体识别

于碧辉, 谭淑月, 魏靖烜, 孙林壮, 卜立平, 赵艺曼   

  1. 中国科学院大学 北京 100049
    2. 中国科学院沈阳计算技术研究所 沈阳 110168
  • 收稿日期:2023-04-09 修回日期:2023-09-06 出版日期:2024-06-15 发布日期:2024-06-05
  • 通讯作者: 谭淑月(tanshuyue21@mails.ucas.ac.cn)
  • 基金资助:
    辽宁省应用基础研究计划项目(2022JH2/101300258)

Vision-enhanced Multimodal Named Entity Recognition Based on Contrastive Learning

YU Bihui, TAN Shuyue, WEI Jingxuan, SUN Linzhuang, BU Liping, ZHAO Yiman   

  1.University of Chinese Academy of Sciences,Beijing 100049,China
    2.Shenyang Institute of Computing Technology,Chinese Academy of Sciences,Shenyang 110168,China
  • Received:2023-04-09 Revised:2023-09-06 Online:2024-06-15 Published:2024-06-05
  • About author:YU Bihui,born in 1982,Ph.D,researcher.His main research interest is multimodal learning.
    TAN Shuyue,born in 2000,postgraduate.Her main research interests include multimodal alignment and named entity recognition.
  • Supported by:
    Applied Basic Research Project of Liaoning Province(2022JH2/101300258).

摘要: 多模态命名实体识别(MNER)的目的是在给定的图像-文本对中检测实体范围并将其分类为相应的实体类型。尽管现有的MNER方法取得了成功,但它们在使用图像编码器提取视觉特征后,均不做增强或过滤处理,而是直接送入跨模态交互机制。此外,由于文本和图像的表示来自不同的编码器,两种模态之间的语义鸿沟难以弥合。为此,提出了一个基于对比学习的视觉增强多模态命名实体识别模型(MCLAug)。首先,使用ResNet提取图像特征,并在此基础上提出金字塔双向融合策略,将低层次高分辨率和高层次强语义的图像信息结合起来,以增强视觉特征。其次,借鉴CLIP模型中的多模态对比学习思想,计算并最小化对比损失,使两种模态的表示更加一致。最后,利用跨模态注意力机制和门控融合机制获得融合后的图像和文本表示,并通过CRF解码器执行MNER任务。在两个公开数据集上进行的对比实验、消融研究和案例研究证明了所提模型的有效性。
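The pyramid bidirectional fusion strategy described above can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: a top-down pass injects coarse, strongly semantic maps into high-resolution ones, then a bottom-up pass propagates fine detail back. Function names, the additive merge, and nearest-neighbour resampling are all assumptions; a real model would use learned convolutions.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    # Stride-2 subsampling of a (C, H, W) feature map.
    return x[:, ::2, ::2]

def pyramid_bidirectional_fusion(feats):
    """feats: list of (C, H_i, W_i) maps ordered fine (high-res) to coarse.

    Top-down pass (coarse -> fine) adds strong semantics to high-resolution
    maps; bottom-up pass (fine -> coarse) propagates detail back. Maps are
    merged by simple addition in this sketch.
    """
    td = [None] * len(feats)
    td[-1] = feats[-1]
    for i in range(len(feats) - 2, -1, -1):
        td[i] = feats[i] + upsample2x(td[i + 1])
    bu = [None] * len(feats)
    bu[0] = td[0]
    for i in range(1, len(feats)):
        bu[i] = td[i] + downsample2x(bu[i - 1])
    return bu
```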

关键词: 多模态命名实体识别, CLIP, 多模态对比学习, 特征金字塔, Transformer, 门控融合机制

Abstract: Multimodal named entity recognition(MNER) aims to detect entity spans in a given image-text pair and classify them into corresponding entity types.Although existing MNER methods have achieved success,they all feed the visual features extracted by an image encoder directly into the cross-modal interaction mechanism,without enhancement or filtering.Moreover,since the representations of text and images come from different encoders,it is difficult to bridge the semantic gap between the two modalities.Therefore,a vision-enhanced multimodal named entity recognition model based on contrastive learning(MCLAug) is proposed.First,ResNet is used to extract image features.On this basis,a pyramid bidirectional fusion strategy is proposed to combine low-level high-resolution and high-level strongly semantic image information to enhance the visual features.Secondly,following the idea of multimodal contrastive learning in the CLIP model,a contrastive loss is calculated and minimized to make the representations of the two modalities more consistent.Finally,the fused image and text representations are obtained using a cross-modal attention mechanism and a gated fusion mechanism,and a CRF decoder is used to perform the MNER task.Comparative experiments,ablation studies and case studies on two public datasets demonstrate the effectiveness of the proposed model.
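The CLIP-style alignment step can be sketched as a symmetric image-text contrastive (InfoNCE) loss. The NumPy sketch below is an illustration of the idea, not the authors' implementation; the function name, embedding shapes, and temperature value are assumptions.

```python
import numpy as np

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each forms a matched pair.
    """
    # L2-normalise so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(logits.shape[0])  # the matched pair sits on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(len(lb)), lb].mean()

    # Average the image->text and text->image directions, as in CLIP.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls the two encoders' representations of a matched image-text pair together while pushing apart mismatched pairs within the batch.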

Key words: Multimodal named entity recognition, CLIP, Multimodal contrastive learning, Feature pyramid, Transformer, Gated fusion mechanism
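The gated fusion mechanism listed in the keywords can be illustrated with a short sketch. The parameterisation below (a sigmoid gate computed from the concatenated text and visual hidden states) is a common choice and an assumption; the abstract does not fix the exact form.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(text_hidden, visual_hidden, W_g, b_g):
    """Per-token, per-dimension gate that decides how much visual
    information to mix into the text representation.

    text_hidden, visual_hidden: (seq_len, dim)
    W_g: (2*dim, dim); b_g: (dim,) -- illustrative gate parameters.
    """
    gate = sigmoid(np.concatenate([text_hidden, visual_hidden], axis=1) @ W_g + b_g)
    # Convex combination: each output element lies between the two inputs.
    return gate * text_hidden + (1.0 - gate) * visual_hidden
```

Because the gate is a sigmoid, the fused representation is an elementwise convex combination of the text and visual states, letting the model suppress noisy visual evidence token by token.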

中图分类号: TP391