Computer Science, 2024, Vol. 51, Issue (6): 198-205. doi: 10.11896/jsjkx.230400052
于碧辉, 谭淑月, 魏靖烜, 孙林壮, 卜立平, 赵艺曼
YU Bihui, TAN Shuyue, WEI Jingxuan, SUN Linzhuang, BU Liping, ZHAO Yiman
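Abstract: Multimodal named entity recognition (MNER) aims to detect entity spans in a given image-text pair and classify them into the corresponding entity types. Although existing MNER methods have been successful, they all extract visual features with an image encoder and feed them directly into the cross-modal interaction mechanism, without enhancing or filtering them. Moreover, because the text and image representations come from different encoders, the semantic gap between the two modalities is hard to bridge. To address these problems, a visual enhancement multimodal named entity recognition model based on contrastive learning (MCLAug) is proposed. First, ResNet is used to extract image features, and on this basis a pyramid bidirectional fusion strategy is proposed that combines low-level, high-resolution image information with high-level, semantically strong image information to enhance the visual features. Second, following the multimodal contrastive learning idea of the CLIP model, a contrastive loss is computed and minimized so that the representations of the two modalities become more consistent. Finally, a cross-modal attention mechanism and a gated fusion mechanism are used to obtain the fused image and text representations, and a CRF decoder performs the MNER task. Comparative experiments, ablation studies, and a case study on two public datasets demonstrate the effectiveness of the proposed model.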
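The contrastive alignment step can be illustrated with a minimal sketch of a CLIP-style symmetric contrastive loss. The abstract does not give the paper's exact formulation, so everything below is an assumption made for illustration: the function names, the use of one pooled vector per text and per image, cosine similarity as the score, and the temperature value of 0.07. The sketch treats the i-th text and the i-th image in a batch as the positive pair and all other combinations as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    text_embs[i] and image_embs[i] are assumed to describe the same
    image-text pair. The temperature is an illustrative value, not one
    taken from the paper.
    """
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    n = len(texts)

    # logits[i][j]: temperature-scaled cosine similarity of text i and image j.
    logits = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in images]
        for t in texts
    ]

    # Text-to-image direction: the correct image for text i is at index i.
    t2i = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Image-to-text direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    i2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (t2i + i2t) / 2


if __name__ == "__main__":
    text_batch = [[1.0, 0.0], [0.0, 1.0]]
    matched_images = [[1.0, 0.0], [0.0, 1.0]]
    swapped_images = [[0.0, 1.0], [1.0, 0.0]]
    print(clip_style_contrastive_loss(text_batch, matched_images))
    print(clip_style_contrastive_loss(text_batch, swapped_images))
```

In this toy batch the loss is close to zero when each text embedding points in the same direction as its paired image embedding, and large when the pairs are swapped. Minimizing such a loss is what pulls the two modalities' representations toward a shared space.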