Computer Science ›› 2024, Vol. 51 ›› Issue (6): 198-205. doi: 10.11896/jsjkx.230400052

• Computer Graphics & Multimedia •

Vision-enhanced Multimodal Named Entity Recognition Based on Contrastive Learning

YU Bihui, TAN Shuyue, WEI Jingxuan, SUN Linzhuang, BU Liping, ZHAO Yiman   

  1. University of Chinese Academy of Sciences, Beijing 100049, China
  2. Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China
  • Received: 2023-04-09  Revised: 2023-09-06  Online: 2024-06-15  Published: 2024-06-05
  • About author: YU Bihui, born in 1982, Ph.D., researcher. His main research interest is multimodal learning.
    TAN Shuyue, born in 2000, postgraduate. Her main research interests include multimodal alignment and named entity recognition.
  • Supported by:
    Applied Basic Research Project of Liaoning Province(2022JH2/101300258).

Abstract: Multimodal named entity recognition (MNER) aims to detect the spans of entities in a given image-text pair and classify them into the corresponding entity types. Although existing MNER methods have achieved success, they all focus on using an image encoder to extract visual features, which are fed directly into a cross-modal interaction mechanism without any enhancement or filtering. Moreover, since the text and image representations come from different encoders, it is difficult to bridge the semantic gap between the two modalities. Therefore, a vision-enhanced multimodal named entity recognition model based on contrastive learning (MCLAug) is proposed. First, ResNet is used to collect image features; on this basis, a pyramid bidirectional fusion strategy is proposed that combines low-level, high-resolution image information with high-level, semantically strong image information to enhance the visual features. Secondly, following the idea of multimodal contrastive learning in the CLIP model, a contrastive loss is computed and minimized to make the representations of the two modalities more consistent. Finally, the fused image and text representations are obtained through a cross-modal attention mechanism and a gated fusion mechanism, and a CRF decoder is used to perform the MNER task. Comparative experiments, ablation studies and case studies on two public datasets demonstrate the effectiveness of the proposed model.
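The CLIP-style contrastive step described in the abstract is, in essence, a symmetric InfoNCE objective over a batch of paired text and image embeddings. The sketch below is only an illustration of that idea, not the authors' implementation: the use of PyTorch, the function name clip_style_contrastive_loss and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(text_emb: torch.Tensor,
                                image_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired text/image embeddings.

    text_emb, image_emb: (batch, dim) projections from the two encoders.
    Matching pairs lie on the diagonal of the similarity matrix; minimizing
    the loss pulls them together and pushes apart all other pairings.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)              # text -> image
    loss_i2t = F.cross_entropy(logits.t(), targets)          # image -> text
    return (loss_t2i + loss_i2t) / 2
```

The gated fusion mentioned at the end of the pipeline is commonly realized as a learned sigmoid gate that interpolates, per dimension, between the textual and the attended visual representation of each token. The exact formulation used by MCLAug is not reproduced here, so the module below is a generic sketch with assumed names.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Generic gated fusion of token-level text and visual representations."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, text_hidden: torch.Tensor, visual_hidden: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, seq_len, hidden_dim).
        g = torch.sigmoid(self.gate(torch.cat([text_hidden, visual_hidden], dim=-1)))
        # g near 1 keeps the textual signal; g near 0 lets visual information through.
        return g * text_hidden + (1.0 - g) * visual_hidden   # passed on to the CRF decoder
```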

Key words: Multimodal named entity recognition, CLIP, Multimodal contrastive learning, Feature pyramid, Transformer, Gated fusion mechanism

CLC Number: TP391
[1] LI Zekai, BAI Zhengyao, XIAO Xiao, ZHANG Yihan, YOU Yilin. Point Cloud Upsampling Network Incorporating Transformer and Multi-stage Learning Framework [J]. Computer Science, 2024, 51(6): 231-238.
[2] LIAO Junshuang, TAN Qinhong. DETR with Multi-granularity Spatial Attention and Spatial Prior Supervision [J]. Computer Science, 2024, 51(6): 239-246.
[3] LIU Jiasen, HUANG Jun. Center Point Target Detection Algorithm Based on Improved Swin Transformer [J]. Computer Science, 2024, 51(6): 264-271.
[4] WU Fengyuan, LIU Ming, YIN Xiaokang, CAI Ruijie, LIU Shengli. Remote Access Trojan Traffic Detection Based on Fusion Sequences [J]. Computer Science, 2024, 51(6): 434-442.
[5] ZHANG Jianliang, LI Yang, ZHU Qingshan, XUE Hongling, MA Junwei, ZHANG Lixia, BI Sheng. Substation Equipment Malfunction Alarm Algorithm Based on Dual-domain Sparse Transformer [J]. Computer Science, 2024, 51(5): 62-69.
[6] WANG Ping, YU Zhenhuang, LU Lei. Partial Near-duplicate Video Detection Algorithm Based on Transformer Low-dimensional Compact Coding [J]. Computer Science, 2024, 51(5): 108-116.
[7] ZHOU Yu, CHEN Zhihua, SHENG Bin, LIANG Lei. Multi Scale Progressive Transformer for Image Dehazing [J]. Computer Science, 2024, 51(5): 117-124.
[8] XI Ying, WU Xuemeng, CUI Xiaohui. Node Influence Ranking Model Based on Transformer [J]. Computer Science, 2024, 51(4): 106-116.
[9] HAO Ran, WANG Hongjun, LI Tianrui. Deep Neural Network Model for Transmission Line Defect Detection Based on Dual-branch Sequential Mixed Attention [J]. Computer Science, 2024, 51(3): 135-140.
[10] WANG Wenjie, YANG Yan, JING Lili, WANG Jie, LIU Yan. LNG-Transformer: An Image Classification Network Based on Multi-scale Information Interaction [J]. Computer Science, 2024, 51(2): 189-195.
[11] ZHANG Feng, HUANG Shixin, HUA Qiang, DONG Chunru. Novel Image Classification Model Based on Depth-wise Convolution Neural Network and Visual Transformer [J]. Computer Science, 2024, 51(2): 196-204.
[12] HUANG Hanqiang, XING Yunbing, SHEN Jianfei, FAN Feiyi. Sign Language Animation Splicing Model Based on LpTransformer Network [J]. Computer Science, 2023, 50(9): 184-191.
[13] TENG Sihang, WANG Lie, LI Ya. Non-autoregressive Transformer Chinese Speech Recognition Incorporating Pronunciation- Character Representation Conversion [J]. Computer Science, 2023, 50(8): 111-117.
[14] ZHU Yuying, GUO Yan, WAN Yizhao, TIAN Kai. New Word Detection Based on Branch Entropy-Segmentation Probability Model [J]. Computer Science, 2023, 50(7): 221-228.
[15] LI Tao, WANG Hairui. Remote Sensing Image Change Detection of Construction Land Based on Siamese Attention Network [J]. Computer Science, 2023, 50(6A): 220500040-5.