计算机科学 ›› 2025, Vol. 52 ›› Issue (6A): 240700060-7.doi: 10.11896/jsjkx.240700060

• 人工智能 •

混合对比学习和多视角CLIP的多模态图文情感分析

叶佳乐1, 普园媛1,2, 赵征鹏1, 冯珏1, 周联敏1, 谷金晶1   

  1 云南大学信息学院 昆明 650504
    2 中国云南省高校物联网技术及应用重点实验室 昆明 650504
  • 出版日期:2025-06-16 发布日期:2025-06-12
  • 通讯作者: 赵征鹏(zhpzhao@ynu.edu.cn)
  • 作者简介:(yejiale@itc.ynu.edu.cn)
  • 基金资助:
    国家自然科学基金(61761046,52102382,62362070);云南省科技厅应用基础研究计划重点项目(202001BB050043,202401AS070149);云南省科技重大专项(202302AF080006);研究生科研创新项目(KC-23236053)

Multi-view CLIP and Hybrid Contrastive Learning for Multimodal Image-Text Sentiment Analysis

YE Jiale1, PU Yuanyuan1,2, ZHAO Zhengpeng1, FENG Jue1, ZHOU Lianmin1, GU Jinjing1   

  1 School of Information Science and Engineering, Yunnan University, Kunming 650504, China
    2 Internet of Things Technology and Application Key Laboratory of Universities in Yunnan, Kunming 650504, China
  • Online:2025-06-16 Published:2025-06-12
  • About author: YE Jiale, born in 2001, master. Her main research interests include multimodal image-text sentiment analysis.
    ZHAO Zhengpeng, born in 1973, master, associate professor, master supervisor. His main research interests include signal and information processing, computer systems and applications.
  • Supported by:
    National Natural Science Foundation of China (61761046, 52102382, 62362070), Key Project of Applied Basic Research Programme of Yunnan Provincial Department of Science and Technology (202001BB050043, 202401AS070149), Yunnan Provincial Science and Technology Major Project (202302AF080006) and Graduate Student Innovation Project (KC-23236053).

摘要: 以往的多模态图文情感分析模型大多采用不同的编码器结构分别对图像和文本进行特征编码,重点关注探索不同的模态特征融合方法来实现情感分析。但由于独立提取的特征具有语义空间差异性,在交互时无法有效地捕捉到不同特征之间的语义关联和互补性,进而降低了情感分析的准确性。针对上述问题,文中提出了混合对比学习和多视角CLIP的多模态图文情感分析方法。具体来说,多视角CLIP特征编码模块采用CLIP对图像和文本进行联合编码表示,以提升特征的语义一致性,从图像、文本和图文交互等多个视角进行多模态情感分析。此外,通过混合对比学习模块使模型提取更具有情感特性以及有效信息的特征,提升模型的鲁棒性。其中,在图文交互时为了去除冗余信息,采用CNN和Transformer级联的融合策略,充分利用图文局部和全局信息来提高特征表示能力。最后,在3个公开数据集上进行综合实验,验证了所提方法的优越性,通过消融实验证明了所提方法各组件的有效性。

关键词: 多模态, CLIP, 对比学习, 预训练模型, 情感分析

Abstract: Most previous multimodal image-text sentiment analysis models use separate encoder structures to encode image and text features and focus on exploring different modal feature fusion methods to perform sentiment analysis. However, because independently extracted features lie in different semantic spaces, the semantic associations and complementarities between them cannot be effectively captured during interaction, which in turn reduces the accuracy of sentiment analysis. To address this problem, this paper proposes a multimodal image-text sentiment analysis method that combines multi-view CLIP with hybrid contrastive learning. Specifically, the multi-view CLIP feature encoding module employs CLIP to jointly encode image and text representations, improving the semantic consistency of the features, and performs multimodal sentiment analysis from multiple views, including the image, the text, and the image-text interaction. In addition, a hybrid contrastive learning module drives the model to extract features that carry more emotional characteristics and effective information, improving the robustness of the model. To remove redundant information during image-text interaction, a cascaded CNN-Transformer fusion strategy is adopted, which makes full use of the local and global information of images and text to strengthen the feature representation. Finally, comprehensive experiments on three public datasets verify the superiority of the proposed method, and ablation experiments demonstrate the effectiveness of each of its components.
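A minimal PyTorch sketch may help make the ideas in the abstract concrete: CLIP-style joint embeddings feed three sentiment views (image, text, and a fused image-text view), the fused view is produced by a CNN followed by a Transformer encoder (the cascaded fusion), and a supervised contrastive term stands in for one part of the hybrid contrastive learning objective. All module names, dimensions, and loss weights below are illustrative assumptions rather than the authors' implementation; in practice the 512-d inputs would come from CLIP's image and text encoders.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CascadeFusion(nn.Module):
    """CNN -> Transformer cascade over the paired image/text features."""
    def __init__(self, dim=512):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)        # local patterns
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)    # global context

    def forward(self, img_feat, txt_feat):
        x = torch.stack([img_feat, txt_feat], dim=1)      # (B, 2, D): two "tokens"
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # convolution over the pair
        x = self.transformer(x)                           # self-attention over the pair
        return x.mean(dim=1)                              # fused interaction feature


def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Pull together samples sharing a sentiment label, push apart the rest."""
    features = F.normalize(features, dim=-1)
    sim = features @ features.t() / temperature                    # (B, B) similarities
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos_mask.fill_diagonal_(0)                                     # exclude self-pairs
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()    # numerical stability
    exp = torch.exp(logits) * (1 - torch.eye(len(labels)))         # drop the diagonal
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True) + 1e-8)
    mean_pos = (pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -mean_pos.mean()


class MultiViewSentimentModel(nn.Module):
    """Three classification views: image-only, text-only, and fused image-text."""
    def __init__(self, dim=512, num_classes=3):
        super().__init__()
        self.fusion = CascadeFusion(dim)
        self.img_head = nn.Linear(dim, num_classes)
        self.txt_head = nn.Linear(dim, num_classes)
        self.fused_head = nn.Linear(dim, num_classes)

    def forward(self, img_feat, txt_feat):
        fused = self.fusion(img_feat, txt_feat)
        return (self.img_head(img_feat), self.txt_head(txt_feat),
                self.fused_head(fused), fused)


if __name__ == "__main__":
    # Stand-ins for CLIP outputs; real inputs would be CLIP image/text embeddings.
    B, D = 8, 512
    img_feat, txt_feat = torch.randn(B, D), torch.randn(B, D)
    labels = torch.randint(0, 3, (B,))                    # negative / neutral / positive
    model = MultiViewSentimentModel(dim=D)
    logits_img, logits_txt, logits_fused, fused = model(img_feat, txt_feat)
    loss = (F.cross_entropy(logits_img, labels) + F.cross_entropy(logits_txt, labels)
            + F.cross_entropy(logits_fused, labels)
            + supervised_contrastive_loss(fused, labels))
    print(loss.item())

The cascade mirrors the stated rationale: the convolution first captures local image-text patterns, after which the Transformer's self-attention aggregates global context before the three views are classified and jointly supervised.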

Key words: Multimodal, CLIP, Contrastive learning, Pre-trained models, Sentiment analysis

中图分类号: TP391
[1]ZHANG L,WANG S,LIU B.Deep learning for sentiment analysis:A survey[J].Wiley Interdisciplinary Reviews:Data Mining and Knowledge Discovery,2018,8(4):e1253.
[2]FANG Q,XU C,SANG J,et al.Word-of-mouth understanding:Entity-centric multimodal aspect-opinion mining in social media[J].IEEE Transactions on Multimedia,2015,17(12):2281-2296.
[3]GAO Y,ZHEN Y,LI H,et al.Filtering of brand-related microblogs using social-smooth multiview embedding[J].IEEE Transactions on Multimedia,2016,18(10):2115-2126.
[4]YOU Q,CAO L,CONG Y,et al.A multifaceted approach to social multimedia-based prediction of elections[J].IEEE Transactions on Multimedia,2015,17(12):2271-2280.
[5]XU N,MAO W,CHEN G.A co-memory network for multimodal sentiment analysis[C]//The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.2018:929-932.
[6]KUMAR A,VEPA J.Gated mechanism for attention based multimodal sentiment analysis[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2020:4477-4481.
[7]XIAO X,PU Y,ZHAO Z,et al.BIT:Improving image-text sentiment analysis via learning bidirectional image-text interaction[C]//2023 International Joint Conference on Neural Networks(IJCNN).IEEE,2023:1-9.
[8]RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning.PMLR,2021:8748-8763.
[9]HU A,FLAXMAN S.Multimodal sentiment analysis to explore the structure of emotions[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2018:350-358.
[10]PORIA S,CAMBRIA E,HAZARIKA D,et al.Multi-level multiple attentions for contextual multimodal sentiment analysis[C]//2017 IEEE International Conference on Data Mining(ICDM).IEEE,2017:1033-1038.
[11]HUANG F,ZHANG X,ZHAO Z,et al.Image-text sentiment analysis via deep multimodal attentive fusion[J].Knowledge-Based Systems,2019,167:26-37.
[12]HUANG P Y,PATRICK M,HU J,et al.Multilingual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models[J].arXiv:2103.08849,2021.
[13]LIN R,HU H.Multimodal contrastive learning via uni-modal coding and cross-modal prediction for multimodal sentiment analysis[J].arXiv:2210.14556,2022.
[14]NIU T,ZHU S,PANG L,et al.Sentiment analysis on multi-view social data[C]//MultiMedia Modeling:22nd International Conference,MMM 2016,Miami,FL,USA,January 4-6,2016,Proceedings,Part II 22.Springer International Publishing,2016:15-27.
[15]CAI Y,CAI H,WAN X.Multi-modal sarcasm detection in twitter with hierarchical fusion model[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:2506-2515.
[16]HU A,FLAXMAN S.Multimodal sentiment analysis to explore the structure of emotions[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2018:350-358.
[17]HUANG F,ZHANG X,ZHAO Z,et al.Image-text sentiment analysis via deep multimodal attentive fusion[J].Knowledge-Based Systems,2019,167:26-37.
[18]XU N.Analyzing multimodal public sentiment based on hierarchical semantic attentional network[C]//2017 IEEE International Conference on Intelligence and Security Informatics(ISI).IEEE,2017:152-154.
[19]XU N,MAO W.MultiSentiNet:A deep semantic network for multimodal sentiment analysis[C]//Proceedings of the 2017 ACM on Conference on Information and Knowledge Management.2017:2399-2402.
[20]YANG X,FENG S,ZHANG Y,et al.Multimodal sentiment detection based on multi-channel graph neural networks[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing(Volume 1:Long Papers).2021:328-339.
[21]ZHU T,LI L,YANG J,et al.Multimodal sentiment analysis with image-text interaction network[J].IEEE Transactions on Multimedia,2022,25:3375-3385.
[22]WANG Z,WAN Z,WAN X.TransModality:An end2end fusion method with Transformer for multimodal sentiment analysis[C]//Proceedings of the Web Conference 2020.2020:2514-2520.
[23]CHEN Z,PU Y Y,ZHAO Z P,et al.Multi-modal Sentiment Analysis Based on Adaptive Gated Information Fusion[J].Computer Science,2023,50(3):298-306.
[24]HE K,FAN H,WU Y,et al.Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:9729-9738.
[25]KHOSLA P,TETERWAK P,WANG C,et al.Supervised contrastive learning[J].Advances in Neural Information Processing Systems,2020,33:18661-18673.
[26]GAO T,YAO X,CHEN D.SimCSE:Simple contrastive learning of sentence embeddings[J].arXiv:2104.08821,2021.
[27]YUAN X,LIN Z,KUEN J,et al.Multimodal contrastive training for visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:6995-7004.
[28]LI Z,XU B,ZHU C,et al.CLMLF:A contrastive learning and multi-layer fusion method for multimodal sentiment detection[J].arXiv:2204.05515,2022.
[29]ZHU P,ZHANG W,WANG Y,et al.Multi-granularity Inter-class Correlation Based Contrastive Learning for Open Set Recognition[J].International Journal of Software & Informatics,2022,12(2):157-175.
[30]YU A W,DOHAN D,LUONG M T,et al.Qanet:Combining local convolution with global self-attention for reading comprehension[J].arXiv:1804.09541,2018.
[31]CUBUK E D,ZOPH B,SHLENS J,et al.Randaugment:Practical automated data augmentation with a reduced search space[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.2020:702-703.
[32]LONG X,GAN C,MELO G,et al.Multimodal keyless attention fusion for video classification[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2018.
[33]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[34]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:Transformers for image recognition at scale[J].arXiv:2010.11929,2020.
[35]CHEN Y.Convolutional neural network for sentence classification[D].University of Waterloo,2015.
[36]ZHOU P,SHI W,TIAN J,et al.Attention-based bidirectional long short-term memory networks for relation classification[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics(volume 2:Short papers).2016:207-212.
[37]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of deep bidirectional Transformers for language understanding[J].arXiv:1810.04805,2018.
[38]ZHANG M,CHANG K,WU Y.Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment[J].arXiv:2403.06355,2024.
[39]XU N,ZENG Z,MAO W.Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:3777-3786.
[40]QIN L,HUANG S,CHEN Q,et al.MMSD2.0:Towards a Reliable Multi-modal Sarcasm Detection System[J].arXiv:2307.07135,2023.