Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240700060-7. DOI: 10.11896/jsjkx.240700060

• Artificial Intelligence •

Multi-view CLIP and Hybrid Contrastive Learning for Multimodal Image-Text Sentiment Analysis

YE Jiale1, PU Yuanyuan1,2, ZHAO Zhengpeng1, FENG Jue1, ZHOU Lianmin1, GU Jinjing1   

  1. School of Information Science and Engineering, Yunnan University, Kunming 650504, China
  2. Internet of Things Technology and Application Key Laboratory of Universities in Yunnan, Kunming 650504, China
  • Online: 2025-06-16  Published: 2025-06-12
  • About author: YE Jiale, born in 2001, master. Her main research interests include multimodal image-text sentiment analysis.
    ZHAO Zhengpeng, born in 1973, master, associate professor, master supervisor. His main research interests include signal and information processing, and computer systems and applications.
  • Supported by:
    National Natural Science Foundation of China (61761046, 52102382, 62362070), Key Project of Applied Basic Research Programme of Yunnan Provincial Department of Science and Technology (202001BB050043, 202401AS070149), Yunnan Provincial Science and Technology Major Project (202302AF080006) and Graduate Student Innovation Project (KC-23236053).

Abstract: Most previous multimodal image-text sentiment analysis models encode image and text features with separate encoder structures and focus on exploring fusion methods for the different modal features. However, because independently extracted features lie in different semantic spaces, the semantic associations and complementarities between them cannot be effectively captured during interaction, which in turn reduces the accuracy of sentiment analysis. To address this problem, this paper proposes a multimodal image-text sentiment analysis method based on multi-view CLIP and hybrid contrastive learning. Specifically, the multi-view CLIP feature encoding module employs CLIP to jointly encode image and text representations, improving the semantic consistency of the features, and performs sentiment analysis from multiple views: image, text, and image-text interaction. In addition, the hybrid contrastive learning module drives the model to extract features that carry more emotional characteristics and effective information, improving the model's robustness. To remove redundant information in the image-text interaction, the method adopts a cascaded CNN-and-Transformer fusion strategy that exploits both the local and the global information of image and text to strengthen the feature representation. Finally, comprehensive experiments on three public datasets verify the superiority of the proposed method, and ablation experiments confirm the effectiveness of each of its components.
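To make the pipeline concrete, below is a minimal PyTorch sketch of the three ideas the abstract names: CLIP jointly encoding both modalities, sentiment heads over the image, text, and image-text interaction views, and a CNN-to-Transformer cascade for fusion. The class name MultiViewCLIPSentiment, the Hugging Face CLIP checkpoint, and all dimensions are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import CLIPModel  # assumed backend; any CLIP implementation would do

class MultiViewCLIPSentiment(nn.Module):
    """Sketch of a three-view classifier: image view, text view, and a fused
    image-text interaction view built by a CNN -> Transformer cascade."""
    def __init__(self, num_classes=3, dim=512):
        super().__init__()
        # clip-vit-base-patch32 projects both modalities into a shared 512-d space.
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.img_head = nn.Linear(dim, num_classes)   # image-only view
        self.txt_head = nn.Linear(dim, num_classes)   # text-only view
        # Cascade: a 1-D convolution captures local patterns over the modality
        # tokens; a Transformer encoder layer then models global dependencies.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.global_enc = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                     batch_first=True)
        self.fusion_head = nn.Linear(dim, num_classes)  # interaction view

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        # Treat the two modality embeddings as a two-token sequence for fusion.
        seq = torch.stack([img, txt], dim=1)                        # (B, 2, dim)
        seq = self.local_conv(seq.transpose(1, 2)).transpose(1, 2)  # local stage
        fused = self.global_enc(seq).mean(dim=1)                    # global stage
        return self.img_head(img), self.txt_head(txt), self.fusion_head(fused), img, txt

The abstract does not spell out the hybrid contrastive objective; one plausible reading, hedged accordingly, combines a cross-modal InfoNCE term over matched image-text pairs with a label-supervised term in the SupCon style [25]. The weight alpha and temperature tau below are assumed hyperparameters.

def hybrid_contrastive_loss(img, txt, labels, tau=0.07, alpha=0.5):
    """Cross-modal InfoNCE (matched pairs are positives) plus a supervised
    contrastive term (same sentiment label => positive); img/txt are assumed
    L2-normalized, as in the forward pass above."""
    B = img.size(0)
    # Cross-modal term: the i-th image and the i-th text form a positive pair.
    logits = img @ txt.t() / tau
    targets = torch.arange(B, device=img.device)
    cross = 0.5 * (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets))
    # Supervised term over the pooled image and text features.
    feats = torch.cat([img, txt], dim=0)                 # (2B, dim)
    lab = torch.cat([labels, labels], dim=0)
    sim = feats @ feats.t() / tau
    self_mask = torch.eye(2 * B, dtype=torch.bool, device=sim.device)
    pos_mask = (lab.unsqueeze(0) == lab.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    sup = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return alpha * cross + (1 - alpha) * sup.mean()

In training, each of the three view logits would receive a cross-entropy term against the sentiment label, with the contrastive loss added as a regularizer; the exact weighting used by the paper is not stated in the abstract.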

Key words: Multimodal, CLIP, Contrastive learning, Pre-trained models, Sentiment analysis

CLC Number: TP391
[1]ZHANG L,WANG S,LIU B.Deep learning for sentiment analysis:A survey[J].Wiley Interdisciplinary Reviews:Data Mining and Knowledge Discovery,2018,8(4):e1253.
[2]FANG Q,XU C,SANG J,et al.Word-of-mouth understanding:Entity-centric multimodal aspect-opinion mining in social media[J].IEEE Transactions on Multimedia,2015,17(12):2281-2296.
[3]GAO Y,ZHEN Y,LI H,et al.Filtering of brand-related microblogs using social-smooth multiview embedding[J].IEEE Transactions on Multimedia,2016,18(10):2115-2126.
[4]YOU Q,CAO L,CONG Y,et al.A multifaceted approach to social multimedia-based prediction of elections[J].IEEE Transactions on Multimedia,2015,17(12):2271-2280.
[5]XU N,MAO W,CHEN G.A co-memory network for multimodal sentiment analysis[C]//The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.2018:929-932.
[6]KUMAR A,VEPA J.Gated mechanism for attention based multimodal sentiment analysis[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2020:4477-4481.
[7]XIAO X,PU Y,ZHAO Z,et al.BIT:Improving image-text sentiment analysis via learning bidirectional image-text interaction[C]//2023 International Joint Conference on Neural Networks(IJCNN).IEEE,2023:1-9.
[8]RADFORD A,KIM J W,HALLACY C,et al.Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning.PMLR,2021:8748-8763.
[9]HU A,FLAXMAN S.Multimodal sentiment analysis to explore the structure of emotions[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2018:350-358.
[10]PORIA S,CAMBRIA E,HAZARIKA D,et al.Multi-level multiple attentions for contextual multimodal sentiment analysis[C]//2017 IEEE International Conference on Data Mining(ICDM).IEEE,2017:1033-1038.
[11]HUANG F,ZHANG X,ZHAO Z,et al.Image-text sentiment analysis via deep multimodal attentive fusion[J].Knowledge-Based Systems,2019,167:26-37.
[12]HUANG P Y,PATRICK M,HU J,et al.Multilingual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models[J].arXiv:2103.08849,2021.
[13]LIN R,HU H.Multimodal contrastive learning via uni-modal coding and cross-modal prediction for multimodal sentiment analysis[J].arXiv:2210.14556,2022.
[14]NIU T,ZHU S,PANG L,et al.Sentiment analysis on multi-view social data[C]//MultiMedia Modeling:22nd International Conference,MMM 2016,Miami,FL,USA,January 4-6,2016,Proceedings,Part II 22.Springer International Publishing,2016:15-27.
[15]CAI Y,CAI H,WAN X.Multi-modal sarcasm detection in Twitter with hierarchical fusion model[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:2506-2515.
[16]HU A,FLAXMAN S.Multimodal sentiment analysis to explore the structure of emotions[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.2018:350-358.
[17]HUANG F,ZHANG X,ZHAO Z,et al.Image-text sentiment analysis via deep multimodal attentive fusion[J].Knowledge-Based Systems,2019,167:26-37.
[18]XU N.Analyzing multimodal public sentiment based on hierarchical semantic attentional network[C]//2017 IEEE International Conference on Intelligence and Security Informatics(ISI).IEEE,2017:152-154.
[19]XU N,MAO W.MultiSentiNet:A deep semantic network for multimodal sentiment analysis[C]//Proceedings of the 2017 ACM on Conference on Information and Knowledge Management.2017:2399-2402.
[20]YANG X,FENG S,ZHANG Y,et al.Multimodal sentiment detection based on multi-channel graph neural networks[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing(Volume 1:Long Papers).2021:328-339.
[21]ZHU T,LI L,YANG J,et al.Multimodal sentiment analysis with image-text interaction network[J].IEEE Transactions on Multimedia,2022,25:3375-3385.
[22]WANG Z,WAN Z,WAN X.TransModality:An end2end fusion method with Transformer for multimodal sentiment analysis[C]//Proceedings of the Web Conference 2020.2020:2514-2520.
[23]CHEN Z,PU Y Y,ZHAO Z P,et al.Multi-modal Sentiment Analysis Based on Adaptive Gated Information Fusion[J].Computer Science,2023,50(3):298-306.
[24]HE K,FAN H,WU Y,et al.Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2020:9729-9738.
[25]KHOSLA P,TETERWAK P,WANG C,et al.Supervised contrastive learning[J].Advances in Neural Information Processing Systems,2020,33:18661-18673.
[26]GAO T,YAO X,CHEN D.SimCSE:Simple contrastive learning of sentence embeddings[J].arXiv:2104.08821,2021.
[27]YUAN X,LIN Z,KUEN J,et al.Multimodal contrastive training for visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:6995-7004.
[28]LI Z,XU B,ZHU C,et al.CLMLF:A contrastive learning and multi-layer fusion method for multimodal sentiment detection[J].arXiv:2204.05515,2022.
[29]ZHU P,ZHANG W,WANG Y,et al.Multi-granularity Inter-class Correlation Based Contrastive Learning for Open Set Recognition[J].International Journal of Software & Informatics,2022,12(2):157-175.
[30]YU A W,DOHAN D,LUONG M T,et al.QANet:Combining local convolution with global self-attention for reading comprehension[J].arXiv:1804.09541,2018.
[31]CUBUK E D,ZOPH B,SHLENS J,et al.RandAugment:Practical automated data augmentation with a reduced search space[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.2020:702-703.
[32]LONG X,GAN C,MELO G,et al.Multimodal keyless attention fusion for video classification[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2018.
[33]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[34]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16x16 words:Transformers for image recognition at scale[J].arXiv:2010.11929,2020.
[35]CHEN Y.Convolutional neural network for sentence classification[D].University of Waterloo,2015.
[36]ZHOU P,SHI W,TIAN J,et al.Attention-based bidirectional long short-term memory networks for relation classification[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics(Volume 2:Short Papers).2016:207-212.
[37]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of deep bidirectional Transformers for language understanding[J].arXiv:1810.04805,2018.
[38]ZHANG M,CHANG K,WU Y.Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment[J].arXiv:2403.06355,2024.
[39]XU N,ZENG Z,MAO W.Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:3777-3786.
[40]QIN L,HUANG S,CHEN Q,et al.MMSD2.0:Towards a Reliable Multi-modal Sarcasm Detection System[J].arXiv:2307.07135,2023.