Computer Science ›› 2026, Vol. 53 ›› Issue (1): 187-194. doi: 10.11896/jsjkx.241100029

• Computer Graphics & Multimedia •

Multimodal Sentiment Analysis for Interactive Fusion of Dual Perspectives Under Cross-modal Inconsistent Perception

BU Yunyang, QI Binting, BU Fanliang   

  1. College of Information Network Security, People’s Public Security University of China, Beijing 100038, China
  • Received: 2024-11-05 Revised: 2025-02-12 Published: 2026-01-08
  • About author: BU Yunyang, born in 2000, postgraduate. His main research interest is multimodal sentiment analysis.
    BU Fanliang, born in 1965, Ph.D, professor, Ph.D supervisor. His main research interests include computer control and information processing.
  • Supported by:
    Double First-Class Innovation Research Project for People's Public Security University of China (2023SYL08).

Abstract: On social media, people's comments usually describe a particular sentiment-bearing region of the accompanying image, so there is correspondence information between image and text. Most previous multimodal sentiment analysis methods explore image-text interactions from only a single perspective, capturing the correspondence between image regions and text words, which leads to suboptimal results. In addition, social media data is highly personal and subjective, and the sentiment it carries is multidimensional and complex, which gives rise to samples with weak image-text sentiment consistency. To address these two problems, a multimodal sentiment analysis model with interactive fusion of dual perspectives under cross-modal inconsistency perception is proposed. On the one hand, cross-modal interaction of image and text features from both global and local perspectives yields a more comprehensive and accurate sentiment analysis, improving the performance and applicability of the model. On the other hand, an inconsistency score computed over the image and text features represents the degree of image-text inconsistency, and is used to dynamically regulate the weights of the unimodal and multimodal representations in the final sentiment feature, thereby improving the robustness of the model. Extensive experiments are conducted on two public datasets, MVSA-Single and MVSA-Multiple, and the results demonstrate the effectiveness and superiority of the proposed model over existing baseline models, with F1 scores improving by 0.59 and 0.39 percentage points, respectively.
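
The abstract describes two mechanisms: interactive fusion of image and text features from a global and a local perspective, and an image-text inconsistency score that dynamically re-weights the unimodal and multimodal representations in the final sentiment feature. The PyTorch sketch below illustrates one plausible reading of that design; the mean-pooling choices, the cosine-based inconsistency score, and all layer and variable names are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualViewInconsistencyFusion(nn.Module):
    """Minimal sketch of dual-perspective fusion with inconsistency-gated
    weighting, under the assumptions stated in the lead-in text."""

    def __init__(self, dim=768, heads=8, num_classes=3):
        super().__init__()
        # Local view: token <-> region cross-attention in both directions.
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Global view: fuse the pooled sentence and image vectors.
        self.global_fuse = nn.Linear(2 * dim, dim)
        self.local_fuse = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, txt_tokens, img_regions):
        # txt_tokens: (B, Lt, dim), e.g. text-encoder token features
        # img_regions: (B, Lv, dim), e.g. projected image-region features
        t_g = txt_tokens.mean(dim=1)   # global text vector
        v_g = img_regions.mean(dim=1)  # global image vector

        # Global-perspective interaction: fuse the two pooled vectors.
        m_global = self.global_fuse(torch.cat([t_g, v_g], dim=-1))

        # Local-perspective interaction: each modality attends to the
        # other's fine-grained units (words / image regions).
        t_att, _ = self.txt2img(txt_tokens, img_regions, img_regions)
        v_att, _ = self.img2txt(img_regions, txt_tokens, txt_tokens)
        m_local = self.local_fuse(
            torch.cat([t_att.mean(dim=1), v_att.mean(dim=1)], dim=-1))

        multi = m_global + m_local  # dual-view multimodal representation

        # Inconsistency score in [0, 1]: low cosine similarity between the
        # global text and image vectors means weak sentiment consistency.
        s = 0.5 * (1.0 - F.cosine_similarity(t_g, v_g, dim=-1)).unsqueeze(-1)

        # Dynamic regulation: the more inconsistent the pair, the more the
        # final feature leans on unimodal evidence instead of the fusion.
        uni = 0.5 * (t_g + v_g)
        fused = (1.0 - s) * multi + s * uni
        return self.classifier(fused), s

# Toy usage with random features:
#   model = DualViewInconsistencyFusion(dim=768)
#   logits, score = model(torch.randn(2, 16, 768), torch.randn(2, 49, 768))
```

In this reading, a highly inconsistent pair (score near 1) pushes the classifier toward the averaged unimodal vectors, while a consistent pair relies mostly on the dual-view fused representation.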

Key words: Multimodal sentiment analysis, Cross-modal inconsistent perception, Dual-view interactive fusion, Dynamic regulation, Cross-modal interaction

CLC Number: TP391.41