Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240700060-7. doi: 10.11896/jsjkx.240700060
YE Jiale1, PU Yuanyuan1,2, ZHAO Zhengpeng1, FENG Jue1, ZHOU Lianmin1, GU Jinjing1
Abstract: Most previous multimodal image-text sentiment analysis models encode images and text separately with different encoder architectures, and focus on exploring different modality-fusion methods for sentiment analysis. However, because independently extracted features lie in semantically different spaces, their interaction cannot effectively capture the semantic correlations and complementarity between features, which lowers the accuracy of sentiment analysis. To address this problem, this paper proposes a multimodal image-text sentiment analysis method that combines hybrid contrastive learning with multi-view CLIP. Specifically, the multi-view CLIP feature encoding module uses CLIP to jointly encode images and text, improving the semantic consistency of the features, and performs multimodal sentiment analysis from multiple views: the image, the text, and the image-text interaction. In addition, a hybrid contrastive learning module drives the model to extract features that carry more sentiment-specific and informative content, improving robustness. To remove redundant information during image-text interaction, a cascaded CNN-and-Transformer fusion strategy is adopted, which exploits both local and global image-text information to strengthen the feature representation. Finally, comprehensive experiments on three public datasets verify the superiority of the proposed method, and ablation studies confirm the effectiveness of each component.
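The paper's own implementation is not shown here. As a rough, self-contained PyTorch sketch of the ideas the abstract names (CLIP-style joint embeddings, a CNN-then-Transformer cascaded fusion, and a label-supervised contrastive term), the following may help; all names such as CascadedFusion and supervised_contrastive are hypothetical, and random tensors stand in for real CLIP features.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedFusion(nn.Module):
    """Hypothetical CNN -> Transformer cascade over image/text embeddings."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        # 1-D convolution captures local cross-modal patterns and filters redundancy
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # a Transformer encoder layer then models global dependencies over the fused sequence
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, img_feat, txt_feat):
        # img_feat, txt_feat: (batch, dim) CLIP-style joint embeddings
        seq = torch.stack([img_feat, txt_feat], dim=1)           # (batch, 2, dim)
        local = self.conv(seq.transpose(1, 2)).transpose(1, 2)   # local view
        fused = self.transformer(local)                          # global view
        return fused.mean(dim=1)                                 # (batch, dim)

def supervised_contrastive(feats, labels, temperature=0.07):
    """Pull same-sentiment samples together, push different ones apart."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.T / temperature                    # pairwise similarities
    mask = labels.unsqueeze(0).eq(labels.unsqueeze(1))     # positives share a label
    mask.fill_diagonal_(False)
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # numerical stability
    exp = torch.exp(logits)
    # denominator excludes each sample's similarity with itself
    denom = exp.sum(dim=1, keepdim=True) - torch.exp(torch.diagonal(logits)).unsqueeze(1)
    log_prob = logits - torch.log(denom + 1e-12)
    pos_counts = mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * mask).sum(dim=1) / pos_counts
    return loss.mean()

# Toy usage with random features standing in for CLIP outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
labels = torch.randint(0, 3, (8,))   # 3 sentiment classes
fusion = CascadedFusion()
joint = fusion(img, txt)
loss = supervised_contrastive(joint, labels)

The cascade mirrors the abstract's local-then-global design: the convolution only sees neighboring positions in the fused sequence, while the Transformer layer attends over the whole sequence at once.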