Computer Science ›› 2024, Vol. 51 ›› Issue (9): 250-257. doi: 10.11896/jsjkx.230600047
MO Shuyuan, MENG Zuqiang
Abstract: With the development of deep learning, multimodal sentiment analysis has become a research hotspot. However, most multimodal sentiment analysis models either extract feature vectors from different modalities and simply combine them by weighted summation, so that the data cannot be accurately mapped into a unified multimodal vector space, or rely on image captioning models to translate images into text, extracting a large amount of visual semantics that carries no sentiment information; the resulting redundancy ultimately degrades model performance. To address these problems, this paper proposes VSPL, a multimodal sentiment analysis model based on visual semantics and prompt learning. The model converts images into precise, concise, emotion-laden visual semantic words, alleviating the redundancy problem. Following the prompt-learning paradigm, the obtained visual semantic words are then combined with prompt templates designed in advance for the sentiment classification task to form new texts, thereby achieving modality fusion. This design both avoids the inaccurate feature-space mapping caused by weighted summation and leverages prompt learning to elicit the latent capability of pre-trained language models. Comparative experiments on multimodal sentiment analysis tasks show that the proposed VSPL outperforms state-of-the-art baseline models on three public datasets. In addition, ablation studies, feature visualization, and case analyses further verify the effectiveness of VSPL.
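The paper itself provides no code here; as a rough illustration of the prompt-based fusion idea described in the abstract, the following minimal Python sketch combines hypothetical visual semantic words with a hand-designed template and lets a masked language model fill a sentiment slot. The Hugging Face transformers library, the roberta-base checkpoint, and the variable names, template wording, and label words (visual_words, label_words, etc.) are all illustrative assumptions, not the paper's actual design.

```python
# A minimal sketch of prompt-based modality fusion: visual semantic words
# (assumed to be pre-extracted from the image) are inserted into a sentiment
# prompt template, and a pre-trained masked LM scores label words at [MASK].
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

# Hypothetical emotion-bearing visual semantic words extracted from the image.
visual_words = ["smiling", "sunny", "festive"]
text = "Look at this crowd!"

# Fuse modalities as text: original post + visual semantics + prompt template.
prompt = (
    f"{text} The picture looks {', '.join(visual_words)}. "
    f"Overall it was {tokenizer.mask_token}."
)

# Verbalizer: map each sentiment class to a label word in the vocabulary.
label_words = {"positive": "great", "neutral": "okay", "negative": "terrible"}

inputs = tokenizer(prompt, return_tensors="pt")
mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_index]  # shape: (1, vocab_size)

scores = {}
for label, word in label_words.items():
    # Leading space so RoBERTa's BPE maps the word to its in-sentence token id.
    word_id = tokenizer.encode(" " + word, add_special_tokens=False)[0]
    scores[label] = logits[0, word_id].item()

print(max(scores, key=scores.get))  # predicted sentiment class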