Computer Science ›› 2024, Vol. 51 ›› Issue (9): 250-257. doi: 10.11896/jsjkx.230600047

• Artificial Intelligence •


Multimodal Sentiment Analysis Model Based on Visual Semantics and Prompt Learning

MO Shuyuan, MENG Zuqiang   

  1. School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
  • Received: 2023-06-06 Revised: 2023-12-25 Online: 2024-09-15 Published: 2024-09-10
  • Corresponding author: MENG Zuqiang(zqmeng@126.com)
  • About author: MO Shuyuan, born in 1997, postgraduate(msygxu2023@163.com). His main research interest is multimodal deep learning.
    MENG Zuqiang, born in 1974, Ph.D, professor, is a senior member of CCF(No.06312S). His main research interests include multimodal deep learning and granular computing.
  • Supported by:
    National Natural Science Foundation of China(62266004).


Abstract: With the development of deep learning technology, multimodal sentiment analysis has become a research hotspot. However, most multimodal sentiment analysis models either extract feature vectors from different modalities and simply fuse them by weighted summation, so that the data cannot be accurately mapped into a unified multimodal vector space, or rely on image description models to translate images into text, which extracts too many visual semantics that carry no sentiment information, causes information redundancy, and ultimately degrades model performance. To address these issues, a multimodal sentiment analysis model based on visual semantics and prompt learning, VSPL, is proposed. The model translates images into precise, concise, and sentimentally informative visual semantic words, thereby alleviating information redundancy. Based on prompt learning, the obtained visual semantic words are combined with prompt templates pre-designed for the sentiment classification task to form new text, achieving modality fusion. This not only avoids the inaccurate feature-space mapping caused by weighted summation, but also exploits prompt learning to elicit the latent capability of the pre-trained language model. Comparative experiments on multimodal sentiment analysis tasks show that the proposed VSPL outperforms advanced baseline models on three public datasets. In addition, ablation experiments, feature visualization, and case studies verify the effectiveness of VSPL.

Key words: Multimodal, Visual semantics, Prompt learning, Sentiment analysis, Pre-trained language model

CLC number: TP391