Computer Science ›› 2025, Vol. 52 ›› Issue (9): 313-319. doi: 10.11896/jsjkx.240700161

• Artificial Intelligence •

Sentiment Classification Method Based on Stepwise Cooperative Fusion Representation

GAO Long1, LI Yang2, WANG Suge1,3   

  1 School of Computer and Information Technology,Shanxi University,Taiyuan 030006,China
    2 School of Finance,Shanxi University of Finance and Economics,Taiyuan 030006,China
    3 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education,Shanxi University,Taiyuan 030006,China
  • Received:2024-07-25 Revised:2024-10-18 Online:2025-09-15 Published:2025-09-11
  • Corresponding author:LI Yang(liyangprimrose@163.com)
  • About author:GAO Long(glong202202@163.com)
  • Supported by:
    National Natural Science Foundation of China(62106130,62376143,62076158),Basic Research Program in Shanxi(20210302124084) and Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi(2021L284).

Sentiment Classification Method Based on Stepwise Cooperative Fusion Representation

GAO Long1, LI Yang2, WANG Suge1,3   

  1 School of Computer and Information Technology,Shanxi University,Taiyuan 030006,China
    2 School of Finance,Shanxi University of Finance and Economics,Taiyuan 030006,China
    3 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education,Shanxi University,Taiyuan 030006,China
  • Received:2024-07-25 Revised:2024-10-18 Online:2025-09-15 Published:2025-09-11
  • About author:GAO Long,born in 2000,postgraduate.His main research interests include natural language processing.
    LI Yang,born in 1988,Ph.D,associate professor,is a member of CCF(No.P6278M).Her main research interests include text sentiment analysis and text mining.
  • Supported by:
    National Natural Science Foundation of China(62106130,62376143,62076158),Basic Research Program in Shanxi(20210302124084) and Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi(2021L284).

Abstract: The multimodal sentiment analysis task aims to perceive and understand human emotions through various heterogeneous modalities such as language, video, and audio, yet complex correlations exist among different modalities. Most existing methods fuse the features of multiple modalities directly, ignoring that the fusion representations obtained at different steps contribute differently to sentiment analysis. To address this problem, a sentiment classification method based on stepwise cooperative fusion representation is proposed. First, a denoising bottleneck model is used to filter the noise and redundancy in the audio and video, and a Transformer performs the interactive fusion of the two modalities, establishing a low-level feature representation of the audio-video fusion; a cross-modal attention mechanism is then used to let the text modality strengthen this low-level audio-video fusion representation, constructing a high-level feature representation of the audio-video fusion. Second, a novel modality fusion layer is designed to introduce the multi-level feature representations into the pre-trained T5 model, establishing a text-centric multimodal fusion representation. Finally, the low-level feature representation, the high-level feature representation, and the text-centric fusion representation are combined to perform sentiment classification on multimodal data. Experiments on two public datasets, CMU-MOSI and CMU-MOSEI, show that the proposed method improves the Acc-7 metric by 0.1 and 0.17, respectively, over the existing baseline model ALMT, demonstrating that stepwise cooperative fusion representation can improve the performance of multimodal sentiment classification.
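To make the text-guided step concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a cross-modal attention module in which the text modality strengthens the low-level audio-video fusion representation; the module name, dimensions, and the query/key assignment are illustrative assumptions.

# Minimal sketch (not the authors' code): cross-modal attention in which the text
# modality strengthens the low-level audio-video fusion representation.
# Module name, dimensions, and the query/key assignment are illustrative assumptions.
import torch
import torch.nn as nn

class TextGuidedCrossModalAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, av_low, text_feat):
        # av_low:    (B, L_av, dim) low-level audio-video fusion features
        # text_feat: (B, L_t, dim)  text sequence features
        # The audio-video fusion queries the text modality, so textual cues
        # reinforce the fused representation (one plausible arrangement).
        refined, _ = self.attn(query=av_low, key=text_feat, value=text_feat)
        return self.norm(refined + av_low)  # residual high-level representation

# Example usage with random tensors
high = TextGuidedCrossModalAttention()(torch.randn(2, 50, 256), torch.randn(2, 20, 256))
print(high.shape)  # torch.Size([2, 50, 256])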

Keywords: Multimodal fusion, Sentiment analysis, Bottleneck mechanism, Attention mechanism, Pre-trained model

Abstract: The goal of multimodal sentiment analysis is to perceive and understand human emotions through various heterogeneous modalities, such as language, video, and audio. However, complex correlations exist between different modalities. Most existing methods directly fuse multiple modality features and overlook the fact that the fusion representations produced at different steps contribute differently to sentiment analysis. To address these issues, this paper proposes a sentiment classification method based on stepwise cooperative fusion representation. Firstly, a denoising bottleneck model is used to filter out noise and redundancy in the audio and video, and the two modalities are interactively fused through a Transformer, establishing a low-level feature representation of the audio-video fusion. A cross-modal attention mechanism is then utilized to strengthen this low-level audio-video fusion representation with the text modality, constructing a high-level feature representation of the audio-video fusion. Secondly, a novel modality fusion layer is designed to incorporate the multi-level feature representations into the pre-trained T5 model, establishing a text-centric multimodal fusion representation. Finally, the low-level feature representation, the high-level feature representation, and the text-centric fusion representation are combined to achieve sentiment classification of multimodal data. Experimental results on two public datasets, CMU-MOSI and CMU-MOSEI, indicate that the proposed method improves the Acc-7 metric by 0.1 and 0.17, respectively, compared to the existing baseline model ALMT, demonstrating that stepwise cooperative fusion representation can enhance multimodal sentiment classification performance.
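To illustrate how the low-level, high-level, and text-centric representations can be wired together for the final decision, the following PyTorch sketch assembles the pipeline described above; the bottleneck size, hidden width, layer counts, and the pooled text-centric vector standing in for the T5-based fusion layer are assumptions for illustration, not the paper's configuration.

# Illustrative sketch of the stepwise fusion pipeline; not the authors' architecture.
import torch
import torch.nn as nn

class StepwiseCooperativeFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_bottleneck=8, num_classes=7):
        super().__init__()
        # Learnable bottleneck tokens filter noise/redundancy from audio and video
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.av_encoder = nn.TransformerEncoder(layer, num_layers=2)              # low-level A/V fusion
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # high-level refinement
        self.classifier = nn.Linear(3 * dim, num_classes)                          # joint decision

    def forward(self, audio, video, text, text_centric):
        # audio/video/text: (B, L, dim); text_centric: (B, dim), e.g. a pooled T5 state
        B = audio.size(0)
        tokens = torch.cat([self.bottleneck.expand(B, -1, -1), audio, video], dim=1)
        low = self.av_encoder(tokens)[:, : self.bottleneck.size(1)]     # low-level representation
        high, _ = self.text_attn(query=low, key=text, value=text)       # text strengthens A/V fusion
        fused = torch.cat([low.mean(1), high.mean(1), text_centric], dim=-1)
        return self.classifier(fused)

# Example with random inputs
model = StepwiseCooperativeFusion()
logits = model(torch.randn(2, 50, 256), torch.randn(2, 50, 256),
               torch.randn(2, 20, 256), torch.randn(2, 256))
print(logits.shape)  # torch.Size([2, 7])

In this sketch the learnable bottleneck tokens play the role of the denoising bottleneck: only information routed through these few tokens enters the low-level audio-video representation, which is then refined by the text modality and concatenated with the text-centric vector for classification.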

Key words: Multimodal fusion, Sentiment analysis, Bottleneck mechanism, Attention mechanism, Pre-trained model

CLC Number: 

  • TP391
[1]HAZARIKA D,ZIMMERMANN R,PORIA S,et al.MISA:Modality-invariant and specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia.2020:1122-1131.
[2]NAGRANI A,YANG S,ARNAB A,et al.Attention bottlenecks for multimodal fusion[C]//Proceedings of the 35th International Conference on Neural Information Processing Systems.Red Hook,NY:Curran Associates Inc.,2021:14200-14213.
[3]WU S X,DAI D M,QIN Z W,et al.Denoising bottleneck with mutual information maximization for video multimodal fusion[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:756-767.
[4]ZHANG H Y,WANG Y,YIN G H,et al.Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:2231-2243.
[5]YU T S,GAO H Y,YANG M,et al.Speech-text dialog pre-training for spoken dialog understanding with explicit cross-modal alignment[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.ACL,2023:7900-7913.
[6]YANG H,LIN J Y,YANG A,et al.Prompt tuning for unified multimodal pretrained models[C]//Findings of the Association for Computational Linguistics.2023:402-416.
[7]TSAI Y H,LIANG P P,ZADEH A,et al.Learning factorized multimodal representations[C]//International Conference on Learning Representations.2018:53-69.
[8]HAN W,CHEN H,PORIA S,et al.Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.2021:9180-9192.
[9]GUO J W,TANG J J,DAI W C,et al.Dynamically adjust word representations using unaligned multimodal information[C]//Proceedings of the 30th ACM International Conference on Multimedia.2022:3394-3402.
[10]SUN Y,MAI S J,HU H F,et al.Learning to learn better unimodal representations via adaptive multimodal meta-learning[J].IEEE Transactions on Affective Computing,2023,14(3):2209-2223.
[11]SUN L C,ZHENG L,LIU B,et al.Efficient multimodal trans-former with dual-level feature restoration for robust multimodal sentiment analysis[J].IEEE Transactions on Affective Computing,2024,15(1):309-325.
[12]ZADEH A,CHEN M H,PORIA S,et al.Tensor fusion network for multimodal sentiment analysis[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Proces-sing.2017:1103-1114.
[13]HUANG J H,LIU B,NIU M Y.Multimodal transformer fusion for continuous emotion recognition[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).2020:3507-3511.
[14]RAHMAN W,HASAN M K,LEE S,et al.Integrating multimodal information in large pretrained transformers[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics.2020:2359-2373.
[15]LIANG T,LIN G S,FENG L,et al.Attention is not enough:mitigating the distribution discrepancy in asynchronous multimodal sequence fusion[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:8148-8156.
[16]LUO H S,JI L,HUANG Y Y,et al.ScaleVLAD:Improving multimodal sentiment analysis via multiscale fusion of locally descriptors[J].arXiv:2112.01368,2021.
[17]SUN J,HAN S K,RUAN Y P,et al.Layer-wise fusion with modality independence modeling for multi-modal emotion recognition[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:658-670.
[18]SHI T,HUANG S L.MultiEMO:An attention-based correlation-aware multimodal fusion framework for emotion recognition in conversations[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:658-670.
[19]DEGOTTEX G,KANE J,DRUGMAN T,et al.COVAREP:A collaborative voice analysis repository for speech technologies[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).2014:978-986.
[20]AMOS B,LUDWICZUK B,SATYANARAYANAN M.OpenFace:A general-purpose face recognition library with mobile applications[J/OL].https://elijah.cs.cmu.edu/DOCS/CMU-CS-16-118.pdf.
[21]ZADEH A,ZELLERS R,PINCUS E,et al.Multimodal sentiment intensity analysis in videos:Facial gestures and verbal messages[J].IEEE Intelligent Systems,2016,31(6):82-88.
[22]ZADEH A,LIANG P P,PORIA S,et al.Multimodal language analysis in the wild:CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics.2018:2236-2246.
[23]LIU Z,SHEN Y,LIANG P P,et al.Efficient low-rank multimodal fusion with modality-specific factors[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:2247-2256.
[24]TSAI Y H,BAI S,LIANG P P,et al.Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics.2019:6558-6571.
[25]LYU F M,CHEN X,HUANG Y Y,et al.Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2021:2554-2562.
[26]YU W M,XU H,YUAN Z Q,et al.Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2021:10790-10797.
[27]HU G M,LIN T E,ZHAO Y,et al.UniMSE:Towards unified multimodal sentiment analysis and emotion recognition[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.2022:7837-7851.
[28]GUO R R,GAO J L,XU R J.Aspect-based Sentiment Analysis by Fusing Multi-feature Graph Convolutional Network[J].Journal of Chinese Computer Systems,2024,45(5):1039-1045.