Computer Science, 2025, Vol. 52, Issue (9): 313-319. doi: 10.11896/jsjkx.240700161
GAO Long1, LI Yang2, WANG Suge1,3
Abstract: Multimodal sentiment analysis aims to perceive and understand human emotions through heterogeneous modalities such as language, video, and audio, but complex correlations exist among the modalities. Most existing methods fuse the features of multiple modalities directly, ignoring that fusion representations produced at different steps contribute differently to sentiment analysis. To address this problem, a sentiment classification method based on stepwise collaborative fusion representation is proposed. First, a denoising bottleneck model filters the noise and redundancy in the audio and video signals, and a Transformer performs interactive fusion of the two modalities, yielding a low-level audio-video fusion representation; a cross-modal attention mechanism then lets the text modality reinforce this low-level representation, building a high-level audio-video fusion representation. Second, a novel modality fusion layer injects these multi-level representations into the pretrained model T5 to build a text-centric multimodal fusion representation. Finally, the low-level, high-level, and text-centric fusion representations are combined to perform sentiment classification over the multimodal data. Experiments on the two public datasets CMU-MOSI and CMU-MOSEI show that the proposed method improves over the existing baseline model ALMT by 0.1 and 0.17 on the Acc-7 metric respectively, demonstrating that stepwise collaborative fusion representation can improve multimodal sentiment classification performance.
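To make the stepwise pipeline concrete, below is a minimal PyTorch sketch of the fusion order described in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the class name, feature dimensions, and layer choices are hypothetical, the denoising bottleneck is reduced to plain linear projections, and the T5-based text-centric fusion layer is approximated by mean-pooled text features rather than the actual pretrained model. Seven output classes mirror the Acc-7 metric.

```python
import torch
import torch.nn as nn

class StepwiseCollaborativeFusion(nn.Module):
    """Illustrative sketch of stepwise collaborative fusion (not the paper's code)."""

    def __init__(self, d_model: int = 128, n_heads: int = 4, n_classes: int = 7):
        super().__init__()
        # Stand-ins for the denoising bottleneck that filters audio/video
        # noise and redundancy before fusion (simplified to linear layers).
        self.audio_bottleneck = nn.Linear(d_model, d_model)
        self.video_bottleneck = nn.Linear(d_model, d_model)
        # Transformer encoder that interactively fuses the two modalities
        # into the low-level audio-video representation.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.av_fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Cross-modal attention: text queries reinforce the low-level
        # audio-video representation, yielding the high-level representation.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Classifier over the joined low-level, high-level, and text-centric
        # representations (the T5 fusion layer is replaced by pooled text here).
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, text, audio, video):
        # text/audio/video: (batch, seq_len, d_model) pre-extracted features
        a = self.audio_bottleneck(audio)
        v = self.video_bottleneck(video)
        low = self.av_fusion(torch.cat([a, v], dim=1))   # low-level AV fusion
        high, _ = self.cross_attn(text, low, low)        # text-guided high-level fusion
        pooled = torch.cat([low.mean(1), high.mean(1), text.mean(1)], dim=-1)
        return self.classifier(pooled)

model = StepwiseCollaborativeFusion()
logits = model(torch.randn(2, 20, 128), torch.randn(2, 30, 128), torch.randn(2, 30, 128))
print(logits.shape)  # torch.Size([2, 7])
```

The point of the sketch is the ordering: audio and video are denoised and fused first, text then strengthens that fused representation via cross-modal attention, and only at the end are the representations from all steps joined for classification.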
References:
[1]HAZARIKA D,ZIMMERMANN R,PORIA S,et al.MISA:Modality-invariant and specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia.2020:1122-1131.
[2]NAGRANI A,YANG S,ARNAB A,et al.Attention bottlenecks for multimodal fusion[C]//Proceedings of the 35th International Conference on Neural Information Processing Systems.Red Hook,NY:Curran Associates Inc.,2021:14200-14213.
[3]WU S X,DAI D M,QIN Z W,et al.Denoising bottleneck with mutual information maximization for video multimodal fusion[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:756-767.
[4]ZHANG H Y,WANG Y,YIN G H,et al.Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:2231-2243.
[5]YU T S,GAO H Y,YANG M,et al.Speech-text dialog pre-training for spoken dialog understanding with explicit cross-modal alignment[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:7900-7913.
[6]YANG H,LIN J Y,YANG A,et al.Prompt tuning for unified multimodal pretrained models[C]//Findings of the Association for Computational Linguistics.2023:402-416.
[7]TSAI Y H,LIANG P P,ZADEH A,et al.Learning factorized multimodal representations[C]//International Conference on Learning Representations.2018:53-69.
[8]HAN W,CHEN H,PORIA S,et al.Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.2021:9180-9192.
[9]GUO J W,TANG J J,DAI W C,et al.Dynamically adjust word representations using unaligned multimodal information[C]//Proceedings of the 30th ACM International Conference on Multimedia.2022:3394-3402.
[10]SUN Y,MAI S J,HU H F,et al.Learning to learn better unimodal representations via adaptive multimodal meta-learning[J].IEEE Transactions on Affective Computing,2023,14(3):2209-2223.
[11]SUN L C,ZHENG L,LIU B,et al.Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis[J].IEEE Transactions on Affective Computing,2024,15(1):309-325.
[12]ZADEH A,CHEN M H,PORIA S,et al.Tensor fusion network for multimodal sentiment analysis[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.2017:1103-1114.
[13]HUANG J H,LIU B,NIU M Y.Multimodal transformer fusion for continuous emotion recognition[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).2020:3507-3511.
[14]RAHMAN W,HASAN M K,LEE S,et al.Integrating multimodal information in large pretrained transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:2359-2373.
[15]LIANG T,LIN G S,FENG L,et al.Attention is not enough:mitigating the distribution discrepancy in asynchronous multimodal sequence fusion[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:8148-8156.
[16]LUO H S,JI L,HUANG Y Y,et al.ScaleVLAD:Improving multimodal sentiment analysis via multi-scale fusion of locally descriptors[J].arXiv:2112.01368,2021.
[17]SUN J,HAN S K,RUAN Y P,et al.Layer-wise fusion with modality independence modeling for multi-modal emotion recognition[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:658-670.
[18]SHI T,HUANG S L.MultiEMO:An attention-based correlation-aware multimodal fusion framework for emotion recognition in conversations[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:14752-14766.
[19]DEGOTTEX G,KANE J,DRUGMAN T,et al.COVAREP:A collaborative voice analysis repository for speech technologies[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).2014:978-986.
[20]AMOS B,LUDWICZUK B,SATYANARAYANAN M.OpenFace:A general-purpose face recognition library with mobile applications[J/OL].https://elijah.cs.cmu.edu/DOCS/CMU-CS-16-118.pdf.
[21]ZADEH A,ZELLERS R,PINCUS E,et al.Multimodal sentiment intensity analysis in videos:Facial gestures and verbal messages[J].IEEE Intelligent Systems,2016,31(6):82-88.
[22]ZADEH A,LIANG P P,PORIA S,et al.Multimodal language analysis in the wild:CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:2236-2246.
[23]LIU Z,SHEN Y,LIANG P P,et al.Efficient low-rank multimodal fusion with modality-specific factors[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:2247-2256.
[24]TSAI Y H,BAI S,LIANG P P,et al.Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:6558-6571.
[25]LYU F M,CHEN X,HUANG Y Y,et al.Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2021:2554-2562.
[26]YU W M,XU H,YUAN Z Q,et al.Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2021:10790-10797.
[27]HU G M,LIN T E,ZHAO Y,et al.UniMSE:Towards unified multimodal sentiment analysis and emotion recognition[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.2022:7837-7851.
[28]GUO R R,GAO J L,XU R J.Aspect-based sentiment analysis by fusing multi-feature graph convolutional network[J].Journal of Chinese Computer Systems,2024,45(5):1039-1045.