Computer Science ›› 2025, Vol. 52 ›› Issue (9): 313-319.doi: 10.11896/jsjkx.240700161

• Artificial Intelligence •

Sentiment Classification Method Based on Stepwise Cooperative Fusion Representation

GAO Long1, LI Yang2, WANG Suge1,3   

  1 School of Computer and Information Technology,Shanxi University,Taiyuan 030006,China
    2 School of Finance,Shanxi University of Finance and Economics,Taiyuan 030006,China
    3 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education,Shanxi University,Taiyuan 030006,China
  • Received:2024-07-25 Revised:2024-10-18 Online:2025-09-15 Published:2025-09-11
  • About author:GAO Long,born in 2000,postgraduate.His main research interests include natural language processing.
    LI Yang,born in 1988,Ph.D,associate professor,is a member of CCF(No.P6278M).Her main research interests include text sentiment analysis and text mining.
  • Supported by:
    National Natural Science Foundation of China(62106130,62376143,62076158),Basic Research Program in Shanxi(20210302124084) and Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi(2021L284).

Abstract: The goal of multimodal sentiment analysis is to perceive and understand human emotions through various heterogeneous modalities, such as language, video, and audio. However, the correlations between different modalities are complex. Most existing methods fuse the features of multiple modalities directly and overlook the fact that asynchronous modality fusion representations contribute differently to sentiment analysis. To address these issues, this paper proposes a sentiment classification method based on stepwise cooperative fusion representation. First, a denoising bottleneck model is used to filter out noise and redundancy in the audio and video modalities, and the two modalities are fused through a Transformer to establish a low-level feature representation of the audio-video fusion. Then, a cross-modal attention mechanism is used to enhance the audio-video modalities with the text modality, constructing a high-level feature representation of the audio-video fusion. Next, a novel multimodal fusion layer is designed to inject the multi-level feature representations into the pre-trained T5 model, establishing a text-centric multimodal fusion representation. Finally, the low-level feature representation, the high-level feature representation, and the text-centric fusion representation are combined to perform sentiment classification on multimodal data. Experimental results on two public datasets, CMU-MOSI and CMU-MOSEI, show that the proposed method improves the Acc-7 metric by 0.1 and 0.17, respectively, over the existing baseline model ALMT, demonstrating that stepwise cooperative fusion representation can enhance multimodal sentiment classification performance.
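To make the described pipeline concrete, the following is a minimal PyTorch sketch of a stepwise cooperative fusion of this kind: a denoising bottleneck for audio and video, Transformer-based audio-video fusion (low-level representation), text-guided cross-modal attention (high-level representation), and a text-centric fusion stage that here uses a plain Transformer encoder as a stand-in for the pre-trained T5. All module names, dimensions, and the pooling/classification head are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a stepwise cooperative fusion pipeline (assumed structure,
# not the paper's official code). Requires PyTorch.
import torch
import torch.nn as nn


class StepwiseCooperativeFusion(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_classes=7):
        super().__init__()
        # Denoising bottleneck: squeeze audio/video features through a narrow
        # layer to filter noise and redundancy before fusion.
        self.audio_bottleneck = nn.Sequential(
            nn.Linear(d_model, d_model // 4), nn.ReLU(), nn.Linear(d_model // 4, d_model))
        self.video_bottleneck = nn.Sequential(
            nn.Linear(d_model, d_model // 4), nn.ReLU(), nn.Linear(d_model // 4, d_model))
        # Transformer fusion of the denoised audio-video pair (low-level representation).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.av_fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Cross-modal attention: text queries enhance the audio-video fusion
        # (high-level representation).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in text encoder; the paper instead injects the fused features
        # into a pre-trained T5 through a dedicated multimodal fusion layer.
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, text, audio, video):
        a = self.audio_bottleneck(audio)
        v = self.video_bottleneck(video)
        low = self.av_fusion(torch.cat([a, v], dim=1))               # low-level A-V fusion
        high, _ = self.cross_attn(query=text, key=low, value=low)    # text-enhanced A-V fusion
        fused_text = self.text_encoder(text + high)                  # text-centric fusion
        # Pool each level and combine the three representations for classification.
        feats = torch.cat([low.mean(1), high.mean(1), fused_text.mean(1)], dim=-1)
        return self.classifier(feats)


if __name__ == "__main__":
    model = StepwiseCooperativeFusion()
    text = torch.randn(2, 20, 128)    # toy inputs: (batch, seq_len, d_model)
    audio = torch.randn(2, 50, 128)
    video = torch.randn(2, 30, 128)
    print(model(text, audio, video).shape)  # torch.Size([2, 7])
```

The three-way concatenation at the end mirrors the paper's idea of combining the low-level, high-level, and text-centric representations; in the actual method the last of these comes from the T5 backbone rather than a generic encoder.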

Key words: Multimodal fusion, Sentiment analysis, Bottleneck mechanism, Attention mechanism, Pre-trained model

CLC Number: TP391

References
[1]HAZARIKA D,ZIMMERMANN R,PORIA S,et al.MISA:Modality-invariant and specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia.2020:1122-1131.
[2]NAGRANI A,YANG S,ARNAB A,et al.Attention bottlenecks for multimodal fusion[C]//Proceedings of the 35th International Conference on Neural Information Processing Systems.Red Hook,NY:Curran Associates Inc.,2021:14200-14213.
[3]WU S X,DAI D M,QIN Z W,et al.Denoising bottleneck with mutual information maximization for video multimodal fusion[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.2023:756-767.
[4]ZHANG H Y,WANG Y,YIN G H,et al.Learning language-guided adaptive hyper-modality representation for multimodal sentiment analysis[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:2231-2243.
[5]YU T S,GAO H Y,YANG M,et al.Speech-text dialog pre-training for spoken dialog understanding with explicit cross-modal alignment[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:7900-7913.
[6]YANG H,LIN J Y,YANG A,et al.Prompt tuning for unified multimodal pretrained models[C]//Findings of the Association for Computational Linguistics.2023:402-416.
[7]TSAI Y H,LIANG P P,ZADEH A,et al.Learning factorized multimodal representations[C]//International Conference on Learning Representations.2018:53-69.
[8]HAN W,CHEN H,PORIA S,et al.Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.2021:9180-9192.
[9]GUO J W,TANG J J,DAI W C,et al.Dynamically adjust word representations using unaligned multimodal information[C]//Proceedings of the 30th ACM International Conference on Multimedia.2022:3394-3402.
[10]SUN Y,MAI S J,HU H F,et al.Learning to learn better unimodal representations via adaptive multimodal meta-learning[J].IEEE Transactions on Affective Computing,2023,14(3):2209-2223.
[11]SUN L C,ZHENG L,LIU B,et al.Efficient multimodal trans-former with dual-level feature restoration for robust multimodal sentiment analysis[J].IEEE Transactions on Affective Computing,2024,15(1):309-325.
[12]ZADEH A,CHEN M H,PORIA S,et al.Tensor fusion network for multimodal sentiment analysis[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.2017:1103-1114.
[13]HUANG J H,LIU B,NIU M Y.Multimodal transformer fusion for continuous emotion recognition[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).2020:3507-3511.
[14]RAHMAN W,HASAN M K,LEE S,et al.Integrating multimodal information in large pretrained transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:2359-2373.
[15]LIANG T,LIN G S,FENG L,et al.Attention is not enough:mitigating the distribution discrepancy in asynchronous multimodal sequence fusion[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2021:8148-8156.
[16]LUO H S,JI L,HUANG Y Y,et al.ScaleVLAD:Improving multimodal sentiment analysis via multiscale fusion of locally descriptors[J].arXiv:2112.01368,2021.
[17]SUN J,HAN S K,RUAN Y P,et al.Layer-wise fusion with modality independence modeling for multi-modal emotion recognition[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:658-670.
[18]SHI T,HUANG S L.MultiEMO:An attention-based correlation-aware multimodal fusion framework for emotion recognition in conversations[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:658-670.
[19]DEGOTTEX G,KANE J,DRUGMAN T,et al.COVAREP:A collaborative voice analysis repository for speech technologies[C]//Proceedings of the IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).2014:978-986.
[20]AMOS B,LUDWICZUK B,SATYANARAYANAN M.OpenFace:A general-purpose face recognition library with mobile applications[J/OL].https://elijah.cs.cmu.edu/DOCS/CMU-CS-16-118.pdf.
[21]ZADEH A,ZELLERS R,PINCUS E,et al.Multimodal sentiment intensity analysis in videos:Facial gestures and verbal messages[J].IEEE Intelligent Systems,2016,31(6):82-88.
[22]ZADEH A,LIANG P P,PORIA S,et al.Multimodal language analysis in the wild:CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics.2018:2236-2246.
[23]LIU Z,SHEN Y,LIANG P P,et al.Efficient low-rank multimodal fusion with modality-specific factors[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.2018:2247-2256.
[24]TSAI Y H,BAI S,LIANG P P,et al.Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.2019:6558-6571.
[25]LYU F M,CHEN X,HUANG Y Y,et al.Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2021:2554-2562.
[26]YU W M,XU H,YUAN Z Q,et al.Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2021:10790-10797.
[27]HU G M,LIN T E,ZHAO Y,et al.UniMSE:Towards unified multimodal sentiment analysis and emotion recognition[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.2022:7837-7851.
[28]GUO R R,GAO J L,XU R J.Aspect-based Sentiment Analysis by Fusing Multi-feature Graph Convolutional Network[J].Journal of Chinese Computer Systems,2024,45(5):1039-1045.