Computer Science ›› 2025, Vol. 52 ›› Issue (7): 210-217. doi: 10.11896/jsjkx.240600127

• Artificial Intelligence •

Cross-modal Hypergraph Optimisation Learning for Multimodal Sentiment Analysis

JIANG Kun1, ZHAO Zhengpeng1, PU Yuanyuan1,2, HUANG Jian1, GU Jinjing1, XU Dan1

  1. School of Information Science and Engineering, Yunnan University, Kunming 650500, China
    2. Internet of Things Technology and Application Key Laboratory of Universities in Yunnan, Kunming 650500, China
  • Received: 2024-06-21  Revised: 2024-09-18  Published: 2025-07-17
  • Corresponding author: ZHAO Zhengpeng (zhpzhao@ynu.edu.cn)
  • About author: JIANG Kun, born in 1998, master (jiangkun_w6nl@stu.ynu.edu.cn). His main research interests include multimodal sentiment analysis.
    ZHAO Zhengpeng, born in 1973, associate professor, master's supervisor. His main research interests include signal and information processing, and computer systems and applications.
  • Supported by:
    National Natural Science Foundation of China (61271361, 61761046, 62162068, 52102382, 62362070), Key Project of the Applied Basic Research Program of the Yunnan Provincial Department of Science and Technology (202001BB050043, 202401AS070149), Yunnan Provincial Science and Technology Major Project (202302AF080006) and Graduate Student Innovation Project (KC-23236053).

Abstract: Multimodal sentiment analysis aims to detect sentiment more accurately by combining textual, acoustic, and visual information. Previous studies employ graph neural networks to capture cross-modal and cross-temporal interactions between nodes and thereby obtain highly expressive sentiment representations. However, an ordinary graph edge connects only two nodes, so graph neural networks can model pairwise interactions only, which limits the exploitation of the complex inter-modal sentiment interactions latent in multimodal data. This paper therefore proposes a multimodal sentiment analysis framework based on a cross-modal hypergraph neural network. Because a hyperedge can connect two or more nodes, the hypergraph structure makes full use of complex intra-modal and inter-modal sentiment interactions and mines deeper sentiment representations across the data. In addition, a hypergraph adaptive module is proposed to optimise the structure of the initial hypergraph: through node-edge cross-attention, hyperedge sampling, and event-node sampling, the adaptive network discovers latent implicit connections and prunes redundant hyperedges and irrelevant event nodes, thereby updating and optimising the hypergraph structure. Compared with the initial structure, the updated hypergraph describes the latent sentiment correlations among the data more accurately and completely, which leads to better sentiment classification. Extensive experiments on the publicly available CMU-MOSI and CMU-MOSEI datasets show that the proposed framework improves over other state-of-the-art algorithms by 1% to 6% on several performance metrics.
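The framework's central ingredient is hypergraph convolution, in which a single hyperedge can join two or more event nodes, unlike an ordinary graph edge. As a minimal, self-contained sketch of that operation (the generic HGNN-style formulation, not the authors' implementation), the Python/PyTorch code below applies X' = σ(Dv^{-1/2} H W De^{-1} Hᵀ Dv^{-1/2} X Θ) to a toy incidence matrix; the tensor shapes and the toy data are assumptions made only for illustration.

```python
import torch

def hypergraph_conv(X, H, Theta, edge_w=None):
    """One HGNN-style hypergraph convolution:
    X' = relu(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta).
    X: (N, F) node features; H: (N, E) incidence matrix (H[v, e] = 1 if node v
    belongs to hyperedge e); Theta: (F, F_out) learnable projection;
    edge_w: optional (E,) hyperedge weights."""
    N, E = H.shape
    w = edge_w if edge_w is not None else torch.ones(E)
    Dv = (H * w).sum(dim=1)                      # weighted node degrees
    De = H.sum(dim=0)                            # hyperedge degrees
    Dv_inv_sqrt = torch.diag(Dv.clamp(min=1e-12).pow(-0.5))
    De_inv = torch.diag(De.clamp(min=1e-12).pow(-1.0))
    out = Dv_inv_sqrt @ H @ torch.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt @ X @ Theta
    return torch.relu(out)

# Toy usage: 5 event nodes (e.g. text/audio/visual segments) and 3 hyperedges,
# each joining more than two nodes -- the kind of interaction a plain graph
# edge cannot express.
torch.manual_seed(0)
X = torch.randn(5, 8)
H = torch.tensor([[1., 0., 1.],
                  [1., 1., 0.],
                  [0., 1., 1.],
                  [1., 0., 0.],
                  [0., 1., 1.]])
Theta = torch.randn(8, 4)
print(hypergraph_conv(X, H, Theta).shape)        # torch.Size([5, 4])
```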

Key words: Multimodal sentiment analysis, Hypergraph neural networks, Hypergraph optimisation, Adaptive networks, Node-edge information fusion
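The hypergraph adaptive module summarised in the abstract scores hyperedges with node-edge cross-attention and then samples which hyperedges and event nodes to keep. The paper's exact formulation is not reproduced here; the sketch below is a hypothetical reading of that step, combining a scaled dot-product edge-to-node score with a relaxed Bernoulli (binary concrete) sample so that pruning remains differentiable during training. The function names, the projections Wq and Wk, and the temperature tau are all assumptions, not the authors' definitions.

```python
import torch

def score_hyperedges(X, H, Wq, Wk):
    """Hypothetical node-edge cross-attention score for each hyperedge.
    X: (N, F) node features; H: (N, E) incidence matrix;
    Wq, Wk: (F, d) query/key projections (assumed)."""
    sizes = H.sum(dim=0).clamp(min=1.0)               # (E,) hyperedge sizes
    edge_feat = (H.T @ X) / sizes.unsqueeze(1)        # (E, F) mean of member nodes
    Q, K = edge_feat @ Wq, X @ Wk                     # (E, d) queries, (N, d) keys
    logits = (Q @ K.T) / K.shape[1] ** 0.5            # (E, N) edge-to-node compatibility
    # average compatibility over the nodes each hyperedge actually contains
    return (logits * H.T).sum(dim=1) / sizes          # (E,) relevance per hyperedge

def sample_hyperedges(scores, tau=0.5):
    """Relaxed Bernoulli (binary concrete) sampling of keep-probabilities,
    so pruning redundant hyperedges stays differentiable."""
    u = torch.rand_like(scores)
    logistic = torch.log(u) - torch.log1p(-u)         # Logistic(0, 1) noise
    return torch.sigmoid((scores + logistic) / tau)   # soft keep mask in (0, 1)

# Usage sketch: prune low-relevance hyperedges, then rebuild the incidence matrix.
N, E, F, d = 5, 3, 8, 4
X, H = torch.randn(N, F), torch.randint(0, 2, (N, E)).float()
Wq, Wk = torch.randn(F, d), torch.randn(F, d)
keep = sample_hyperedges(score_hyperedges(X, H, Wq, Wk))
H_pruned = H * (keep > 0.5).float()                   # hard mask at inference time
```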

CLC Number: TP391