Computer Science (计算机科学), 2022, Vol. 49, Issue 11A: 210900107-6. doi: 10.11896/jsjkx.210900107
HU Xin-rong, CHEN Zhi-heng, LIU Jun-ping, PENG Tao, YE Peng, ZHU Qiang
Abstract: When multimodal representations are learned against an overall loss, the reconstruction loss contributes relatively little to the model, so the hidden representations fail to effectively capture the details of their respective modalities. This paper proposes a multi-subspace sentiment analysis framework based on multimodal representation learning. First, each modality is projected into two distinct utterance representations, modality-invariant and modality-specific. Within the modality-invariant representation, a main shared subspace is constructed together with an auxiliary shared subspace that helps the main one reduce the modality gap; within the modality-specific representation, a private subspace is constructed to capture the features unique to each modality. The hidden vectors from all subspaces are then fed into a decoding function to reconstruct the modality vectors, thereby optimizing the reconstruction loss. Next, in the fusion stage, Transformer-based self-attention is applied to each modality representation, allowing each representation to draw latent information from the other cross-modal representations that act synergistically on the overall sentiment orientation. Finally, a joint vector is produced by concatenation and passed through a fully connected layer to generate the task prediction. Experimental results on two public datasets, MOSI and MOSEI, show that the framework outperforms baseline models on most evaluation metrics.
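To make the pipeline concrete, the following is a minimal PyTorch sketch of the architecture the abstract describes: per-modality projections, a main and an auxiliary shared subspace, one private subspace per modality, a decoder driving the reconstruction loss, Transformer self-attention over the stacked subspace representations, and a fully connected head over their concatenation. All module layouts, dimensions, and names (MultiSubspaceModel, the hidden size, the MSE reconstruction term) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the multi-subspace framework described in the abstract.
# All names, dimensions, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSubspaceModel(nn.Module):
    def __init__(self, dims, hidden=128, heads=4):
        super().__init__()
        self.modalities = list(dims)
        # Project each raw modality vector into a common hidden size.
        self.project = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        # Modality-invariant part: a main shared subspace plus an auxiliary
        # shared subspace that helps the main one reduce the modality gap.
        self.shared_main = nn.Linear(hidden, hidden)
        self.shared_aux = nn.Linear(hidden, hidden)
        # Modality-specific part: one private subspace per modality.
        self.private = nn.ModuleDict({m: nn.Linear(hidden, hidden) for m in dims})
        # Decoder reconstructs the projected modality vector from the sum of
        # its three subspace codes; its error is the reconstruction loss.
        self.decoder = nn.Linear(hidden, hidden)
        # Transformer self-attention over all stacked subspace representations,
        # so each one can attend to the other cross-modal representations.
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=1)
        # Concatenated joint vector -> fully connected prediction head.
        self.head = nn.Linear(hidden * 3 * len(dims), 1)

    def forward(self, inputs):
        recon_loss, reps = 0.0, []
        for m in self.modalities:
            x = torch.relu(self.project[m](inputs[m]))      # (B, hidden)
            h_main, h_aux = self.shared_main(x), self.shared_aux(x)
            h_priv = self.private[m](x)
            x_hat = self.decoder(h_main + h_aux + h_priv)   # reconstruction
            recon_loss = recon_loss + F.mse_loss(x_hat, x)
            reps += [h_main, h_aux, h_priv]
        fused = self.fusion(torch.stack(reps, dim=1))       # (B, 3*|M|, hidden)
        return self.head(fused.flatten(1)), recon_loss      # prediction, loss

# Usage with hypothetical MOSI-style feature sizes (text/audio/video):
model = MultiSubspaceModel({"text": 768, "audio": 74, "video": 47})
batch = {"text": torch.randn(8, 768),
         "audio": torch.randn(8, 74),
         "video": torch.randn(8, 47)}
pred, recon = model(batch)  # weight recon against the task loss when training
```

In this sketch the reconstruction target is the projected modality vector rather than the raw input, so the decoder and the subspace encoders share one hidden size; how the paper weights the reconstruction term against the task loss is not specified in the abstract and is left as a training-time hyperparameter here.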