Computer Science, 2019, Vol. 46, Issue (11): 284-290. doi: 10.11896/jsjkx.180901665

• Graphics, Image and Pattern Recognition •

  • Corresponding author: CHEN Ying (born 1978), female, associate professor; her main research interests include natural language processing, sentiment analysis and information extraction. E-mail: chenying@cau.edu.cn
  • About the authors: WU Liang-qing (born 1995), male, master's candidate, CCF student member; his main research interests include natural language processing and sentiment analysis. E-mail: lqwu@stu.suda.edu.cn. ZHANG Dong (born 1991), male, Ph.D. candidate; his main research interests include natural language processing and sentiment analysis. LI Shou-shan (born 1980), male, professor; his main research interests include natural language processing and sentiment analysis.
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61331011, 61375073).

Multi-modal Emotion Recognition Approach Based on Multi-task Learning

WU Liang-qing1, ZHANG Dong1, LI Shou-shan1, CHEN Ying2   

  1. School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China
  2. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
  • Received: 2018-09-06  Online: 2019-11-15  Published: 2019-11-14


Abstract: Emotion analysis is a fundamental task of natural language processing (NLP), and research on the single text modality is already rather mature. However, for multi-modal content such as videos, which comprises three modalities (text, visual and acoustic), the additional modal information makes emotion analysis more challenging. To improve the performance of emotion recognition on multi-modal emotion datasets, this paper proposes a neural network approach based on multi-task learning that simultaneously considers both intra-modality and inter-modality dynamics among the three modalities. Specifically, the three kinds of modality information are first preprocessed to extract the corresponding features. Secondly, a private bidirectional LSTM is constructed for each modality to capture the intra-modality dynamics. Then, shared bidirectional LSTMs are built to model the inter-modality dynamics, covering both bi-modal (text-visual, text-acoustic and visual-acoustic) and tri-modal interactions. Finally, the intra-modality and inter-modality dynamics obtained from these layers are fused, and the final emotion recognition results are produced through fully-connected layers and a Sigmoid layer. In uni-modal emotion recognition experiments, the proposed approach outperforms the state-of-the-art by 6.25%, 0.75% and 2.38% on average for the text, visual and acoustic modalities, respectively. In multi-modal emotion recognition, it achieves an average accuracy of 65.67%, a significant improvement over the other baselines.
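The architecture described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the hidden size, the number of emotion labels, the use of final hidden states as sequence summaries, and feeding concatenated raw features to the shared Bi-LSTMs are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalMultiTask(nn.Module):
    """Sketch of the abstract's design: a private Bi-LSTM per modality,
    shared Bi-LSTMs for each bi-modal pair and for the tri-modal
    combination, then fusion through fully-connected layers and a
    Sigmoid output. All dimensions are illustrative."""

    def __init__(self, dims, hidden=32, n_emotions=6):
        super().__init__()
        # dims: per-frame feature sizes for (text, visual, acoustic)
        self.private = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True, bidirectional=True) for d in dims)
        # shared Bi-LSTMs over the three bi-modal pairs and the tri-modal input
        pair_dims = [dims[0] + dims[1], dims[0] + dims[2], dims[1] + dims[2]]
        self.shared_bi = nn.ModuleList(
            nn.LSTM(d, hidden, batch_first=True, bidirectional=True) for d in pair_dims)
        self.shared_tri = nn.LSTM(sum(dims), hidden, batch_first=True, bidirectional=True)
        # fuse 3 private + 3 bi-modal + 1 tri-modal summaries, each of size 2*hidden
        self.fc = nn.Sequential(nn.Linear(7 * 2 * hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_emotions))

    @staticmethod
    def summarize(lstm, x):
        # use the final forward/backward hidden states as a sequence summary
        _, (h, _) = lstm(x)                        # h: (2, batch, hidden)
        return torch.cat([h[0], h[1]], dim=-1)     # (batch, 2*hidden)

    def forward(self, text, visual, acoustic):
        mods = [text, visual, acoustic]
        reps = [self.summarize(l, m) for l, m in zip(self.private, mods)]
        pairs = [torch.cat([mods[i], mods[j]], dim=-1) for i, j in [(0, 1), (0, 2), (1, 2)]]
        reps += [self.summarize(l, p) for l, p in zip(self.shared_bi, pairs)]
        reps.append(self.summarize(self.shared_tri, torch.cat(mods, dim=-1)))
        # one sigmoid score per emotion label (multi-label output)
        return torch.sigmoid(self.fc(torch.cat(reps, dim=-1)))
```

For example, with hypothetical GloVe/visual/COVAREP feature sizes of 300, 35 and 74, `MultiModalMultiTask((300, 35, 74))` applied to three aligned sequences of shape `(batch, seq_len, dim)` yields per-emotion scores of shape `(batch, 6)`.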

Key words: Emotion recognition, Multi-modal, Multi-task learning, Natural language processing

CLC number: TP391