Computer Science ›› 2019, Vol. 46 ›› Issue (11): 284-290. doi: 10.11896/jsjkx.180901665

• Graphics, Image & Pattern Recognition •

Multi-modal Emotion Recognition Approach Based on Multi-task Learning

WU Liang-qing1, ZHANG Dong1, LI Shou-shan1, CHEN Ying2

1. School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China
2. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
• Received: 2018-09-06  Online: 2019-11-15  Published: 2019-11-14

Abstract: Emotion analysis is a fundamental task in natural language processing (NLP), and research on the single (text) modality is relatively mature. However, for multi-modal content such as videos, which comprise three modalities (text, visual and acoustic), the additional modal information makes emotion analysis more challenging. To improve recognition performance on multi-modal emotion datasets, this paper proposes a neural network approach based on multi-task learning that simultaneously models both intra-modality and inter-modality dynamics among the three modalities. Specifically, the three kinds of modality information are first preprocessed to extract the corresponding features. Second, a private bidirectional LSTM is constructed for each modality to capture intra-modality dynamics. Then, shared bidirectional LSTMs are built to model inter-modality dynamics, covering bi-modal (text-visual, text-acoustic and visual-acoustic) and tri-modal interactions. Finally, the intra-modality and inter-modality dynamics obtained in the network are fused through fully-connected layers and a Sigmoid layer to produce the final emotion recognition results. In uni-modal emotion recognition experiments, the proposed approach outperforms the state of the art by 6.25%, 0.75% and 2.38% on average for the text, visual and acoustic modalities respectively. In addition, it achieves an average accuracy of 65.67% on multi-modal emotion recognition tasks, a significant improvement over the other baselines.
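The page carries no code, so the following is a minimal PyTorch sketch, under stated assumptions, of the pipeline the abstract describes: private BiLSTMs for intra-modality dynamics, shared BiLSTMs over concatenated features for bi-modal and tri-modal dynamics, and fusion through fully-connected layers and a Sigmoid output, read here as one independent binary task per emotion. The class name, feature dimensions, hidden sizes and six-emotion output are illustrative assumptions, not the authors' settings.

import torch
import torch.nn as nn

class MultiTaskMultiModal(nn.Module):
    """Sketch of the multi-task multi-modal architecture (assumed hyperparameters)."""
    def __init__(self, d_text=300, d_visual=35, d_acoustic=74, d_hid=64, n_emotions=6):
        super().__init__()
        # Private BiLSTMs: intra-modality dynamics, one per modality.
        self.lstm_t = nn.LSTM(d_text, d_hid, batch_first=True, bidirectional=True)
        self.lstm_v = nn.LSTM(d_visual, d_hid, batch_first=True, bidirectional=True)
        self.lstm_a = nn.LSTM(d_acoustic, d_hid, batch_first=True, bidirectional=True)
        # Shared BiLSTMs: bi-modal and tri-modal dynamics over concatenated inputs.
        self.lstm_tv = nn.LSTM(d_text + d_visual, d_hid, batch_first=True, bidirectional=True)
        self.lstm_ta = nn.LSTM(d_text + d_acoustic, d_hid, batch_first=True, bidirectional=True)
        self.lstm_va = nn.LSTM(d_visual + d_acoustic, d_hid, batch_first=True, bidirectional=True)
        self.lstm_tva = nn.LSTM(d_text + d_visual + d_acoustic, d_hid,
                                batch_first=True, bidirectional=True)
        # Fuse the seven dynamics vectors, then one sigmoid unit per emotion.
        self.fc = nn.Sequential(nn.Linear(7 * 2 * d_hid, 128), nn.ReLU(),
                                nn.Linear(128, n_emotions))

    @staticmethod
    def _last(out):
        # Final time step of a (batch, seq, 2 * d_hid) BiLSTM output sequence.
        return out[:, -1, :]

    def forward(self, t, v, a):
        # t, v, a: word-aligned sequences shaped (batch, seq_len, d_modality).
        feats = [self._last(self.lstm_t(t)[0]),
                 self._last(self.lstm_v(v)[0]),
                 self._last(self.lstm_a(a)[0]),
                 self._last(self.lstm_tv(torch.cat([t, v], -1))[0]),
                 self._last(self.lstm_ta(torch.cat([t, a], -1))[0]),
                 self._last(self.lstm_va(torch.cat([v, a], -1))[0]),
                 self._last(self.lstm_tva(torch.cat([t, v, a], -1))[0])]
        # Sigmoid yields an independent probability per emotion (multi-label).
        return torch.sigmoid(self.fc(torch.cat(feats, -1)))

# Example: a batch of 8 clips, 20 aligned time steps per modality.
model = MultiTaskMultiModal()
probs = model(torch.randn(8, 20, 300), torch.randn(8, 20, 35), torch.randn(8, 20, 74))
print(probs.shape)  # torch.Size([8, 6])

Training each sigmoid unit with binary cross-entropy against its emotion label is one plausible reading of the multi-task objective; the paper's exact task decomposition and loss weighting are not given on this page.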

Key words: Emotion recognition, Multi-modal, Multi-task learning, Natural language processing

CLC Number: TP391