Computer Science, 2022, Vol. 49, Issue (7): 132-141. doi: 10.11896/jsjkx.210100085

• Computer Graphics & Multimedia •

Head Fusion: A Method to Improve Accuracy and Robustness of Speech Emotion Recognition

XU Ming-ke1, ZHANG Fan2   

  1 School of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China
  2 IBM Watson Group, Littleton, Massachusetts 01460, USA
  • Received: 2021-01-11 Revised: 2021-05-24 Online: 2022-07-15 Published: 2022-07-12
  • About author: XU Ming-ke, born in 1996, postgraduate. His main research interests include speech recognition, among others.
    ZHANG Fan, born in 1983, Ph.D., is a member of China Computer Federation. His main research interests include cloud computing, big-data processing and artificial intelligence.

Abstract: Speech emotion recognition (SER) refers to the use of machines to recognize a speaker's emotions from speech. SER is an important part of human-computer interaction (HCI), but SER research still faces many problems, e.g., a lack of high-quality data, insufficient model accuracy, and little research in noisy environments. In this paper, we propose a method called Head Fusion, based on the multi-head attention mechanism, to improve the accuracy of SER. We implement an attention-based convolutional neural network (ACNN) model and conduct experiments on the interactive emotional dyadic motion capture (IEMOCAP) dataset. The accuracy is improved to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA). To the best of our knowledge, compared with the state-of-the-art result on this dataset (76.4% WA and 70.1% UA), we achieve a UA improvement of about 6% absolute while achieving a similar WA. Furthermore, we conduct empirical experiments by injecting 50 types of common noise into the speech data. We vary the noise intensity, time-shift the noise, and mix different noise types to identify their varied impacts on SER accuracy and to verify the robustness of our model. This work will also help researchers and engineers augment their training data with speech data containing appropriate types of noise, alleviating the problem of insufficient high-quality data.
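The abstract reports both WA and UA without defining how they differ. As the terms are commonly used in SER work, WA is the overall accuracy over all utterances, while UA is the mean of the per-class recalls, so that rare emotion classes count as much as frequent ones. The paper's own evaluation code is not shown on this page, so the following is only a minimal sketch under that assumption; the function names and the toy labels are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    total = defaultdict(int)  # samples per true class
    hit = defaultdict(int)    # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Hypothetical toy labels (not from the paper): 4 "neutral", 2 "angry".
y_true = ["neutral", "neutral", "neutral", "neutral", "angry", "angry"]
y_pred = ["neutral", "neutral", "neutral", "neutral", "neutral", "angry"]

wa = weighted_accuracy(y_true, y_pred)    # 5 of 6 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recall 4/4 and 1/2
```

In this toy case WA exceeds UA because the classifier does better on the majority class. When a dataset's classes are imbalanced, the two metrics can diverge in this way, which is why a gain in UA at a similar WA is reported separately.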

Key words: Attention mechanism, Convolutional neural network, Noisy speech, Speech emotion recognition, Speech recognition

CLC Number: 

  • TP391.4