Computer Science ›› 2022, Vol. 49 ›› Issue (7): 132-141. doi: 10.11896/jsjkx.210100085

• Computer Graphics & Multimedia •

Head Fusion: A Method to Improve Accuracy and Robustness of Speech Emotion Recognition

XU Ming-ke1, ZHANG Fan2

  1. 1 School of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China
    2 IBM Watson Group, Littleton, Massachusetts 01460, USA
  • Received: 2021-01-11 Revised: 2021-05-24 Online: 2022-07-15 Published: 2022-07-12
  • Corresponding author: ZHANG Fan (fzhang@us.ibm.com)
  • About author: XU Ming-ke (mingkexu@njtech.edu.cn), born in 1996, postgraduate. His main research interests include speech recognition, among others.
    ZHANG Fan, born in 1983, Ph.D, is a member of China Computer Federation. His main research interests include cloud computing, big-data processing and artificial intelligence.

Abstract: Speech emotion recognition (SER) refers to the use of machines to recognize a speaker's emotions from speech. SER is an important part of human-computer interaction (HCI), but SER research still faces many problems, e.g., a lack of high-quality data, insufficient model accuracy, and little research under noisy conditions. In this paper, we propose a method called Head Fusion, based on the multi-head attention mechanism, to improve the accuracy of SER. We implement an attention-based convolutional neural network (ACNN) model and conduct experiments on the interactive emotional dyadic motion capture (IEMOCAP) data set. The accuracy is improved to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA). To the best of our knowledge, compared with the state-of-the-art result on this data set (76.4% WA and 70.1% UA), we achieve a UA improvement of about 6% absolute while achieving a similar WA. Furthermore, we conduct empirical experiments on speech data injected with 50 types of common noises. We alter the noise intensity, time-shift the noises, and mix different noise types to identify their varied impacts on SER accuracy and to verify the robustness of our model. This work will also help researchers and engineers augment their training data with speech data containing appropriate types of noise, alleviating the problem of insufficient high-quality data.

Key words: Attention mechanism, Convolutional neural network, Noisy speech, Speech emotion recognition, Speech recognition
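
The abstract reports both weighted accuracy (WA) and unweighted accuracy (UA) without defining them. A minimal sketch of the two metrics follows, assuming the convention common in the SER literature: WA is the overall fraction of correctly classified utterances, and UA is the per-class recall averaged with equal weight for each class. The function names and the toy labels are illustrative and are not taken from the paper.

```python
from collections import defaultdict

def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """UA: recall computed per class, then averaged with equal class weight."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

# Made-up labels: class 0 is frequent, class 1 is rare.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

wa = weighted_accuracy(y_true, y_pred)    # 9 of 10 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 8/8 and 1/2
```

On imbalanced data the two can diverge noticeably, since a model that favors frequent classes keeps WA high while UA drops. This is one reason a UA gain at a similar WA, as reported above, is informative.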


  • CLC number: TP391.4
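
The noise experiments summarized in the abstract vary noise intensity, shift the noise in time, and mix noise types. The sketch below shows one plausible way to do the first two, assuming intensity is controlled through a target signal-to-noise ratio (SNR) in dB and the time shift is circular. The helper names `rms` and `mix_at_snr`, the SNR parameterization, and the synthetic signals are all assumptions for illustration, not the authors' procedure.

```python
import math
import random

def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def mix_at_snr(speech, noise, snr_db, shift=0):
    """Add noise to speech at a target SNR (dB), after a circular time shift.

    Assumes the noise is not silent; it is tiled to cover the speech length.
    """
    n = len(speech)
    reps = -(-n // len(noise))            # ceiling division
    tiled = list(noise) * reps
    shift %= len(tiled)
    if shift:
        tiled = tiled[-shift:] + tiled[:-shift]  # delay noise by `shift` samples
    segment = tiled[:n]
    noise_rms = rms(segment)
    if noise_rms == 0:
        raise ValueError("noise segment is silent")
    # Choose gain so that rms(speech) / rms(gain * noise) = 10 ** (snr_db / 20).
    gain = rms(speech) / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * v for s, v in zip(speech, segment)]

# Synthetic stand-ins for real recordings (hypothetical data).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]

noisy = mix_at_snr(speech, noise, snr_db=10, shift=1000)
residual = [m - s for m, s in zip(noisy, speech)]
achieved_snr = 20 * math.log10(rms(speech) / rms(residual))
```

Under the same assumptions, two noise types could be combined by summing their samples element-wise before calling `mix_at_snr`, so that the target SNR applies to the combined noise.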