Computer Science ›› 2024, Vol. 51 ›› Issue (4): 262-269. doi: 10.11896/jsjkx.230200063

• Computer Graphics & Multimedia •

  • Corresponding author: ZHANG Zhaohui (zhazhang@dhu.edu.cn)

Speech Emotion Recognition Based on Voice Rhythm Differences

ZHANG Jiahao, ZHANG Zhaohui, YAN Qi, WANG Pengwei   

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2023-02-09 Revised: 2023-04-26 Online: 2024-04-15 Published: 2024-04-10
  • Supported by:
    Shanghai Science and Technology Innovation Action High-tech Field Project(22511100700).


Abstract: Speech emotion recognition has important application prospects in financial anti-fraud and other fields, but further improving its accuracy has become increasingly difficult. Existing spectrogram-based methods struggle to capture rhythm-difference features, which limits recognition performance. Starting from the differences in speech rhythm features, this paper proposes a speech emotion recognition method based on time-frequency fusion of energy frames. The key idea is to screen the spectrum for high-energy regions of the speech, so that the distribution and time-frequency variation of high-energy speech frames reflect individual differences in speech rhythm. On this basis, an emotion recognition model based on a convolutional neural network (CNN) and a recurrent neural network (RNN) is built to extract and fuse the time-domain and frequency-domain variation features of the spectrum. Experiments on the public IEMOCAP dataset show that, compared with the spectrogram-based method, the proposed rhythm-difference-based method improves the weighted accuracy (WA) and unweighted accuracy (UA) by 1.05% and 1.9% on average, respectively. The results also show that individual speech rhythm differences play an important role in improving speech emotion recognition.
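As a rough illustration of the energy-frame screening step described in the abstract (not the authors' code), a minimal NumPy sketch is given below: frames whose energy falls below a threshold are discarded, so the surviving frame mask carries the timing (rhythm) information while the magnitude spectra of the kept frames carry the frequency information. All function names, frame sizes, and the quantile threshold are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def high_energy_frames(signal, frame_len=400, hop=160, keep_ratio=0.5):
    """Split a waveform into frames, compute per-frame energy, and keep
    only the highest-energy frames.

    Hypothetical parameters: 25 ms frames / 10 ms hop at 16 kHz and a
    50% keep ratio are illustrative choices, not the paper's settings.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Per-frame energy; the boolean mask over frames stands in for the
    # rhythm (timing) information, the spectra for the frequency content.
    energy = np.sum(frames ** 2, axis=1)
    threshold = np.quantile(energy, 1.0 - keep_ratio)
    keep = energy >= threshold
    spectra = np.abs(np.fft.rfft(frames[keep], axis=1))
    return keep, spectra

# Example: a tone followed by near-silence; the quiet frames are dropped.
rng = np.random.default_rng(0)
sig = np.concatenate([np.sin(2 * np.pi * 440 * np.arange(1600) / 16000),
                      0.001 * rng.standard_normal(1600)])
mask, spec = high_energy_frames(sig)
```

In the paper's pipeline, the kept frames' spectra would then feed a CNN (frequency-domain features) and an RNN (time-domain evolution) before fusion; that modeling stage is omitted here.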

Key words: Speech emotion recognition, Energy frames, Spectrum, Time-frequency fusion, Voice rhythm difference

CLC number: TP301