Computer Science ›› 2023, Vol. 50 ›› Issue (6A): 220800211-7. doi: 10.11896/jsjkx.220800211

• Artificial Intelligence •

Speech Emotion Recognition Based on Improved MFCC and Parallel Hybrid Model

CUI Lin1,2, CUI Chenlu1, LIU Zhengwei1, XUE Kai1

  1 School of Electronic Information, Xi'an Polytechnic University, Xi'an 710600, China;
    2 School of Navigation, Northwestern Polytechnical University, Xi'an 710072, China
  • Online: 2023-06-10 Published: 2023-06-12
  • Corresponding author: CUI Chenlu (1017071182@qq.com)
  • About author: (cuilin789@163.com)
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (61901347)

Speech Emotion Recognition Based on Improved MFCC and Parallel Hybrid Model

CUI Lin1,2, CUI Chenlu1, LIU Zhengwei1, XUE Kai1   

  1 School of Electronic Information, Xi'an Polytechnic University, Xi'an 710600, China;
    2 School of Navigation, Northwestern Polytechnical University, Xi'an 710072, China
  • Online:2023-06-10 Published:2023-06-12
  • About author: CUI Lin, born in 1984, Ph.D, associate professor. Her main research interests include array signal processing and speech signal processing. CUI Chenlu, born in 1997, master. Her main research interest is speech emotion recognition.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (61901347).

Abstract: Traditional MFCC not only ignores the influence of the pitch frequency in voiced signals but also cannot characterize the dynamic features of speech. A moving average filter is therefore used to filter out the pitch frequency of the voiced signal, and after the static MFCC features are extracted, their first-order and second-order differences are computed to obtain dynamic features. The resulting features are fed into the model for training. To build a more efficient speech emotion recognition model, a parallel hybrid model fused with a multi-head attention mechanism is constructed: multi-head attention not only effectively prevents gradient vanishing so that a deeper network can be built, but also lets each attention head perform a different task to improve accuracy. Finally, the emotion features are classified. Because the intra-class distance may grow under the traditional softmax, degrading model confidence, the center loss function is introduced and combined with softmax for classification. Experimental results show that the proposed method achieves accuracies of 98.15% on the RAVDESS dataset and 96.26% on the EMO-DB dataset.
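A minimal sketch of the feature pipeline described above, assuming the moving average filter is applied along the frequency axis of each frame's magnitude spectrum to smooth out pitch-harmonic ripple before mel filtering; the abstract does not specify the exact placement or window length, so `smooth_bins` is a hypothetical parameter, and the remaining steps use standard librosa/scipy routines.

```python
import numpy as np
import librosa
from scipy.fftpack import dct
from scipy.ndimage import uniform_filter1d

def improved_mfcc(path, n_fft=512, hop=160, n_mels=26, n_mfcc=13, smooth_bins=5):
    """Static MFCC on a moving-average-smoothed spectrum, plus delta features.

    smooth_bins is a hypothetical window size; the paper does not report
    the filter length or where in the pipeline the filter is applied.
    """
    y, sr = librosa.load(path, sr=16000)
    # Magnitude spectrogram, shape (1 + n_fft//2, n_frames).
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # Moving average along the frequency axis: smooths the fine ripple that
    # pitch harmonics leave in voiced frames, approximating "filtering out"
    # the pitch frequency before mel analysis (assumed placement).
    mag_smooth = uniform_filter1d(mag, size=smooth_bins, axis=0)
    # Standard MFCC steps: mel filterbank -> log -> DCT.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(mel_fb @ (mag_smooth ** 2) + 1e-10)
    mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]
    # Dynamic features: first- and second-order differences.
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, d1, d2], axis=0)  # (3 * n_mfcc, n_frames)
```

With n_mfcc = 13, stacking the static coefficients and both difference orders yields a 39-dimensional frame-level feature.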

Key words: Speech emotion recognition, MFCC, Multi-head attention mechanism, Moving average filter, softmax

Abstract: The traditional MFCC not only ignores the influence of the pitch frequency in voiced signals, but also cannot characterize the dynamic features of speech. Therefore, a moving average filter is used to filter out the pitch frequency of the voiced signal, and after the static MFCC features are extracted, dynamic features are obtained by computing their first-order and second-order differences. The resulting features are fed into the model for training. To build a more efficient speech emotion recognition model, a parallel hybrid model integrating a multi-head attention mechanism is constructed. The multi-head attention mechanism not only effectively prevents gradient vanishing so that a deeper network can be built, but also allows each attention head to perform a different task to improve accuracy. Finally, when classifying emotion features, the traditional softmax may allow the intra-class distance to grow, resulting in poor model confidence, so the center loss function is introduced and combined with softmax for classification. Experimental results show that the accuracy of the proposed method reaches 98.15% and 96.26% on the RAVDESS and EMO-DB datasets, respectively.
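The parallel hybrid model is not fully specified in this abstract, so the sketch below is only one plausible reading: a 1-D convolutional branch and a BiLSTM branch run in parallel over the frame-level features, their outputs are concatenated, and torch.nn.MultiheadAttention fuses the sequence before classification. Branch types, layer sizes, and the number of heads are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn

class ParallelHybridSER(nn.Module):
    """Illustrative parallel hybrid model: CNN branch + BiLSTM branch,
    fused by multi-head self-attention. Layer sizes are assumed, not
    taken from the paper."""

    def __init__(self, feat_dim=39, num_classes=8, num_heads=4):
        super().__init__()
        # Branch 1: 1-D convolution over time for local spectral patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
        )
        # Branch 2: BiLSTM for longer-range temporal context.
        self.lstm = nn.LSTM(feat_dim, 32, batch_first=True, bidirectional=True)
        # Multi-head self-attention over the fused sequence (64 + 64 = 128 dims).
        self.attn = nn.MultiheadAttention(embed_dim=128, num_heads=num_heads,
                                          batch_first=True)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim), e.g. improved MFCC + deltas per frame.
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, 64)
        r, _ = self.lstm(x)                                 # (batch, time, 64)
        h = torch.cat([c, r], dim=-1)                       # (batch, time, 128)
        a, _ = self.attn(h, h, h)                           # self-attention
        emb = (h + a).mean(dim=1)                           # residual + mean pool
        return self.classifier(emb), emb                    # logits, embedding
```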

Key words: Speech emotion recognition, MFCC, Multi-head attention mechanism, Moving average filter, softmax
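For the joint classification objective, a common formulation is L = L_softmax + λ·L_center with L_center = (1/2)·Σ_i ||x_i − c_{y_i}||², which pulls each embedding x_i toward its class center c_{y_i} and thereby shrinks the intra-class distance that plain softmax leaves unconstrained. The sketch below uses this standard center-loss form; the weight λ and the feature dimension are illustrative assumptions, and `emb` is the pooled embedding returned by the model sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Standard center loss: mean squared distance between each embedding
    and its (learnable) class center."""

    def __init__(self, num_classes=8, feat_dim=128):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, emb, labels):
        # emb: (batch, feat_dim), labels: (batch,) class indices.
        return 0.5 * ((emb - self.centers[labels]) ** 2).sum(dim=1).mean()

# Joint objective: cross-entropy (softmax) plus weighted center loss.
# lam = 0.01 is an illustrative weight, not the value used in the paper.
def joint_loss(logits, emb, labels, center_loss, lam=0.01):
    return F.cross_entropy(logits, labels) + lam * center_loss(emb, labels)
```

In this simplified form the class centers are ordinary learnable parameters, updated by the optimizer alongside the network weights.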

CLC Number: 

  • TP183