Computer Science ›› 2023, Vol. 50 ›› Issue (6A): 220800211-7. doi: 10.11896/jsjkx.220800211

• Artificial Intelligence •

Speech Emotion Recognition Based on Improved MFCC and Parallel Hybrid Model

CUI Lin1,2, CUI Chenlu1, LIU Zhengwei1, XUE Kai1   

  1. School of Electronic Information, Xi'an Polytechnic University, Xi'an 710600, China;
  2. School of Navigation, Northwestern Polytechnical University, Xi'an 710072, China
  • Online: 2023-06-10  Published: 2023-06-12
  • About author: CUI Lin, born in 1984, Ph.D, associate professor. Her main research interests include array signal processing and speech signal processing. CUI Chenlu, born in 1997, master. Her main research interest is speech emotion recognition.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (61901347).

Abstract: Traditional MFCC features not only ignore the influence of the pitch frequency in voiced speech, but also fail to characterize the dynamic properties of the signal. A moving average filter is therefore applied to filter out the pitch frequency of the voiced signal. After the static MFCC features are extracted, dynamic features are obtained by computing their first-order and second-order differences, and the resulting features are fed into the model for training. To build a more effective speech emotion recognition model, a parallel hybrid model integrating a multi-head attention mechanism is constructed. The multi-head attention mechanism not only helps to alleviate the vanishing-gradient problem that arises when building deeper networks, but also lets different heads attend to different aspects of the input in parallel, improving accuracy. Finally, when classifying emotional features, the traditional softmax loss may enlarge the intra-class distance, leading to low confidence in the model's predictions; the center loss function is therefore introduced and combined with softmax for classification. Experimental results show that the proposed method achieves accuracies of 98.15% and 96.26% on the RAVDESS and EMO-DB datasets, respectively.
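To make the feature-extraction step concrete, the following is a minimal Python sketch of the improved-MFCC pipeline described above. The paper does not specify where the moving average filter is applied or its window length, so this sketch assumes it smooths the magnitude spectrum along the frequency axis to suppress pitch harmonics before the mel filterbank; `win`, `n_fft`, and `hop_length` are illustrative hyperparameters, not values from the paper.

```python
import numpy as np
import librosa

def improved_mfcc(y: np.ndarray, sr: int, n_mfcc: int = 13, win: int = 5) -> np.ndarray:
    """Static MFCCs plus first- and second-order differences, computed
    from a spectrum smoothed by a moving average filter."""
    # Magnitude spectrum of each frame
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
    # Moving-average filter along the frequency axis; assumed placement,
    # intended to smooth out pitch harmonics before the mel filterbank
    kernel = np.ones(win) / win
    S = np.apply_along_axis(lambda f: np.convolve(f, kernel, mode="same"), 0, S)
    # Log-mel spectrogram from the smoothed power spectrum
    mel = librosa.feature.melspectrogram(S=S ** 2, sr=sr)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=n_mfcc)
    # Dynamic features: first- and second-order differences
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, d1, d2], axis=0)  # shape: (3 * n_mfcc, frames)

# Usage: y, sr = librosa.load("speech.wav", sr=16000); feats = improved_mfcc(y, sr)
```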
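Likewise, a hedged PyTorch sketch of combining the softmax (cross-entropy) loss with the center loss for classification. The class count, feature dimension, and weighting factor `lam` are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss: mean squared distance between each feature vector
    and a learnable center for its class (Wen et al., 2016)."""
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # One learnable center per emotion class
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Distance of each sample to the center of its own class
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

# Joint objective: L = L_softmax + lambda * L_center
ce = nn.CrossEntropyLoss()                        # the "softmax" term
center = CenterLoss(num_classes=8, feat_dim=128)  # e.g. 8 RAVDESS emotion classes
lam = 0.01                                        # assumed weighting factor

def total_loss(logits, feats, labels):
    return ce(logits, labels) + lam * center(feats, labels)
```

The design intuition is that cross-entropy alone only separates classes, while the center-loss term pulls features of the same class toward a shared center, shrinking the intra-class distance that the abstract identifies as the weakness of plain softmax.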

Key words: Speech emotion recognition, MFCC, Multi-head attention mechanism, Moving average filter, Softmax

CLC Number: TP183