Computer Science ›› 2020, Vol. 47 ›› Issue (5): 110-119. doi: 10.11896/jsjkx.190400122

• Computer Graphics & Multimedia •

Survey of Acoustic Feature Extraction in Speech Tasks

ZHENG Chun-jun1,2, WANG Chun-li1, JIA Ning2   

  1. College of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning 116023, China
    2. School of Computer & Software, Dalian Neusoft University of Information, Dalian, Liaoning 116023, China
  • Received:2019-04-22 Online:2020-05-15 Published:2020-05-19
  • About author: ZHENG Chun-jun, born in 1976, master, associate professor, is a member of China Computer Federation. His main research interests include speech emotion recognition, deep learning and big data analysis.
  • Supported by:
    This work was supported by Liaoning Natural Science Foundation (20180551068).

Abstract: Speech is an important means of information transmission, and people often use it as a medium for exchanging information. The acoustic signal of speech carries a large amount of speaker information and semantic information as well as rich emotional information. Accordingly, three distinct speech tasks have emerged: speaker recognition (SR), automatic speech recognition (ASR) and speech emotion recognition (SER). Each of the three tasks relies on its own techniques and specific methods for information extraction and model design. This paper first reviews the early development of the three tasks in China and abroad and divides the development of speech tasks into four stages. It then summarizes the acoustic features shared by the three tasks and explains the emphasis of each type of feature. Next, since the wide application of deep learning in recent years has driven rapid progress in speech tasks, the paper analyzes how currently popular deep learning models are applied to acoustic modeling for each task, and summarizes the acoustic feature extraction methods and technical routes of the three tasks from two perspectives, supervised and unsupervised. In addition, a multi-channel fusion model based on an attention mechanism is proposed for feature extraction. To address the three speech tasks simultaneously, a personalized multi-task model for speech feature extraction is proposed, together with a multi-channel cooperative network model. With this design idea, the accuracy of multi-task feature extraction can be improved.
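The attention-based multi-channel fusion idea mentioned in the abstract can be illustrated with a short sketch. The following PyTorch snippet is a minimal, hypothetical example (an assumption for illustration, not the authors' implementation): each channel branch, for example embeddings derived from MFCCs, spectrograms and the raw waveform, yields a fixed-dimensional feature vector; a learned layer scores each channel, and the fused representation is the attention-weighted sum of the channel embeddings. The class and dimension names are invented for this sketch.

# Minimal sketch (assumption, not the paper's implementation) of
# attention-based multi-channel fusion of acoustic features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionFusion(nn.Module):
    """Fuse per-channel acoustic embeddings with learned attention weights."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one scalar score per channel

    def forward(self, channel_feats: torch.Tensor) -> torch.Tensor:
        # channel_feats: (batch, num_channels, feat_dim), e.g. embeddings from
        # MFCC, spectrogram and raw-waveform branches of the same utterance.
        scores = self.score(channel_feats)            # (batch, num_channels, 1)
        weights = F.softmax(scores, dim=1)            # attention over channels
        return (weights * channel_feats).sum(dim=1)   # (batch, feat_dim)

if __name__ == "__main__":
    feats = torch.randn(4, 3, 128)                    # 4 utterances, 3 channels
    fusion = ChannelAttentionFusion(128)
    print(fusion(feats).shape)                        # torch.Size([4, 128])

The fused vector can then feed task-specific output heads (SR, ASR, SER), which is one simple way to realize the multi-task, multi-channel design idea described in the abstract.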

Key words: Acoustic feature extraction, Automatic speech recognition, Deep learning, Multi-channel fusion, Speaker recognition, Speech emotion recognition

CLC Number: TP183