Computer Science ›› 2020, Vol. 47 ›› Issue (5): 110-119. doi: 10.11896/jsjkx.190400122

• Computer Graphics & Multimedia •

Survey of Acoustic Feature Extraction in Speech Tasks

ZHENG Chun-jun1,2, WANG Chun-li1, JIA Ning2   

  1 College of Information Science and Technology,Dalian Maritime University,Dalian,Liaoning 116023,China
    2 School of Computer & Software,Dalian Neusoft University of Information,Dalian,Liaoning 116023,China
  • Received:2019-04-22 Online:2020-05-15 Published:2020-05-19
  • Corresponding author: ZHENG Chun-jun (zhengchunjun@neusoft.edu.cn)
  • About author:ZHENG Chun-jun,born in 1976,master,associate professor,is a member of China Computer Federation.His main research interests include speech emotion recognition,deep learning and big data analysis.
  • Supported by:
    This work was supported by Liaoning Natural Science Foundation (20180551068).

Abstract: Speech is an important means of transmitting and exchanging information, and people routinely use it as a communication medium. The acoustic signal of speech carries a large amount of speaker information, semantic information and rich emotional information, which has given rise to three distinct directions of speech tasks: speaker recognition (SR), automatic speech recognition (ASR) and speech emotion recognition (SER). Each task uses different techniques and specific methods for information extraction and model design in its own field. This paper first reviews the early development of the three tasks in China and abroad, dividing the development of speech tasks into four stages, and summarizes the common acoustic features that the three tasks rely on during feature extraction, explaining the focus of each type of feature. Then, given the wide application of deep learning in various fields in recent years, which has greatly advanced speech tasks, the paper analyzes how currently popular deep learning models are applied to acoustic modeling, and summarizes the acoustic feature extraction methods and technical routes for the three tasks along two lines, supervised and unsupervised, together with multi-channel models fused with attention mechanisms for speech feature extraction. To handle speech recognition, speaker recognition and emotion recognition simultaneously, a multi-task Tandem model targeting the personalized characteristics of the acoustic signal is proposed; in addition, a multi-channel cooperative network model is proposed, a design that can improve the accuracy of multi-task feature extraction.
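The common acoustic features surveyed above (e.g., MFCCs and log-Mel spectrograms with their deltas) can be computed with standard toolkits. The following is a minimal sketch, assuming the Python library librosa is available; the file name, sampling rate and frame parameters are illustrative choices, not values prescribed by the paper.

```python
# Sketch of frame-level acoustic feature extraction for SR/ASR/SER front ends.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Return stacked MFCC+delta features and a log-Mel spectrogram."""
    y, sr = librosa.load(wav_path, sr=sr)              # resample to 16 kHz
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)      # 25 ms window, 10 ms hop
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                         n_fft=n_fft, hop_length=hop)
    log_mel = librosa.power_to_db(mel)                 # log compression
    delta = librosa.feature.delta(mfcc)                # local temporal dynamics
    return np.vstack([mfcc, delta]), log_mel

# "utterance.wav" is a hypothetical input file
features, log_mel = extract_features("utterance.wav")
```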

Key words: Acoustic feature extraction, Automatic speech recognition, Deep learning, Multi-channel fusion, Speaker recognition, Speech emotion recognition
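The multi-channel attention fusion and multi-task Tandem designs are only described at a high level in this abstract; the PyTorch sketch below is one plausible reading, assuming a per-channel recurrent encoder, attention pooling over frames, and separate heads for the ASR/SR/SER tasks. All layer sizes, names and the utterance-level ASR head are hypothetical simplifications, not the paper's exact architecture.

```python
# Hedged sketch of a multi-channel, attention-pooled, multi-task network.
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    """Weight frames by a learned attention score, then sum (attention pooling)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                       # x: (batch, frames, dim)
        w = torch.softmax(self.score(x), dim=1)
        return (w * x).sum(dim=1)               # (batch, dim)

class MultiTaskNet(nn.Module):
    """Two feature 'channels' (e.g., MFCC and log-Mel) fused into one embedding."""
    def __init__(self, feat_dim=40, hidden=128,
                 n_phones=50, n_speakers=100, n_emotions=4):
        super().__init__()
        self.enc_a = nn.GRU(feat_dim, hidden, batch_first=True)
        self.enc_b = nn.GRU(feat_dim, hidden, batch_first=True)
        self.pool_a = AttentivePool(hidden)
        self.pool_b = AttentivePool(hidden)
        # One head per task over the fused utterance embedding; a real ASR
        # head would operate frame by frame, pooled here only for brevity.
        self.asr_head = nn.Linear(2 * hidden, n_phones)
        self.spk_head = nn.Linear(2 * hidden, n_speakers)
        self.emo_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, chan_a, chan_b):          # each: (batch, frames, feat_dim)
        ha, _ = self.enc_a(chan_a)
        hb, _ = self.enc_b(chan_b)
        fused = torch.cat([self.pool_a(ha), self.pool_b(hb)], dim=-1)
        return self.asr_head(fused), self.spk_head(fused), self.emo_head(fused)

net = MultiTaskNet()
x = torch.randn(2, 200, 40)                     # dummy batch: 2 utterances, 200 frames
asr_out, spk_out, emo_out = net(x, x)
```

In a Tandem-style setup, the fused embedding or the heads' posteriors could in turn serve as input features to downstream task-specific models; the survey's point is that sharing the front end lets the three tasks reinforce each other's feature extraction.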

CLC Number: TP183

References
[1]ZHANG S,ZHANG S,HUANG T,et al.Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching[J].IEEE Transactions on Multimedia,2017,20(6):1576-1590.
[2]RICHARDSON F,REYNOLDS D,DEHAK N.A Unified Deep Neural Network for Speaker and Language Recognition[J].arXiv:1504.00923.
[3]KANAGASUNDARAM A,DEAN D,SRIDHARAN S,et al.DNN based Speaker Recognition on Short Utterances[J].arXiv:1610.03190.
[4]LEE J,LEE M,CHANG J H.Ensemble of Jointly Trained Deep Neural Network-Based Acoustic Models for Reverberant Speech Recognition[J].arXiv:1608.04983.
[5]TANG Z,LI L,WANG D.Multi-task Recurrent Model for Speech and Speaker Recognition[C]//2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).IEEE,2016.
[6]CHU W,CHEN R.Speaker Cluster-Based Speaker Adaptive Training for Deep Neural Network Acoustic Modeling[C]//ICASSP 2016.IEEE,2016.
[7]GHAHABI O,HERNANDO J.Deep Learning for Single and Multi-Session i-Vector Speaker Recognition[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2017,25(4).
[8]JIN Q,CHEN S Z,LI X R,et al.Speech emotion recognition based on acoustic characteristics [J].Computer Science,2015,42(9):24-28.
[9]WANG W,YANG L P,WEI L,et al.Extraction and Analysis of Speech Emotion Characteristics[J].Research and Exploration in Laboratory,2013,32(7):91-94,191.
[10]YANG M H,TAO J H,LI H,et al.Nature Multimodal Human-Computer-Interaction Dialog System[J].Computer Science,2014,41(10):12-18,35.
[11]RAMANARAYANAN V,PUGH R,QIAN Y,et al.Automatic Turn-Level Language Identification for Code-Switched Spanish-English Dialog[C]//9th International Workshop on Spoken Dialogue System Technology.2019:51-61.
[12]DELLAERT F,POLZIN T,WAIBEL A.Recognizing emotion in speech[C]//International Conference on Spoken Language.1996.
[13]AHMAD J,FIAZ M,KWON S I,et al.Gender Identification using MFCC for Telephone Applications-A Comparative Study[J].International Journal of Computer Science and Electronics Engineering,2015,3(5):351-355.
[14]BANDELA S R,KUMAR T K.Stressed speech emotion recognition using feature fusion of teager energy operator and MFCC[C]//International Conference on Computing.IEEE Computer Society,2017.
[15]ZHAO W,GAO Y,SINGH R,et al.Speaker identification from the sound of the human breath[J].arXiv:1712.00171v2.
[16]DENG L.A tutorial survey of architectures,algorithms,and applications for deep learning[J].Apsipa Transactions on Signal & Information Processing,2014,3.
[17]VARIANI E,LEI X,MCDERMOTT E,et al.Deep neural networks for small footprint text-dependent speaker verification[C]//2014 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2014.
[18]HANNUN A,CASE C,CASPER J,et al.Deep Speech:Scaling up end-to-end speech recognition[J].arXiv:1412.5567.
[19]AMODEI D,ANUBHAI R,BATTENBERG E,et al.Deep Speech 2:End-to-End Speech Recognition in English and Mandarin[J].arXiv:1512.02595.
[20]SATT A,ROZENBERG S,HOORY R.Efficient emotion recognition from speech using deep learning on spectrograms[C]//Proc.Interspeech 2017.2017:1089-1093.
[21]EYBEN F,SCHERER K R,TRUONG K P,et al.The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing[J].IEEE Transactions on Affective Computing,2016,7(2):190-202.
[22]MULIMANI M,KOOLAGUDI S.Robust Acoustic Event Classification using Bag-of-Visual-Words[C]//Proc.Interspeech.2018:3319-3322.
[23]LI L,WANG D,ZHENG T F.System Combination for Short Utterance Speaker Recognition[C]//Signal & Information Processing Association Summit & Conference.IEEE,2016.
[24]ZHANG M,CHEN Y,LI L,et al.Speaker Recognition with Cough,Laugh and “Wei”[J].arXiv:1706.07860.
[25]LI L,WANG D,ZHANG Z,et al.Deep Speaker Vectors for Semi Text-independent Speaker Verification[J].arXiv:1505.06427.
[26]LU L.Sequence Training and Adaptation of Highway Deep Neural Networks[C]//2016 IEEE Spoken Language Technology Workshop (SLT).2016.
[27]HAN K,YU D,TASHEV I.Speech emotion recognition using deep neural network and extreme learning machine[C]//Fifteenth Annual Conference of the International Speech Communication Association.2014:223-227.
[28]MAO Q,DONG M,HUANG Z,et al.Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks[J].IEEE Transactions on Multimedia,2014,16(8):2203-2213.
[29]SARMA M,GHAHREMANI P,POVEY D.Emotion Identification from raw speech signals using DNNs[C]//Interspeech.2018:3097-3101.
[30]PALAZ D,COLLOBERT R,et al.Analysis of CNN-based speech recognition system using raw speech as input[C]//Proceedings of Interspeech.2015:11-15.
[31]SAINATH T,PARADA C.Convolutional neural networks for small-footprint keyword spotting[C]//Proceedings of Interspeech.2015:1478-1482.
[32]CHEN L,LEE C M.Predicting Audience's Laughter Using Convolutional Neural Network [J].arXiv:1702.02584.
[33]CHAN W,LANE I.Deep convolutional neural networks for acoustic modeling in low resource languages[C]//2015 IEEE International Conference on Acoustics,Speech and Signal Proces-sing.2015:2056-2060.
[34]HUANG Y L,LUO X X,LIU D R.Local Finite Weight Sharing of MFSC Coefficients Based CNN Speech Recognition[J].Control Engineering of China,2017,24(7):1507-1513.
[35]ALDENEH Z,PROVOST E M.Using regional saliency for speech emotion recognition[C]//IEEE International Conference on Acoustics.IEEE,2017.
[36]KHORRAM S,JAISWAL M,GIDEON J,et al.The PRIORI Emotion Dataset:Linking Mood to Emotion Detected In-the-Wild[C]//Interspeech 2018.2018:1903-1907.
[37]HUANG C W,NARAYANAN S.Shaking acoustic spectral sub-bands can better regularize learning in affective computing[C]//ICASSP 2018-2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018.
[38]ZHENG W Q,YU J S,ZOU Y X.An experimental study of speech emotion recognition based on deep convolutional neural networks[C]//2015 International Conference on Affective Computing and Intelligent Interaction (ACII).IEEE Computer Society,2015.
[39]NIU Y,ZOU D,NIU Y,et al.A breakthrough in Speech emotion recognition using Deep Retinal Convolution Neural Networks[J].arXiv:1707.09917.
[40]SWIETOJANSKI P,RENALS S.Differentiable Pooling for Unsupervised Acoustic Model Adaptation[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2016,24(10):1773-1784.
[41]WANG D,ZHENG T F.Transfer learning for speech and language processing[C]//Proceedings of APSIPA Annual Summit and Conference.APSIPA,2015.
[42]HUANG J T,LI J,YU D,et al.Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers[C]//Proceedings of IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2013:7304-7308.
[43]ZHONG G,LIN X,CHEN K.Long Short-Term Attention[J].arXiv:1810.12752.
[44]GUPTA V,KENNY P,OUELLET P,et al.I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription[C]//Proc of IEEE International Conference on Acoustics,Speech and Signal Processing.2014:6334-6338.
[45]GRAVES A,SCHMIDHUBER J.Framewise phoneme classification with bidirectional LSTM networks[C]//International Joint Conference on Neural Networks.2005.
[46]BERINGER N,GRAVES A,SCHIEL F,et al.Classifying Unprompted Speech by Retraining LSTM Nets[J].Lecture Notes in Computer Science,2005,58(1956):575-581.
[47]LI J,MOHAMED A,ZWEIG G,et al.Exploring multidimensional LSTMs for large vocabulary ASR[C]//2016 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).2016:4940-4944.
[48]LI B,SAINATH T N,NARAYANAN A,et al.Acoustic modeling for Google home[C]//Proc.of INTERSPEECH.2017:399-403.
[49]LEE J,TASHEV I.High-level feature representation using recurrent neural network for speech emotion recognition[C]//Interspeech.2015.
[50]HAN W J,RUAN H B,CHEN X M.Towards Temporal Modelling of Categorical Speech Emotion Recognition[C]//Interspeech 2018.2018.
[51]TRIGEORGIS G,RINGEVAL F,BRÜCKNER R,et al.Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network[C]//IEEE International Conference on Acoustics.IEEE,2016.
[52]TANG D,ZENG J,LI M.An End-to-End Deep Learning Framework for Speech Emotion Recognition of Atypical Individuals[C]//Proc. Interspeech.2018:162-166.
[53]PEKHOVSKY T,KORENEVSKY M.Investigation of Using VAE for i-Vector Speaker Verification[J].arXiv:1705.09185.
[54]KAMPER H,JANSEN A,GOLDWATER S.A segmental framework for fully-unsupervised large-vocabulary speech recognition[J].Computer Speech & Language,2017,46:154-174.
[55]CHUNG Y A,GLASS J.Speech2vec:A sequence-to-sequence framework for learning word embeddings from speech[C]//INTERSPEECH.2018:811-815.
[56]LATIF S,RANA R,QADIR J.Variational Autoencoders for Learning Latent Representations of Speech Emotion:A Preliminary Study[C]//Interspeech 2018.2018:3107-3111.
[57]ZONG Z F,LI H,WANG Q.Multi-Channel Auto-Encoder for Speech Emotion Recognition[J].arXiv:1810.10662v1.
[58]LATIF S,RANA R,YOUNIS S,et al.Transfer Learning for Improving Speech Emotion Classification Accuracy[C]//INTERSPEECH.2018:257-261.
[59]LI C,MA X,JIANG B,et al.Deep Speaker:an End-to-End Neural Speaker Embedding System[J].arXiv:1705.02304.
[60]DUMPALA S H,PANDA A,KOPPARAPU S K.Improved I-vector-based Speaker Recognition for Utterances with Speaker Generated Non-speech sounds[J].arXiv:1705.09289.
[61]YI L,LIANG H,YAO T,et al.Comparison of Multiple Features and Modeling Methods for Text-dependent Speaker Verification[C]//2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).IEEE,2017.
[62]LI J H,YANG J A,WANG Y.New Feature Extraction Method Based on Bottleneck Deep Belief Networks and its Application in Language Recognition[J].Computer Science,2014,41(3):263-266.
[63]BHARGAVA M,ROSE R.Architectures for deep neural network based acoustic models defined directly over windowed speech waveforms[C]//INTERSPEECH.2015:6-10.
[64]LI S,XU L T.Research on Emotion Recognition Algorithm Based on Spectrogram Feature Extraction of Bottleneck Feature[J].Computer Technology and Development,2017,27(5):82-86.
[65]SNYDER D,GARCIA-ROMERO D,POVEY D.Deep neural network embeddings for text-independent speaker verification[C]//Interspeech 2017.2017.
[66]KEREN G,SCHULLER B.Convolutional RNN:an Enhanced Model for Extracting Features from Sequential Data[C]//2016 International Joint Conference on Neural Networks (IJCNN).2016.
[67]MA X,WU Z,JIA J,et al.Study on Feature Subspace of Archetypal Emotions for Speech Emotion Recognition[C]//ICASSP 2017.2017.
[68]Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms[C]//Interspeech 2018.2018:3683-3687.
[69]LUO D Q,ZOU Y X,HUANG D Y.Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition[C]//2018 Conference of the International Speech Communication Association(INTERSPEECH 2018).2018:152-156.
[70]CHEN M,HE X,YANG J,et al.3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition[J].IEEE Signal Processing Letters,2018,25(10):1440-1444.
[71]SAKR M,ANDRIENKO G,BEHR T,et al.[C]//Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems.2011:505-508.
[72]NICHOLAS C,SHAHIN A,SANDRA O.Multimodal Bag-of-Words for Cross Domains Sentiment Analysis[C]//IEEE International Conference on Acoustics,Speech,and Signal Processing.IEEE,2018.
[73]LI L,TANG Z,DONG W,et al.Collaborative Learning for Language and Speaker Recognition[C]//ICASSP 2017.2017.
[74]LI Y,WEI Z H,XU K.Hybrid Feature Selection Method of Chinese Emotional Characteristics Based on Lasso Algorithm[J].Computer Science,2018,45(1):39-46.