Computer Science ›› 2020, Vol. 47 ›› Issue (5): 110-119. doi: 10.11896/jsjkx.190400122

• Computer Graphics & Multimedia •

Survey of Acoustic Feature Extraction in Speech Tasks

ZHENG Chun-jun1,2, WANG Chun-li1, JIA Ning2   

  1 College of Information Science and Technology,Dalian Maritime University,Dalian,Liaoning 116023,China
    2 School of Computer & Software,Dalian Neusoft University of Information,Dalian,Liaoning 116023,China
  • Received:2019-04-22 Online:2020-05-15 Published:2020-05-19
  • Corresponding author: ZHENG Chun-jun (zhengchunjun@neusoft.edu.cn)
  • About author:ZHENG Chun-jun,born in 1976,master,associate professor,is a member of China Computer Federation.His main research interests include speech emotion recognition,deep learning and big data analysis.
  • Supported by:
    This work was supported by Liaoning Natural Science Foundation (20180551068).

Abstract: Speech is an important means of transmitting and exchanging information, and people routinely use it as a communication medium. The acoustic signal of speech carries a large amount of speaker information, semantic information and rich emotional information, which has given rise to three distinct directions of speech tasks: speaker recognition (SR), automatic speech recognition (ASR) and speech emotion recognition (SER). Each task uses different techniques and specific methods for information extraction and model design in its own field. This paper first reviews the early development of the three tasks in China and abroad, dividing the development of speech tasks into four stages, and summarizes the common acoustic features that the three tasks rely on during feature extraction, explaining the focus of each type of feature. Then, given the wide application of deep learning in various fields in recent years, which has greatly advanced speech tasks, the paper analyzes how currently popular deep learning models are applied to acoustic modeling, and summarizes the acoustic feature extraction methods and technical routes for the three tasks along two lines, supervised and unsupervised, together with multi-channel models fused with attention mechanisms for speech feature extraction. To handle speech recognition, speaker recognition and emotion recognition simultaneously, a multi-task Tandem model targeting the personalized characteristics of the acoustic signal is proposed; in addition, a multi-channel cooperative network model is proposed, a design that can improve the accuracy of multi-task feature extraction.
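The common acoustic features surveyed above (e.g., MFCCs and log-Mel spectrograms with their deltas) can be computed with standard toolkits. The following is a minimal sketch, assuming the Python library librosa is available; the file name, sampling rate and frame parameters are illustrative choices, not values prescribed by the paper.

```python
# Sketch of frame-level acoustic feature extraction for SR/ASR/SER front ends.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Return stacked MFCC+delta features and a log-Mel spectrogram."""
    y, sr = librosa.load(wav_path, sr=sr)              # resample to 16 kHz
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)      # 25 ms window, 10 ms hop
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                         n_fft=n_fft, hop_length=hop)
    log_mel = librosa.power_to_db(mel)                 # log compression
    delta = librosa.feature.delta(mfcc)                # local temporal dynamics
    return np.vstack([mfcc, delta]), log_mel

# "utterance.wav" is a hypothetical input file
features, log_mel = extract_features("utterance.wav")
```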

Key words: Acoustic feature extraction, Automatic speech recognition, Deep learning, Multi-channel fusion, Speaker recognition, Speech emotion recognition
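The multi-channel attention fusion and multi-task Tandem designs are only described at a high level in this abstract; the PyTorch sketch below is one plausible reading, assuming a per-channel recurrent encoder, attention pooling over frames, and separate heads for the ASR/SR/SER tasks. All layer sizes, names and the utterance-level ASR head are hypothetical simplifications, not the paper's exact architecture.

```python
# Hedged sketch of a multi-channel, attention-pooled, multi-task network.
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    """Weight frames by a learned attention score, then sum (attention pooling)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                       # x: (batch, frames, dim)
        w = torch.softmax(self.score(x), dim=1)
        return (w * x).sum(dim=1)               # (batch, dim)

class MultiTaskNet(nn.Module):
    """Two feature 'channels' (e.g., MFCC and log-Mel) fused into one embedding."""
    def __init__(self, feat_dim=40, hidden=128,
                 n_phones=50, n_speakers=100, n_emotions=4):
        super().__init__()
        self.enc_a = nn.GRU(feat_dim, hidden, batch_first=True)
        self.enc_b = nn.GRU(feat_dim, hidden, batch_first=True)
        self.pool_a = AttentivePool(hidden)
        self.pool_b = AttentivePool(hidden)
        # One head per task over the fused utterance embedding; a real ASR
        # head would operate frame by frame, pooled here only for brevity.
        self.asr_head = nn.Linear(2 * hidden, n_phones)
        self.spk_head = nn.Linear(2 * hidden, n_speakers)
        self.emo_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, chan_a, chan_b):          # each: (batch, frames, feat_dim)
        ha, _ = self.enc_a(chan_a)
        hb, _ = self.enc_b(chan_b)
        fused = torch.cat([self.pool_a(ha), self.pool_b(hb)], dim=-1)
        return self.asr_head(fused), self.spk_head(fused), self.emo_head(fused)

net = MultiTaskNet()
x = torch.randn(2, 200, 40)                     # dummy batch: 2 utterances, 200 frames
asr_out, spk_out, emo_out = net(x, x)
```

In a Tandem-style setup, the fused embedding or the heads' posteriors could in turn serve as input features to downstream task-specific models; the survey's point is that sharing the front end lets the three tasks reinforce each other's feature extraction.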

CLC Number: TP183

References
[1]ZHANG S,ZHANG S,HUANG T,et al.Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching[J].IEEE Transactions on Multimedia,2017,20(6):1576-1590.
[2]RICHARDSON F,REYNOLDS D,DEHAK N.A Unified Deep Neural Network for Speaker and Language Recognition[J].arXiv:1504.00923.
[3]KANAGASUNDARAM A,DEAN D,SRIDHARAN S,et al.DNN based Speaker Recognition on Short Utterances[J].arXiv:1610.03190.
[4]LEE J,LEE M,CHANG J H.Ensemble of Jointly Trained Deep Neural Network-Based Acoustic Models for Reverberant Speech Recognition[J].arXiv:1608.04983.
[5]TANG Z,LI L,WANG D.Multi-task Recurrent Model for Speech and Speaker Recognition[C]//2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).IEEE,2016.
[6]CHU W,CHEN R.Speaker Cluster-Based Speaker Adaptive Training for Deep Neural Network Acoustic Modeling[C]//ICASSP 2016.IEEE,2016.
[7]GHAHABI O,HERNANDO J.Deep Learning for Single and Multi-Session i-Vector Speaker Recognition[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2017,25(4).
[8]JIN Q,CHEN S Z,LI X R,et al.Speech emotion recognition based on acoustic characteristics [J].Computer Science,2015,42(9):24-28.
[9]WANG W,YANG L P,WEI L,et al.Extraction and Analysis of Speech Emotion Characteristics[J].Research and Exploration in Laboratory,2013,32(7):91-94,191.
[10]YANG M H,TAO J H,LI H,et al.Nature Multimodal Human-Computer-Interaction Dialog System[J].Computer Science,2014,41(10):12-18,35.
[11]RAMANARAYANAN V,PUGH R,QIAN Y,et al.Automatic Turn-Level Language Identification for Code-Switched Spanish-English Dialog[C]//9th International Workshop on Spoken Dialogue System Technology.2019:51-61.
[12]DELLAERT F,POLZIN T,WAIBEL A.Recognizing emotion in speech[C]//International Conference on Spoken Language.1996.
[13]AHMAD J,FIAZ M,KWON S I,et al.Gender Identification using MFCC for Telephone Applications-A Comparative Study[J].International Journal of Computer Science and Electronics Engineering,2015,3(5):351-355.
[14]BANDELA S R,KUMAR T K.Stressed speech emotion recognition using feature fusion of teager energy operator and MFCC[C]//International Conference on Computing.IEEE Computer Society,2017.
[15]ZHAO W,GAO Y,SINGH R,et al.Speaker identification from the sound of the human breath[J].arXiv:1712.00171v2.
[16]DENG L.A tutorial survey of architectures,algorithms,and applications for deep learning[J].Apsipa Transactions on Signal & Information Processing,2014,3.
[17]VARIANI E,LEI X,MCDERMOTT E,et al.Deep neural networks for small footprint text-dependent speaker verification[C]//2014 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2014.
[18]HANNUN A,CASE C,CASPER J,et al.Deep Speech:Scaling up end-to-end speech recognition[J].arXiv:1412.5567.
[19]AMODEI D,ANUBHAI R,BATTENBERG E,et al.Deep Speech 2:End-to-End Speech Recognition in English and Mandarin[J].arXiv:1512.02595.
[20]SATT A,ROZENBERG S,HOORY R.Efficient emotion recognition from speech using deep learning on spectrograms[C]//Proc.Interspeech 2017.2017:1089-1093.
[21]EYBEN F,SCHERER K R,TRUONG K P,et al.The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing[J].IEEE Transactions on Affective Computing,2016,7(2):190-202.
[22]MULIMANI M,KOOLAGUDI S.Robust Acoustic Event Classification using Bag-of-Visual-Words[C]//Proc.Interspeech.2018:3319-3322.
[23]LI L,WANG D,ZHENG T F.System Combination for Short Utterance Speaker Recognition[C]//Signal & Information Processing Association Summit & Conference.IEEE,2016.
[24]ZHANG M,CHEN Y,LI L,et al.Speaker Recognition with Cough,Laugh and “Wei”[J].arXiv:1706.07860.
[25]LI L,WANG D,ZHANG Z,et al.Deep Speaker Vectors for Semi Text-independent Speaker Verification[J].arXiv:1505.06427.
[26]LU L.Sequence Training and Adaptation of Highway Deep Neural Networks[C]//2016 IEEE Spoken Language Technology Workshop (SLT).2016.
[27]HAN K,YU D,TASHEV I.Speech emotion recognition using deep neural network and extreme learning machine[C]//Fifteenth Annual Conference of the International Speech Communication Association.2014:223-227.
[28]MAO Q,DONG M,HUANG Z,et al.Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks[J].IEEE Transactions on Multimedia,2014,16(8):2203-2213.
[29]SARMA M,GHAHREMANI P,POVEY D.Emotion Identification from raw speech signals using DNNs[C]//Interspeech.2018:3097-3101.
[30]PALAZ D,COLLOBERT R,et al.Analysis of CNN-based speech recognition system using raw speech as input[C]//Proceedings of Interspeech.2015:11-15.
[31]SAINATH T,PARADA C.Convolutional neural networks for small-footprint keyword spotting[C]//Proceedings of Interspeech.2015:1478-1482.
[32]CHEN L,LEE C M.Predicting Audience's Laughter Using Convolutional Neural Network [J].arXiv:1702.02584.
[33]CHAN W,LANE I.Deep convolutional neural networks for acoustic modeling in low resource languages[C]//2015 IEEE International Conference on Acoustics,Speech and Signal Proces-sing.2015:2056-2060.
[34]HUANG Y L,LUO X X,LIU D R.Local Finite Weight Sharing of MFSC Coefficients Based CNN Speech Recognition[J].Control Engineering of China,2017,24(7):1507-1513.
[35]ALDENEH Z,PROVOST E M.Using regional saliency for speech emotion recognition[C]//IEEE International Conference on Acoustics.IEEE,2017.
[36]KHORRAM S,JAISWAL M,GIDEON J,et al.The PRIORI Emotion Dataset:Linking Mood to Emotion Detected In-the-Wild[C]//Interspeech 2018.2018:1903-1907.
[37]HUANG C W,NARAYANAN S.Shaking acoustic spectral sub-bands can better regularize learning in affective computing[C]//ICASSP 2018-2018 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2018.
[38]ZHENG W Q,YU J S,ZOU Y X.An experimental study of speech emotion recognition based on deep convolutional neural networks[C]//2015 International Conference on Affective Computing and Intelligent Interaction (ACII).IEEE Computer Society,2015.
[39]NIU Y,ZOU D,NIU Y,et al.A breakthrough in Speech emotion recognition using Deep Retinal Convolution Neural Networks[J].arXiv:1707.09917.
[40]SWIETOJANSKI P,RENALS S.Differentiable Pooling for Unsupervised Acoustic Model Adaptation[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2016,24(10):1773-1784.
[41]WANG D,ZHENG T F.Transfer learning for speech and language processing[C]//Proceedings of APSIPA Annual Summit and Conference.APSIPA,2015.
[42]HUANG J T,LI J,YU D,et al.Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers[C]//Proceedings of IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2013:7304-7308.
[43]ZHONG G,LIN X,CHEN K.Long Short-Term Attention[J].arXiv:1810.12752.
[44]GUPTA V,KENNY P,OUELLET P,et al.I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription[C]//Proc of IEEE International Conference on Acoustics,Speech and Signal Processing.2014:6334-6338.
[45]GRAVES A,SCHMIDHUBER J.Framewise phoneme classification with bidirectional LSTM networks[C]//International Joint Conference on Neural Networks.2005.
[46]BERINGER N,GRAVES A,SCHIEL F,et al.Classifying Unprompted Speech by Retraining LSTM Nets[J].Lecture Notes in Computer Science,2005,58(1956):575-581.
[47]LI J,MOHAMED A,ZWEIG G,et al.Exploring multidimensional LSTMs for large vocabulary ASR[C]//2016 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).2016:4940-4944.
[48]LI B,SAINATH T N,NARAYANAN A,et al.Acoustic modeling for Google home[C]//Proc.of INTERSPEECH.2017:399-403.
[49]LEE J,TASHEV I.High-level feature representation using recurrent neural network for speech emotion recognition[C]//Interspeech.2015.
[50]HAN W J,RUAN H B,CHEN X M.Towards Temporal Modelling of Categorical Speech Emotion Recognition[C]//Interspeech 2018.2018.
[51]TRIGEORGIS G,RINGEVAL F,BRÜCKNER R,et al.Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network[C]//IEEE International Conference on Acoustics.IEEE,2016.
[52]TANG D,ZENG J,LI M.An End-to-End Deep Learning Framework for Speech Emotion Recognition of Atypical Individuals[C]//Proc. Interspeech.2018:162-166.
[53]PEKHOVSKY T,KORENEVSKY M.Investigation of Using VAE for i-Vector Speaker Verification[J].arXiv:1705.09185.
[54]KAMPER H,JANSEN A,GOLDWATER S.A segmental framework for fully-unsupervised large-vocabulary speech recognition[J].Computer Speech & Language,2017,46:154-174.
[55]CHUNG Y A,GLASS J.Speech2vec:A sequence-to-sequence framework for learning word embeddings from speech[C]//INTERSPEECH.2018:811-815.
[56]LATIF S,RANA R,QADIR J.Variational Autoencoders for Learning Latent Representations of Speech Emotion:A Preliminary Study[C]//Interspeech 2018.2018:3107-3111.
[57]ZONG Z F,LI H,WANG Q.Multi-Channel Auto-Encoder for Speech Emotion Recognition[J].arXiv:1810.10662v1.
[58]LATIF S,RANA R,YOUNIS S,et al.Transfer Learning for Improving Speech Emotion Classification Accuracy[C]//INTERSPEECH.2018:257-261.
[59]LI C,MA X,JIANG B,et al.Deep Speaker:an End-to-End Neural Speaker Embedding System[J].arXiv:1705.02304.
[60]DUMPALA S H,PANDA A,KOPPARAPU S K.Improved I-vector-based Speaker Recognition for Utterances with Speaker Generated Non-speech sounds[J].arXiv:1705.09289.
[61]YI L,LIANG H,YAO T,et al.Comparison of Multiple Features and Modeling Methods for Text-dependent Speaker Verification[C]//2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).IEEE,2017.
[62]LI J H,YANG J A,WANG Y.New Feature Extraction Method Based on Bottleneck Deep Belief Networks and its Application in Language Recognition[J].Computer Science,2014,41(3):263-266.
[63]BHARGAVA M,ROSE R.Architectures for deep neural network based acoustic models defined directly over windowed speech waveforms[C]//INTERSPEECH.2015:6-10.
[64]LI S,XU L T.Research on Emotion Recognition Algorithm Based on Spectrogram Feature Extraction of Bottleneck Feature[J].Computer Technology and Development,2017,27(5):82-86.
[65]SNYDER D,GARCIA-ROMERO D,POVEY D.Deep neural network embeddings for text-independent speaker verification[C]//Interspeech 2017.2017.
[66]KEREN G,SCHULLER B.Convolutional RNN:an Enhanced Model for Extracting Features from Sequential Data[C]//2016 International Joint Conference on Neural Networks (IJCNN).2016.
[67]MA X,WU Z,JIA J,et al.Study on Feature Subspace of Archetypal Emotions for Speech Emotion Recognition[C]//ICASSP 2017.2017.
[68]Emotion Recognition from Variable-Length Speech Segments Using Deep Learning on Spectrograms[C]//Interspeech 2018.2018:3683-3687.
[69]LUO D Q,ZOU Y X,HUANG D Y.Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition[C]//2018 Conference of the International Speech Communication Association(INTERSPEECH 2018).2018:152-156.
[70]CHEN M,HE X,YANG J,et al.3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition[J].IEEE Signal Processing Letters,2018,25(10):1440-1444.
[71]SAKR M,ANDRIENKO G,BEHR T,et al.[C]//Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems.2011:505-508.
[72]NICHOLAS C,SHAHIN A,SANDRA O.Multimodal Bag-of-Words for Cross Domains Sentiment Analysis[C]//IEEE International Conference on Acoustics,Speech,and Signal Processing.IEEE,2018.
[73]LI L,TANG Z,DONG W,et al.Collaborative Learning for Language and Speaker Recognition[C]//ICASSP 2017.2017.
[74]LI Y,WEI Z H,XU K.Hybrid Feature Selection Method of Chinese Emotional Characteristics Based on Lasso Algorithm[J].Computer Science,2018,45(1):39-46.