Computer Science, 2020, Vol. 47, Issue (5): 110-119. doi: 10.11896/jsjkx.190400122
• Computer Graphics & Multimedia •
ZHENG Chun-jun1,2, WANG Chun-li1, JIA Ning2