Computer Science ›› 2020, Vol. 47 ›› Issue (10): 169-173. doi: 10.11896/jsjkx.190800054

• Computer Graphics & Multimedia •

  • Corresponding author: LI Dong-dong (ldd@ecust.edu.cn)
  • Author contact: 961564330@qq.com

End-to-End Speaker Recognition Based on Frame-level Features

HUA Ming, LI Dong-dong, WANG Zhe, GAO Da-qi   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2019-08-13  Revised: 2019-11-28  Online: 2020-10-15  Published: 2020-10-16
  • About author: HUA Ming, born in 1995, postgraduate, is a member of China Computer Federation. His main research interests include speaker recognition and deep learning.
    LI Dong-dong, born in 1981, Ph.D., associate professor. Her main research interests include speech processing and affective computing.
  • Supported by:
    National Natural Science Foundation of China (61806078), National Major Scientific and Technological Special Project for "Significant New Drugs Development" (2019ZX09201004) and "Shuguang Program" supported by Shanghai Education Development Foundation and Shanghai Municipal Education Commission (61725301)



Abstract: There are still many shortcomings in existing speaker recognition methods. End-to-end methods based on utterance-level features must process the input to a uniform size because speech lengths vary, while two-stage methods that combine feature training with posterior classification make the recognition system overly complex. Both factors degrade model performance. This paper proposes an end-to-end speaker recognition method based on frame-level features. The model takes frame-level speech as input; because all frames are the same size, this effectively solves the problem of inconsistent utterance-level input lengths, and frame-level features retain more speaker information. Compared with the mainstream two-stage recognition systems, the end-to-end method integrates feature training and classification scoring, which simplifies the model. In the training phase, each utterance is segmented into multiple frame-level speech inputs that are fed to a Convolutional Neural Network (CNN) to train the model. In the evaluation phase, the trained CNN classifies the frame-level speech, and the predicted class of each utterance is computed from the prediction scores of its frames, either by taking the class predicted most often across frames or by averaging the per-frame prediction scores. To verify the effectiveness of the method, speech data from the Mandarin Emotional Speech Corpus (MASC) were used for training and testing. The experimental results show that the end-to-end recognition method based on frame-level features outperforms existing methods.
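The evaluation procedure described above (classify each frame, then decide the utterance's class either by taking the most frequent per-frame prediction or by averaging the per-frame score vectors) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, frame parameters, and toy numbers are assumptions.

```python
# Sketch of frame segmentation and the two utterance-level decision rules
# described in the abstract. All names and values here are illustrative.
from collections import Counter

def split_into_frames(signal, frame_len, hop):
    """Cut one utterance into equal-sized, possibly overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def majority_vote(frame_scores):
    """Utterance class = the class predicted most often across frames."""
    frame_labels = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    return Counter(frame_labels).most_common(1)[0][0]

def average_scores(frame_scores):
    """Utterance class = argmax of the frame-averaged score vector."""
    n_classes = len(frame_scores[0])
    mean = [sum(s[c] for s in frame_scores) / len(frame_scores)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

# Toy example: per-frame CNN scores for 3 frames over 3 speaker classes.
scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
```

On this toy input, both rules agree on class 0: frames vote [0, 1, 0], and the averaged score vector also peaks at class 0. In general the two rules can disagree, which is why the paper evaluates them separately.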

Key words: Convolutional neural networks, End-to-end, Frame-level features, Speaker recognition, Utterance-level speech

CLC number: TP301