Computer Science, 2020, Vol. 47, Issue (10): 169-173. doi: 10.11896/jsjkx.190800054
花明, 李冬冬, 王喆, 高大启
HUA Ming, LI Dong-dong, WANG Zhe, GAO Da-qi
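Abstract: Existing speaker recognition methods still have notable shortcomings. End-to-end methods that take utterance-level features as input must process every input to a uniform size because utterances vary in length, while two-stage methods that train features and then apply a back-end classifier make the recognition system overly complex; both factors degrade model performance. This paper proposes an end-to-end speaker recognition method based on frame-level features. The model takes frame-level speech as input: frames of equal size resolve the problem of inconsistent utterance lengths, and frame-level features retain more speaker information. Compared with the currently mainstream two-stage recognition systems, the end-to-end method integrates feature training with classification scoring, which reduces model complexity. In the training stage, each utterance is split into multiple frames, which are fed to a convolutional neural network (CNN) to train the model. In the evaluation stage, the trained CNN classifies each frame, and the predicted class of an utterance is computed from the prediction scores of its frames. Two rules are used for this: taking the class predicted most often across the frames, and averaging the per-frame prediction values. To validate the method, speech data from the Mandarin Affective Speech Corpus (MASC) are used for training and testing. Experimental results show that the frame-level end-to-end method outperforms existing methods.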
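The evaluation step described in the abstract (equal-size frames, then an utterance-level decision by majority vote or by score averaging) can be illustrated with a minimal sketch. The function names, the frame length and hop parameters, the tie-breaking rule, and the example scores are all assumptions made for illustration; the abstract does not specify them, and the CNN itself is replaced here by hand-written per-frame scores.

```python
from collections import Counter
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Cut an utterance into equal-length frames (hypothetical helper).

    A trailing partial frame is dropped so that every frame has the same size.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [list(samples[s:s + frame_len]) for s in range(0, last_start + 1, hop)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value; the lowest index wins on ties."""
    return max(range(len(values)), key=lambda i: (values[i], -i))


def majority_vote(frame_scores: Sequence[Sequence[float]]) -> int:
    """Utterance label = class predicted most often across frames.

    Ties are broken in favour of the lowest class index (an assumption).
    """
    counts = Counter(argmax(scores) for scores in frame_scores)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


def mean_score(frame_scores: Sequence[Sequence[float]]) -> int:
    """Utterance label = class with the highest average per-frame score."""
    n_frames = len(frame_scores)
    n_classes = len(frame_scores[0])
    means = [sum(scores[k] for scores in frame_scores) / n_frames for k in range(n_classes)]
    return argmax(means)


# Illustrative per-frame scores for one utterance and three speakers.
example_scores = [
    [0.90, 0.10, 0.00],
    [0.40, 0.60, 0.00],
    [0.45, 0.55, 0.00],
]
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
vote_label = majority_vote(example_scores)
mean_label = mean_score(example_scores)
```

With these made-up scores the two rules disagree: two of three frames favour speaker 1, while the averaged scores favour speaker 0 because of the single high-confidence frame. This is one reason a system might evaluate both aggregation rules.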