Computer Science ›› 2020, Vol. 47 ›› Issue (10): 169-173.doi: 10.11896/jsjkx.190800054

• Computer Graphics & Multimedia •

End-to-End Speaker Recognition Based on Frame-level Features

HUA Ming, LI Dong-dong, WANG Zhe, GAO Da-qi   

  School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2019-08-13  Revised: 2019-11-28  Online: 2020-10-15  Published: 2020-10-16
  • About author: HUA Ming, born in 1995, postgraduate, is a member of China Computer Federation. His main research interests include speaker recognition and deep learning.
    LI Dong-dong, born in 1981, Ph.D, associate professor. Her main research interests include speech processing and affective computing.
  • Supported by:
    National Natural Science Foundation of China (61806078), the National Major Scientific and Technological Special Project for "Significant New Drugs Development" (2019ZX09201004), and the "Shuguang Program" supported by Shanghai Education Development Foundation and Shanghai Municipal Education Commission (61725301)

Abstract: Existing speaker recognition methods still have notable shortcomings. End-to-end methods based on utterance-level features must force all inputs to the same size because speech utterances vary in length, while two-stage methods that train features and then apply a separate back-end classifier make the recognition system overly complex. Both factors limit model performance. This paper proposes an end-to-end speaker recognition method based on frame-level features. The model takes frame-level speech as input; because all frames share the same size, this naturally resolves the variable-length problem of utterance-level input, and frame-level features retain more speaker information. Compared with mainstream two-stage recognition systems, the end-to-end method integrates feature learning and classification, simplifying the model. During the training phase, each utterance is segmented into multiple frame-level speech inputs fed to a Convolutional Neural Network (CNN). In the evaluation phase, the trained CNN classifies each frame, and the predicted class of an utterance is computed from the prediction scores of its frames in two ways: by majority vote over each frame's maximum-score class, or by averaging the per-frame prediction scores. To verify the effectiveness of this work, the Mandarin Emotional Speech Corpus (MASC) was used for training and testing. Experimental results show that the proposed end-to-end recognition method based on frame-level features achieves better performance than existing methods.
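The two frame-to-utterance aggregation rules described in the abstract (majority vote over per-frame maximum-score classes, and averaging of per-frame scores) can be sketched as follows. This is a minimal illustration with NumPy, not the paper's implementation: the score matrix stands in for the CNN's per-frame softmax output, and the function and variable names are hypothetical.

```python
import numpy as np

def predict_utterance(frame_scores: np.ndarray, method: str = "average") -> int:
    """Aggregate per-frame CNN prediction scores into one utterance-level class.

    frame_scores: (num_frames, num_speakers) array of softmax scores,
    one row per frame-level input segmented from the utterance.
    """
    if method == "average":
        # Average the score vectors over all frames, then take the best class.
        return int(np.argmax(frame_scores.mean(axis=0)))
    if method == "vote":
        # Take each frame's maximum-score class, then majority-vote over frames.
        frame_classes = np.argmax(frame_scores, axis=1)
        return int(np.bincount(frame_classes).argmax())
    raise ValueError(f"unknown method: {method}")

# Toy example: one utterance split into 4 frames, scored over 3 speakers.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.4, 0.5, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.6, 0.2]])
print(predict_utterance(scores, "average"))  # speaker 0: mean scores (0.475, 0.40, 0.125)
print(predict_utterance(scores, "vote"))
```

Score averaging uses the full per-frame score vectors, so it is more robust when individual frames are ambiguous, whereas majority voting discards everything but each frame's top class.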

Key words: Convolutional Neural Networks, End-to-end, Frame-level features, Speaker recognition, Utterance-level speech

CLC Number: TP301

[1]HANSEN J H L,HASAN T.Speaker Recognition by Machines and Humans:A tutorial review [J].IEEE Signal Processing Magazine,2015,32(6):74-99.
[2]REYNOLDS D A.An overview of automatic speaker recognition technology [C]//IEEE International Conference on Acoustics.IEEE,2011.
[3]VERGIN R,O’SHAUGHNESSY D,FARHAT A.Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition [J].IEEE Transactions on Speech and Audio Processing,1999,7(5):525-532.
[4]REYNOLDS D A,ROSE R C.Robust text-independent speaker identification using Gaussian mixture speaker models [J].IEEE Transactions on Speech and Audio Processing,1995,3(1):72-83.
[5]REYNOLDS D A,QUATIERI T F,DUNN R B.Speaker Verification Using Adapted Gaussian Mixture Models [J].Digital Signal Processing,2000,10(1/2/3):19-41.
[6]MACHLICA L,ZAJÍC Z,et al.An Efficient Implementation of Probabilistic Linear Discriminant Analysis [C]//IEEE International Conference on Acoustics.IEEE,2013.
[7]DEHAK N,KENNY P J,DEHAK R,et al.Front-End FactorAnalysis for Speaker Verification [J].IEEE Transactions on Audio,Speech and Language Processing,2011,19(4):788-798.
[8]WANG H L,QI X L,WU G S.Research Progress of Object Detection Technology Based on Convolutional Neural Network in Deep Learning[J].Computer Science,2018,45(9):11-19.
[9]ZHU J Y,PARK T,ISOLA P,et al.Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks [C]//International Conference on Computer Vision(ICCV).2017:2242-2251.
[10]LIU J,JIN Z Q.Facial Expression Transfer Method Based on Deep Learning[J].Computer Science,2019,46(S1):250-253.
[11]JI R F,CAI X Y,BO X.An End-to-End Text-Independent Speaker Identification System on Short Utterances[C]//Annual Conference of the International Speech Communication Association(INTERSPEECH).2018:3628-3632.
[12]LUKIC Y,VOGT C,DURR O,et al.Speaker identification and clustering using convolutional neural networks[C]//2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP).IEEE,2016.
[13]LI N,TUO D Y,SU D,et al.Deep Discriminative Embeddings for Duration Robust Speaker Verification[C]//Conference of the International Speech Communication Association.2018.
[14]TORFI A,DAWSON J,NASRABADI N M.Text-Independent Speaker Verification Using 3D Convolutional Neural Networks[C]//IEEE International Conference on Multimedia and Expo.2018:1-6.
[15]NAGRANI A,CHUNG J S,ZISSERMAN A.VoxCeleb:a large-scale speaker identification dataset[C]//Conference of the International Speech Communication Association.2017.
[16]HRŬZ M,ZAJÍC Z.Convolutional Neural Network for speaker change detection in telephone speaker diarization system[C]//IEEE International Conference on Acoustics.IEEE,2017.
[17]VARIANI E,LEI X,MCDERMOTT E,et al.Deep neural networks for small footprint text-dependent speaker verification[C]//2014 IEEE International Conference on Acoustics,Speech and Signal Processing (ICASSP).IEEE,2014.
[18]YU-HSIN C,MORENO I L,TARA N S,et al.Locally-Connected and Convolutional Neural Networks for Small Footprint Speaker Recognition [C]//Conference of the International Speech Communication Association.2015.
[19]NAIR V,HINTON G E.Rectified Linear Units Improve Restricted Boltzmann Machines [C]//International Conference on Machine Learning.Omnipress,2010.
[20]YUAN L,YAN M Q,NAN X C,et al.Deep feature for text-dependent speaker verification[J].Speech Communication,2015,73:1-13.
[21]WU T,YANG Y,WU Z,et al.MASC:A Speech Corpus inMandarin for Emotion Analysis and Affective Speaker Recognition [C]//Speaker & Language Recognition Workshop.IEEE,2006.
[22]YANG Y C,WU Z H,WU T,et al.Mandarin Affective Speech LDC2007S09[EB/OL].https://catalog.ldc.upenn.edu/LDC2007S09.
[23]IOFFE S,SZEGEDY C.Batch Normalization:Accelerating Deep Network Training by Reducing Internal Covariate Shift [C]//2015 International Conference on Machine Learning.
[24]SRIVASTAVA N,HINTON G,KRIZHEVSKY A,et al.Dropout:A Simple Way to Prevent Neural Networks from Overfitting [J].Journal of Machine Learning Research,2014,15(1):1929-1958.
[25]KINGMA D P,BA J.Adam:A Method for Stochastic Optimization[C]//2015 International Conference on Learning Representations(Poster).2015.
[26]SADJADI S O,SLANEY M,HECK L.MSR Identity Toolbox v1.0:A MATLAB Toolbox for Speaker Recognition Research[EB/OL].https://www.microsoft.com/en-us/research/publication/msr-identity-toolbox-v1-0-a-matlab-toolbox-for-speaker-recognition-research-2.