Computer Science, 2021, Vol. 48, Issue (11A): 270-274. doi: 10.11896/jsjkx.210400041

• Image Processing & Multimedia Technology •

Voiceprint Recognition Based on LSTM Neural Network

LIU Xiao-xuan, JI Yi, LIU Chun-ping

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Online: 2021-11-10 Published: 2021-11-12
  • Corresponding author: JI Yi (jiyi@suda.edu.cn)
  • About author: LIU Xiao-xuan (liuxiaoxuan_n@163.com), born in 1999, undergraduate. Her main research interests include machine learning and pattern recognition.
    JI Yi, born in 1973, Ph.D., associate professor, is a member of China Computer Federation. Her main research interests include pattern recognition and computer vision.
  • Supported by:
    Hui-Chun Chin and Tsung-Dao Lee Chinese Undergraduate Research Endowment (CURE), National Natural Science Foundation of China (61773272) and Natural Science Foundation of the Jiangsu Higher Education Institutions of China (19KJA230001).

Abstract: Voiceprint recognition identifies a speaker from his or her voice by exploiting individual differences in biometric characteristics. Because voiceprints are contactless, easy to collect and stable, they have a wide range of applications. Existing statistical-model methods are limited by the single type of feature they extract and by weak generalization ability. In recent years, with the rapid development of artificial intelligence and deep learning, neural network models have begun to stand out in the field of voiceprint recognition. This paper proposes a voiceprint recognition method based on a Long Short-Term Memory (LSTM) neural network, which uses spectrograms to extract voiceprint features as the model input and thereby realizes text-independent voiceprint recognition. A spectrogram jointly represents the frequency and energy information of a speech signal along the time axis, so it expresses richer voiceprint features. An LSTM network is good at capturing temporal features and emphasizes information in the time dimension, which fits the characteristics of speech data better than other neural network models do. The proposed method combines the long-term learning ability of the LSTM network with the temporal features of voiceprint spectrograms. Experimental results show a recognition accuracy of 84.31% on the THCHS-30 speech data set. For short utterances of 3 s recorded in a natural environment, the recognition accuracy reaches 96.67%, which is better than that of existing methods based on the Gaussian mixture model and the convolutional neural network.
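
As a rough illustration of the spectrogram front end mentioned in the abstract, the sketch below frames a signal, applies a window, and computes log-power values per frequency bin. The frame length, hop size, Hann window, sample rate, and the synthetic tone are all assumptions chosen for illustration; the abstract does not state the parameters used in the paper. A naive DFT is used only to keep the example within the standard library, and a practical pipeline would likely use an FFT routine instead.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(signal, frame_len=64, hop=32):
    """Return one row per frame; each row holds log-power (dB) per frequency bin."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT for bin k (O(N^2) per frame), for illustration only.
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(10 * math.log10(abs(acc) ** 2 + 1e-10))
        frames.append(row)
    return frames


# Hypothetical input: a 1 kHz tone sampled at 8 kHz stands in for a speech clip.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(tone)

# Bin width is SAMPLE_RATE / frame_len = 125 Hz, so the tone should peak in bin 8.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Each row of `spec` is one time step, which is the shape a sequence model expects: time along one axis, frequency and energy along the other.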

Key words: Deep learning, Long Short-Term Memory, Neural network, Spectrogram, Voiceprint recognition
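
To make the LSTM idea concrete, here is a minimal sketch of a single LSTM cell stepping over spectrogram-like frames, followed by a softmax over speakers. It is not the network from the paper: the layer sizes, the random untrained weights, the three hypothetical speakers, and the names `LSTMCell` and `classify` are assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class LSTMCell:
    """Standard LSTM cell with input (i), forget (f), output (o) gates and candidate (g)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        # One weight matrix and bias per gate, acting on [input, previous hidden].
        self.w = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden)]
                   for _ in range(n_hidden)]
            for gate in "ifog"
        }
        self.b = {gate: [0.0] * n_hidden for gate in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation of input frame and previous hidden state
        pre = {
            gate: [a + b for a, b in zip(matvec(self.w[gate], z), self.b[gate])]
            for gate in "ifog"
        }
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["g"]]
        # Cell state keeps part of the old memory and adds gated new content.
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def classify(frames, cell, w_out):
    """Run the cell over all frames and map the last hidden state to speaker probabilities."""
    h = [0.0] * cell.n_hidden
    c = [0.0] * cell.n_hidden
    for frame in frames:
        h, c = cell.step(frame, h, c)
    return softmax(matvec(w_out, h)), h


# Hypothetical data: 10 frames of 5 frequency bins, 4 hidden units, 3 speakers.
rng = random.Random(1)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(10)]
cell = LSTMCell(n_in=5, n_hidden=4)
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(3)]
probs, last_hidden = classify(frames, cell, w_out)
```

Because the weights are random, `probs` carries no meaning here; the point is only that the hidden and cell states are carried from frame to frame, which is how the model takes the time dimension into account.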

CLC Number: 

  • TP391.4