Computer Science ›› 2025, Vol. 52 ›› Issue (3): 214-221. doi: 10.11896/jsjkx.240100222

• Computer Graphics & Multimedia •


Speaker Verification Method Based on Sub-band Front-end Model and Inverse Feature Fusion

WANG Mengwei, YANG Zhe   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2024-01-31 Revised: 2024-06-12 Online: 2025-03-15 Published: 2025-03-07
  • Corresponding author: YANG Zhe (yangzhe@suda.edu.cn)
  • About author: WANG Mengwei (mwwang@stu.suda.edu.cn), born in 1998, postgraduate. His main research interests include speaker recognition and audio classification.
    YANG Zhe, born in 1978, Ph.D., associate professor. His main research interests include artificial intelligence, machine learning and big data.
  • Supported by:
    Ministry of Education University-Industry Collaborative Education Program (220606363154256).


Abstract: The time delay neural networks (TDNN) used to extract frame-level features in existing speaker verification methods suffer from two problems: they lack the ability to model local frequency features, and their multilayer feature fusion scheme cannot effectively model the complex relationships between high-level and low-level features. Therefore, a new front-end model and a new multilayer feature fusion scheme are proposed. In the front-end model, the input feature map is divided into multiple sub-bands whose frequency range is expanded layer by layer, so that the TDNN can model local frequency features progressively. Meanwhile, an inverse path passing from higher to lower layers is added to the backbone model to model the relationship between the output features of adjacent layers, and the outputs of each layer in the inverse path are concatenated to serve as the fused feature. In addition, an inverted bottleneck design is used in the backbone model to further improve performance. Experimental results on the VoxCeleb1 test set show that, compared with the current TDNN method, the proposed method reduces the equal error rate by a relative 9% and the minimum detection cost function (minDCF) by a relative 14%, while using only 52% of the parameters.
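
The sub-band front end can be pictured as splitting the frequency axis of the input feature map into narrow bands, running a TDNN layer (a 1-D convolution over time) on each band, and then merging adjacent bands so that deeper layers cover a progressively wider frequency range. The following PyTorch sketch is a hypothetical illustration of that idea under assumed shapes (80 mel bins, 4 initial bands, a two-stage merge schedule), not the authors' implementation:

```python
# Hypothetical sketch of progressive sub-band modeling; the layer sizes and
# the two-stage merge schedule are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class SubBandFrontEnd(nn.Module):
    def __init__(self, n_mels=80, n_bands=4, channels=64):
        super().__init__()
        band_dim = n_mels // n_bands
        # Stage 1: an independent TDNN layer (Conv1d over time) per narrow band.
        self.stage1 = nn.ModuleList(
            nn.Conv1d(band_dim, channels, kernel_size=5, padding=2)
            for _ in range(n_bands))
        # Stage 2: adjacent bands are concatenated, so each layer here sees
        # twice the frequency range of the layer before it.
        self.stage2 = nn.ModuleList(
            nn.Conv1d(2 * channels, channels, kernel_size=3, padding=1)
            for _ in range(n_bands // 2))
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (batch, n_mels, frames)
        bands = torch.chunk(x, len(self.stage1), dim=1)
        h = [self.act(conv(b)) for conv, b in zip(self.stage1, bands)]
        merged = [torch.cat(h[2 * i:2 * i + 2], dim=1)
                  for i in range(len(self.stage2))]
        h = [self.act(conv(m)) for conv, m in zip(self.stage2, merged)]
        return torch.cat(h, dim=1)              # (batch, channels*n_bands//2, frames)
```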
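The reverse fusion path works like a feature pyramid traversed top-down: the highest-level feature is combined with the next lower one, the result with the one below that, and every intermediate output of the path is kept and concatenated at the end. A minimal sketch, assuming three backbone stages with matching channel counts and temporal resolution, and a 1x1-conv pairwise fusion rule standing in for the paper's relation modeling:

```python
# Hypothetical top-down ("inverse path") fusion over backbone stage outputs.
import torch
import torch.nn as nn

class InverseFusion(nn.Module):
    def __init__(self, channels=64, n_layers=3):
        super().__init__()
        # One fusion layer per adjacent pair of backbone stages.
        self.fuse = nn.ModuleList(
            nn.Conv1d(2 * channels, channels, kernel_size=1)
            for _ in range(n_layers - 1))

    def forward(self, feats):                   # feats: low -> high stage outputs
        top_down = [feats[-1]]                  # the reverse path starts at the top
        for conv, low in zip(self.fuse, reversed(feats[:-1])):
            # Model the relation between the running high-level feature and
            # the next lower-level feature, then continue downward.
            top_down.append(conv(torch.cat([top_down[-1], low], dim=1)))
        # Concatenate every output of the reverse path as the fused feature.
        return torch.cat(top_down, dim=1)       # (batch, channels*n_layers, frames)
```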
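The inverted bottleneck mentioned in the abstract is the expand-then-project block popularized by MobileNetV2 and ConvNeXt: a 1x1 convolution widens the channels, a depthwise convolution mixes over time, and a second 1x1 convolution projects back down. A generic 1-D variant, shown only to illustrate the block type (the expansion factor and kernel size are illustrative defaults, not taken from the paper):

```python
# Generic inverted-bottleneck block in 1-D; not the paper's exact configuration.
import torch.nn as nn

class InvertedBottleneck1d(nn.Module):
    def __init__(self, channels=64, expansion=4):
        super().__init__()
        hidden = expansion * channels
        self.block = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),        # expand channels
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden),                          # depthwise, over time
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=1),        # project back
        )

    def forward(self, x):                       # x: (batch, channels, frames)
        return x + self.block(x)                # residual connection
```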
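The two reported metrics are standard for speaker verification: the equal error rate (EER) is the operating point where the false-alarm and miss rates coincide, and minDCF is the minimum of a weighted detection cost over all thresholds. A self-contained EER computation from trial scores, using the standard definition rather than anything specific to this paper:

```python
# Standard EER from verification trial scores; label 1 = same-speaker (target).
import numpy as np

def equal_error_rate(scores, labels):
    order = np.argsort(scores)[::-1]            # sort trials by score, descending
    labels = np.asarray(labels, dtype=float)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Accepting the top-k trials at each possible threshold position:
    fnr = 1.0 - np.cumsum(labels) / n_pos       # miss rate among target trials
    fpr = np.cumsum(1.0 - labels) / n_neg       # false alarms among non-targets
    k = np.argmin(np.abs(fnr - fpr))            # threshold where the rates cross
    return (fnr[k] + fpr[k]) / 2.0

# e.g. equal_error_rate([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]) -> 0.5
```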

Key words: Speaker recognition, Speaker verification, Time delay neural network, Sub-band feature extraction, Multilayer feature fusion

CLC Number: TP183