Computer Science ›› 2025, Vol. 52 ›› Issue (3): 214-221.doi: 10.11896/jsjkx.240100222

• Computer Graphics & Multimedia •

Speaker Verification Method Based on Sub-band Front-end Model and Inverse Feature Fusion

WANG Mengwei, YANG Zhe   

  1. School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
  • Received:2024-01-31 Revised:2024-06-12 Online:2025-03-15 Published:2025-03-07
  • About author:WANG Mengwei,born in 1998,postgraduate.His main research interests include speaker recognition and audio classification.
    YANG Zhe,born in 1978,Ph.D.,associate professor.His main research interests include artificial intelligence,machine learning and big data.
  • Supported by:
    Ministry of Education University-Industry Collaborative Education Program(220606363154256).

Abstract: Time delay neural networks (TDNN) used to extract frame-level features in existing speaker verification methods suffer from two problems: they lack the ability to model local frequency features, and their multilayer feature fusion approach cannot effectively model the complex relationships between high-level and low-level features. Therefore, a new front-end model and a new multilayer feature fusion approach are proposed. In the front-end model, the input feature map is divided into multiple sub-bands whose frequency range is expanded layer by layer, so that the TDNN can model local frequency features progressively. Meanwhile, a new inverse path running from higher to lower layers is added to the backbone model to capture the relationship between the output features of adjacent layers, and the outputs of each layer in the inverse path are concatenated to form the fused features. In addition, an inverted bottleneck design is used in the backbone model to further improve performance. Experimental results on the VoxCeleb1 test set show that, compared with a current TDNN method, the proposed method achieves a relative reduction of 9% in equal error rate and 14% in minimum detection cost function, while using only 52% of that method's parameters.
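To make the two ideas in the abstract concrete, the following is a rough NumPy sketch, not the authors' implementation: plain arrays stand in for TDNN layer outputs, the sub-band split is a simple equal partition along the frequency axis, and the adjacent-layer combination in the inverse path is reduced to element-wise addition for illustration.

```python
import numpy as np

def split_subbands(feats, n_bands):
    """Split a (freq_bins, frames) feature map into n_bands equal sub-bands."""
    return np.split(feats, n_bands, axis=0)

def inverse_fusion(layer_outputs):
    """Top-down (inverse) path: each layer's output is combined with the
    refined output of the layer above it; all refined outputs are then
    concatenated to form the fused frame-level features."""
    refined = [layer_outputs[-1]]              # start from the highest layer
    for feat in reversed(layer_outputs[:-1]):
        # element-wise addition stands in for the adjacent-layer relation
        refined.append(feat + refined[-1])
    refined.reverse()                          # restore low-to-high order
    return np.concatenate(refined, axis=0)

# toy example: 80 mel bins x 200 frames, split into 4 sub-bands
feats = np.random.randn(80, 200)
bands = split_subbands(feats, 4)               # four (20, 200) sub-bands
outs = list(bands)                             # stand-ins for layer outputs
fused = inverse_fusion(outs)
print(fused.shape)                             # (80, 200)
```

In the paper the sub-bands' frequency coverage also widens layer by layer and each stage is a learned TDNN block; this sketch only shows the data flow of the split and the high-to-low fusion path.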

Key words: Speaker recognition, Speaker verification, Time delay neural network, Sub-band feature extraction, Multilayer feature fusion

CLC Number: 

  • TP183