Computer Science (计算机科学) ›› 2025, Vol. 52 ›› Issue (3): 214-221. doi: 10.11896/jsjkx.240100222
WANG Mengwei (王萌威), YANG Zhe (杨哲)
Abstract: In existing speaker verification methods, the time-delay neural network (TDNN) used to extract frame-level features has two problems: it lacks the ability to model local frequency features, and its multi-layer feature fusion scheme cannot effectively model the complex relationships between high-level and low-level features. To address these problems, a new front-end model and a new multi-layer feature fusion method are proposed. In the front-end model, the input feature map is divided into multiple sub-bands, and the frequency range of each sub-band is enlarged layer by layer, so that the TDNN can model local frequency features progressively. Meanwhile, a backward path from high layers to low layers is added to the backbone model to model the relationships between the output features of adjacent layers, and the outputs of each layer in the backward path are concatenated to form the fused features. In addition, an inverted-bottleneck design is adopted in the backbone to further improve performance. Experimental results on the VoxCeleb1 test set show that, compared with current TDNN methods, the proposed method reduces the equal error rate and the minimum detection cost function by 9% and 14% respectively, while using only 52% of the parameters.
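The abstract does not give the exact sub-band schedule or fusion operator, so the sketch below is only illustrative: it assumes sub-band width doubles at each layer (capped at the full frequency range), and models the backward path as a simple top-down addition of higher-layer features into lower-layer ones before concatenation. Function names (`subband_slices`, `backward_fuse`) and the growth factor are hypothetical, not from the paper.

```python
import numpy as np

def subband_slices(num_bins, num_bands, layer, growth=2):
    """Hypothetical schedule: the sub-band width starts at
    num_bins / num_bands and is multiplied by `growth` at each layer,
    capped at the full range, so the frequency receptive field of each
    branch expands progressively."""
    base = num_bins // num_bands
    width = min(base * growth ** layer, num_bins)
    slices = []
    for b in range(num_bands):
        start = min(b * base, num_bins - width)  # clamp to stay in range
        slices.append(slice(start, start + width))
    return slices

def backward_fuse(layer_outputs):
    """Hypothetical backward (high-to-low) path: each lower-level feature
    is combined (here by simple addition) with the refined feature from
    the layer above; the refined features are then concatenated."""
    refined = [layer_outputs[-1]]
    for f in reversed(layer_outputs[:-1]):
        refined.append(f + refined[-1])  # inject higher-level information
    refined.reverse()  # restore low-to-high order before concatenation
    return np.concatenate(refined, axis=0)

# Usage sketch: split an 80-bin Fbank map into 4 sub-bands at layer 1,
# then fuse three same-shaped layer outputs through the backward path.
x = np.random.randn(80, 200)                      # (freq bins, frames)
bands = [x[s] for s in subband_slices(80, 4, layer=1)]
fused = backward_fuse([np.ones((2, 3))] * 3)      # shape (6, 3)
```

In a real model the per-band branches would be TDNN layers and the fusion operator a learned projection; the point here is only the progressive widening of the sub-bands and the direction of the fusion path.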