Computer Science, 2024, Vol. 51, Issue (6A): 230700083-6. doi: 10.11896/jsjkx.230700083
LIU Xiaohu1, CHEN Defu1, LI Jun2, ZHOU Xuwen1, HU Shan1, ZHOU Hao1
Abstract: Speaker verification is an effective method of biometric authentication, and the quality of speaker embedding features largely determines the performance of a speaker verification system. Recently, Transformer models have shown great potential in automatic speech recognition. However, because the conventional self-attention mechanism in the Transformer is weak at extracting local features, it struggles to produce effective speaker embeddings, and Transformer models have therefore been unable to surpass earlier convolution-based models in speaker verification. To strengthen the Transformer's ability to capture local features, this paper proposes a new self-attention mechanism for the Transformer encoder, called the Multi-scale Convolutional Self-Attention Encoder (MCAE). It uses convolutions at different scales to extract information over multiple time scales and fuses time-domain and frequency-domain features, giving the model a richer local-feature representation; such an encoder design is more effective for speaker verification. Experiments on three public test sets show that the proposed method achieves better overall performance. Compared with the conventional Transformer encoder, MCAE is also more lightweight, which makes the model easier to deploy.
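The abstract describes the MCAE design only at a high level. Below is a minimal, illustrative PyTorch sketch of the general idea: parallel depthwise convolutions with different kernel sizes supply multi-time-scale local context, which is fused into the frame sequence before standard self-attention. The class name, kernel sizes, and sum-based fusion are assumptions for illustration, not the authors' exact architecture, and the paper's time-frequency feature fusion is simplified here to time-axis convolutions only.

```python
import torch
import torch.nn as nn


class MultiScaleConvSelfAttention(nn.Module):
    """Illustrative sketch: multi-scale depthwise convs feeding self-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4,
                 kernel_sizes: tuple = (3, 7, 15)):
        super().__init__()
        # One depthwise conv branch per time scale; kernel sizes are assumed values.
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim)
            for k in kernel_sizes
        ])
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level features.
        c = x.transpose(1, 2)                         # (batch, dim, time)
        # Fuse multi-scale local context by summation (an assumption).
        local = sum(b(c) for b in self.branches).transpose(1, 2)
        h = self.norm(x + local)                      # residual local context
        out, _ = self.attn(h, h, h)                   # global self-attention
        return h + out


# Example: 2 utterances, 200 frames, 256-dim features -> same shape out.
block = MultiScaleConvSelfAttention()
y = block(torch.randn(2, 200, 256))
```

A block like this could stand in for the self-attention sublayer of a Transformer encoder operating on (batch, time, feature) frame sequences; the depthwise convolutions add local context at negligible parameter cost, consistent with the abstract's claim that MCAE is lighter than a conventional Transformer encoder.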