Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230700083-6. doi: 10.11896/jsjkx.230700083

• Artificial Intelligence •


Speaker Verification Network Based on Multi-scale Convolutional Encoder

LIU Xiaohu1, CHEN Defu1, LI Jun2, ZHOU Xuwen1, HU Shan1, ZHOU Hao1   

  1. School of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
    2. Zhejiang Iflytek Intelligent Technology Co., Ltd, Hangzhou 310000, China
  • Published: 2024-06-06
  • Corresponding author: CHEN Defu (defuchen@zjut.edu.cn)
  • About author: LIU Xiaohu, born in 2000, postgraduate (201806060508@zjut.edu.cn). His main research interests include speaker recognition and deep learning.
    CHEN Defu, born in 1981, Ph.D. His main research interests include data intelligence, IoT theory and architecture.
  • Supported by:
    Hangzhou Major Scientific and Technological Innovation Project (2022AIZD0055).



Abstract: Speaker verification is an effective biometric authentication method, and the quality of speaker embedding features largely determines the performance of a speaker verification system. Recently, the Transformer model has shown great potential in automatic speech recognition, but the traditional self-attention mechanism in the Transformer is weak at extracting local features, making it difficult to obtain effective speaker embeddings; as a result, Transformer models have struggled to surpass earlier convolutional network-based models in speaker verification. To improve the Transformer's ability to capture local features, this paper proposes a new self-attention mechanism for the Transformer encoder, called the multi-scale convolutional self-attention encoder (MCAE). It applies convolution operations at different scales to extract multi-time-scale information and fuses features from the time and frequency domains, giving the model a richer local feature representation; such an encoder design is more effective for speaker verification. Experiments show that the proposed method achieves better overall performance on three publicly available test sets. The MCAE is also more lightweight than the conventional Transformer encoder, which favors deploying the model in applications.
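The abstract only outlines the MCAE design, so the following is a minimal, hypothetical sketch of the general idea it describes: enriching self-attention with parallel depthwise convolutions at several kernel sizes and fusing the branches before attention. The kernel sizes, the 1x1-convolution fusion, and the use of PyTorch's built-in multi-head attention are illustrative assumptions, not the paper's actual architecture (which additionally fuses time- and frequency-domain features and includes lightweight design choices not reproduced here).

import torch
import torch.nn as nn

class MultiScaleConvAttention(nn.Module):
    """Self-attention preceded by parallel depthwise convolutions at
    several time scales (hypothetical configuration for illustration)."""

    def __init__(self, dim: int, num_heads: int = 4, kernel_sizes=(3, 7, 15)):
        super().__init__()
        # One depthwise conv per scale; odd kernels with k // 2 padding
        # preserve the sequence length.
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim)
            for k in kernel_sizes
        ])
        # A 1x1 convolution fuses the concatenated branches back to dim.
        self.fuse = nn.Conv1d(dim * len(kernel_sizes), dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        y = x.transpose(1, 2)                         # (batch, dim, time)
        y = torch.cat([branch(y) for branch in self.branches], dim=1)
        y = self.fuse(y).transpose(1, 2)              # back to (batch, time, dim)
        out, _ = self.attn(y, y, y)                   # attend over fused features
        return self.norm(x + out)                     # residual + layer norm

if __name__ == "__main__":
    frames = torch.randn(2, 200, 80)  # e.g. 2 utterances, 200 frames, 80-dim features
    block = MultiScaleConvAttention(dim=80)
    print(block(frames).shape)        # torch.Size([2, 200, 80])

In a full encoder, a block like this would replace the standard self-attention sublayer and be stacked with feed-forward sublayers; the paper's frequency-domain fusion and lightweight refinements go beyond this sketch.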

Key words: Speaker verification, Speaker embedding, Self-attention mechanism, Transformer encoder, Multi-scale convolution

CLC Number: TP301