Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230700083-6. doi: 10.11896/jsjkx.230700083

• Artificial Intelligence

Speaker Verification Network Based on Multi-scale Convolutional Encoder

LIU Xiaohu1, CHEN Defu1, LI Jun2, ZHOU Xuwen1, HU Shan1, ZHOU Hao1   

  1 School of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
  2 Zhejiang Iflytek Intelligent Technology Co., Ltd, Hangzhou 310000, China
  • Published:2024-06-06
  • About author: LIU Xiaohu, born in 2000, postgraduate. His main research interests include speaker recognition and deep learning.
    CHEN Defu, born in 1981, Ph.D. His main research interests include data intelligence, IoT theory and architecture.
  • Supported by:
    Hangzhou Major Scientific and Technological Innovation Project(2022AIZD0055).

Abstract: Speaker verification is an effective biometric authentication method, and the quality of the speaker embedding largely determines the performance of a speaker verification system. The Transformer model has recently shown great potential in automatic speech recognition, but its conventional self-attention mechanism is weak at local feature extraction, which makes it difficult to extract effective speaker embeddings. As a result, Transformer-based models have struggled to surpass earlier convolutional network-based models in speaker verification. To improve the Transformer's ability to extract local features, this paper proposes a new self-attention mechanism for the Transformer encoder, called the multi-scale convolutional self-attention encoder (MCAE). It uses convolution operations of different sizes to extract information at multiple time scales and fuses features in the time and frequency domains, which gives the model a richer representation of local features; such an encoder design is more effective for speaker verification. Experiments show that the proposed method achieves better overall performance on three publicly available test sets. MCAE is also more lightweight than the conventional Transformer encoder, which favors deployment of the model in applications.
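The multi-time-scale idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the MCAE architecture itself, which the abstract does not specify in detail: it only shows how convolutions with different kernel sizes over the time axis yield one feature per scale for every frame. The function names, the kernel sizes (3, 5, 7), and the fixed averaging kernels standing in for learned weights are all assumptions made for illustration.

```python
from typing import List, Sequence


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Slide an odd-length kernel along the time axis with zero padding,
    so the output has the same number of frames as the input."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd for symmetric padding")
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[t + i] * kernel[i] for i in range(k))
        for t in range(len(signal))
    ]


def multi_scale_features(
    signal: Sequence[float], kernel_sizes: Sequence[int] = (3, 5, 7)
) -> List[List[float]]:
    """Run one convolution branch per kernel size and concatenate the
    branch outputs frame by frame (hypothetical helper, not from the paper)."""
    branches = []
    for k in kernel_sizes:
        # Assumption: a uniform averaging kernel replaces learned weights.
        kernel = [1.0 / k] * k
        branches.append(conv1d_same(signal, kernel))
    # Each frame ends up with one value per time scale.
    return [list(frame) for frame in zip(*branches)]


if __name__ == "__main__":
    frames = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
    for t, feats in enumerate(multi_scale_features(frames)):
        print(t, [round(v, 3) for v in feats])
```

Larger kernels summarize a longer temporal context per frame, while smaller kernels stay close to the local value; in a trained model the kernel weights would be learned rather than fixed.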

Key words: Speaker verification, Speaker embedding, Self-attention mechanism, Transformer encoder, Multi-scale convolution

CLC Number: 

  • TP301