Computer Science ›› 2025, Vol. 52 ›› Issue (2): 231-241. DOI: 10.11896/jsjkx.240100059
WANG Kangyue1,2, CHENG Ming1,2, XIE Yixiang2,3, ZOU Xiaobing3, LI Ming1,2
Abstract: Speaker diarization plays a key role in intelligent speech transcription. Its core task is to segment and cluster multi-speaker audio by speaker identity, so that the audio content and its transcripts can be better organized. In the field of medical interviews, speaker diarization is a prerequisite for automated assessment. Medical interactive dialogue naturally carries role information: taking assisted autism diagnosis as an example, a typical session involves three well-defined roles, namely the clinician, the parent, and the child being assessed. In actual conversations, however, the correspondence between roles and speakers is not necessarily one-to-one; for instance, each autism diagnosis session involves exactly one child, while the number of clinicians or parents is not fixed. This paper argues that the role information implicit in speech segments can effectively complement voiceprint information and thereby reduce the error rate, and accordingly proposes a method that introduces role information into sequence-to-sequence target-speaker voice activity detection (Seq2Seq-TSVAD). On the CPEP-3 dataset, the proposed method achieves a diarization error rate (DER) of 20.61%, which is 9.8% lower than the Seq2Seq-TSVAD method and 19.3% lower than a modular speaker diarization approach, indicating that role information plays a significant part in improving speaker diarization performance in autism interview scenarios.
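For reference, the DER quoted above is the standard diarization metric: the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The Python sketch below illustrates a simplified frame-level version of this metric; it is not the paper's evaluation code, the function name is ours, and it assumes at most one speaker per frame with hypothesis speaker ids already optimally mapped to reference ids (production scoring tools such as NIST md-eval or pyannote.metrics additionally handle overlapped speech and the label mapping).

```python
import numpy as np

def frame_level_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Simplified frame-level DER.

    ref, hyp: per-frame integer speaker labels, where 0 denotes silence
    and k > 0 a speaker id. Assumes single-speaker frames and hypothesis
    ids already mapped to reference ids.
    """
    speech = ref > 0                                       # reference speech frames
    false_alarm = np.sum((ref == 0) & (hyp > 0))           # hyp speaks, ref is silent
    missed = np.sum(speech & (hyp == 0))                   # ref speaks, hyp is silent
    confusion = np.sum(speech & (hyp > 0) & (ref != hyp))  # both speak, wrong speaker
    return (false_alarm + missed + confusion) / max(int(np.sum(speech)), 1)

# Toy example: 1 false-alarm frame + 1 confused frame over 4 speech frames.
ref = np.array([0, 1, 1, 2, 2, 0])
hyp = np.array([1, 1, 1, 2, 1, 0])
print(frame_level_der(ref, hyp))  # 0.5
```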