Computer Science ›› 2025, Vol. 52 ›› Issue (2): 231-241. doi: 10.11896/jsjkx.240100059

• Artificial Intelligence •


Role-aware Speaker Diarization in Autism Interview Scenarios

WANG Kangyue1,2, CHENG Ming1,2, XIE Yixiang2,3, ZOU Xiaobing3, LI Ming1,2   

  1 School of Computer Science,Wuhan University,Wuhan 430072,China
    2 Data Science Research Center,Duke Kunshan University,Suzhou,Jiangsu 215316,China
    3 Child Development and Behavior Center,Third Affiliated Hospital of Sun Yat-sen University,Guangzhou 510630,China
  • Received:2024-01-04 Revised:2024-06-02 Online:2025-02-15 Published:2025-02-17
  • Corresponding author: LI Ming(ming.li369@dukekunshan.edu.cn)
  • About author:WANG Kangyue,born in 2000,master.Her main research interests include deep learning and speaker diarization.(kangyue.wang@whu.edu.cn)
    LI Ming,born in 1983,Ph.D,professor,Ph.D supervisor.His main research interests include audio,speech and language processing,as well as multimodal behavior signal analysis and interpretation.
  • Supported by:
    General Program of the National Natural Science Foundation of China(62171207) and Guangzhou Municipal Key R&D Program(202007030011).


Abstract: Speaker diarization plays a pivotal role in intelligent speech transcription.Its core task is to segment and cluster multi-speaker audio by speaker identity,thereby facilitating better organization of audio content and transcribed text.In medical interview scenarios,speaker diarization serves as a prerequisite for subsequent automated assessment.Role information is naturally present in medical interactive dialogue:taking autism-assisted diagnosis as an example,a typical session involves three well-defined roles,namely the doctor,the parent,and the child undergoing diagnosis.However,in actual conversations the correspondence between roles and speakers is not always one-to-one.For instance,during autism diagnosis each session involves exactly one child,while the number of doctors or parents may vary.We believe that the role information and the voiceprint information embedded in each speech segment can effectively complement each other and thereby reduce the error rate,so we propose a method that integrates role information into the sequence-to-sequence target-speaker voice activity detection(Seq2Seq-TSVAD) framework.On the CPEP-3 dataset,the proposed method achieves a diarization error rate(DER) of 20.61%,which is 9.8% lower than the Seq2Seq-TSVAD baseline and 19.3% lower than the conventional modular speaker diarization method,underscoring the clear benefit of role information for speaker diarization in autism interview scenarios.
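The paper's implementation is not reproduced on this page.As a minimal sketch of the idea described above,the following PyTorch snippet shows one way a Seq2Seq-TSVAD-style decoder could condition its per-speaker queries on both a voiceprint embedding and a learned role embedding(doctor/parent/child).All module names,dimensions,and the fusion-by-addition choice are illustrative assumptions,not the authors' implementation.

```python
# Minimal sketch (assumption, not the authors' code): a Seq2Seq-TSVAD-style
# decoder whose per-speaker queries fuse a voiceprint embedding with a
# learned role embedding (e.g., 0 = doctor, 1 = parent, 2 = child).
import torch
import torch.nn as nn

class RoleAwareTSVAD(nn.Module):
    def __init__(self, feat_dim=80, spk_dim=256, n_roles=3,
                 d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.front = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        self.spk_proj = nn.Linear(spk_dim, d_model)   # voiceprint -> query space
        self.role_emb = nn.Embedding(n_roles, d_model)  # role -> query space
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)

    def forward(self, feats, spk_embs, role_ids):
        # feats:    (B, T, feat_dim)  acoustic features
        # spk_embs: (B, S, spk_dim)   one voiceprint per target speaker
        # role_ids: (B, S)            role index of each target speaker
        memory = self.encoder(self.front(feats))                     # (B, T, d)
        queries = self.spk_proj(spk_embs) + self.role_emb(role_ids)  # (B, S, d)
        out = self.decoder(queries, memory)                          # (B, S, d)
        # Per-speaker, per-frame speech-activity posterior in [0, 1].
        return torch.sigmoid(torch.einsum('bsd,btd->bst', out, memory))

# Toy usage: 2 sessions, 200 frames, up to 4 target speakers per session.
model = RoleAwareTSVAD()
activity = model(torch.randn(2, 200, 80),
                 torch.randn(2, 4, 256),
                 torch.tensor([[0, 1, 1, 2], [0, 0, 1, 2]]))
print(activity.shape)  # torch.Size([2, 4, 200])
```

Adding the role embedding to the speaker query is only one plausible fusion; concatenation or cross-attention over role posteriors would fit the same framing.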

Key words: Speaker diarization, Role classification, Target-speaker voice activity detection, Voiceprint feature extraction, Autism spectrum disorder
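Since DER is the headline metric above(20.61% on CPEP-3),a toy frame-level illustration of how it aggregates missed speech,false alarms,and speaker confusion may help.This sketch assumes one labeled speaker per frame and hypothesis labels already mapped to reference labels;standard scoring tools(e.g.,NIST md-eval) additionally handle overlapping speech,forgiveness collars,and the optimal speaker mapping.

```python
# Toy frame-level DER sketch (for illustration only): each frame carries one
# reference and one hypothesis label, with None meaning silence. Real scoring
# also finds the optimal reference-hypothesis speaker mapping and applies
# collars around segment boundaries.
def der(ref, hyp):
    assert len(ref) == len(hyp)
    speech = sum(r is not None for r in ref)  # scored speech frames
    miss = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))
    confusion = sum(r is not None and h is not None and r != h
                    for r, h in zip(ref, hyp))
    return (miss + false_alarm + confusion) / max(speech, 1)

ref = ['doctor', 'doctor', None, 'child', 'child', 'child']
hyp = ['doctor', None,     None, 'child', 'doctor', 'child']
print(f'DER = {der(ref, hyp):.2%}')  # 1 miss + 1 confusion over 5 -> 40.00%
```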

CLC number: TP391.42

References
[1]TRANTER S E,REYNOLDS D A.An Overview of Automatic Speaker Diarization Systems[J].IEEE Transactions on Audio,Speech,and Language Processing,2006,14(5):1557-1565.
[2]ANGUERA X,BOZONNET S,EVANS N,et al.Speaker diarization:A review of recent research[J].IEEE Transactions on Audio,Speech,and Language Processing,2012,20(2):356-370.
[3]PARK T J,KANDA N,DIMITRIADIS D,et al.A review of speaker diarization:Recent advances with deep learning[J].Computer Speech & Language,2022,72:101317.
[4]KENNY P,REYNOLDS D,CASTALDO F.Diarization of telephone conversations using factor analysis[J].IEEE Journal of Selected Topics in Signal Processing,2010,4(6):1059-1070.
[5]WEBER P,ENZINGER E,LABRADOR B,et al.Validations of an alpha version of the E3 Forensic Speech Science System(E3FS3) core software tools[J].Forensic Science International:Synergy,2022,4:100223.
[6]KUMAR M,KIM S H,LORD C,et al.Improving speaker diarization for naturalistic child-adult conversational interactions using contextual information[J].The Journal of the Acoustical Society of America,2020,147(2):196-200.
[7]MIRHEIDARI B,BLACKBURN D,HARKNESS K,et al.Toward the automation of diagnostic conversation analysis in patients with memory complaints[J].Journal of Alzheimer's Disease,2017,58(2):373-387.
[8]RYANT N,SINGH P,KRISHNAMOHAN V,et al.The third DIHARD diarization challenge[J].arXiv:2012.01477,2020.
[9]XU X,ZOU X B,LI T Y.Expert consensus on early identification,screening,and early intervention for children with autism spectrum disorders [J].Chinese Journal of Pediatrics,2017,55(12):890-897.
[10]PENG Y H,JING J,YE X F,et al.Response Characteristics of Children with Autism Spectrum Disorder with Different Cognitive Functions in PEP-3 [J].Chinese Journal of Child Health Care,2014,22(4):358-360.
[11]SONG D Y,KIM S Y,BONG G,et al.The use of artificial intelligence in screening and diagnosis of autism spectrum disorder:a literature review[J].Journal of the Korean Academy of Child and Adolescent Psychiatry,2019,30(4):145.
[12]NORDAHL-HANSEN A,KAALE A,ULVUND S E.Language assessment in children with autism spectrum disorder:Concurrent validity between report-based assessments and direct tests[J].Research in Autism Spectrum Disorders,2014,8(9):1100-1106.
[13]CHENG M,ZHANG Y,XIE Y,et al.Computer-Aided Autism Spectrum Disorder Diagnosis With Behavior Signal Processing[J].IEEE Transactions on Affective Computing,2023,14(4):2982-3000.
[14]HAN J,LIU J,ZHOU Y Y.The Current Application Status of Information Technology in Autism Intervention [J].China Educational Technology and Equipment,2023(23):1-5,22.
[15]LAHIRI R,FENG T,HEBBAR R,et al.Robust self supervised speech embeddings for child-adult classification in interactions involving children with autism[J].arXiv:2307.16398,2023.
[16]XU A,HEBBAR R,LAHIRI R,et al.Understanding Spoken Language Development of Children with ASD Using Pre-trained Speech Embeddings[J].arXiv:2305.14117,2023.
[17]LIN Q,CAI W,YANG L,et al.DIHARD II is still hard:Experimental results and discussions from the DKU-LENOVO team[J].arXiv:2002.12761,2020.
[18]WOOTERS C,FUNG J,PESKIN B,et al.Towards robust speaker segmentation:The ICSI-SRI fall 2004 diarization system[J/OL].http://www1.cs.columbia.edu/~julia/papers/EARS-RT04f-spkr.pdf.
[19]TSIARTAS A,CHASPARI T,KATSAMANIS N,et al.Multi-band long-term signal variability features for robust voice activity detection[C]//Interspeech.2013:718-722.
[20]LIN Q,LI T,LI M.The DKU Speech Activity Detection and Speaker Identification Systems for Fearless Steps Challenge Phase-02[C]//Interspeech.2020:2607-2611.
[21]ZHANG Z C,TAN Z W,ZHANG C R,et al.Speech Endpoint Detection Based on Bayesian Decision of Logarithmic Spectrum Ratio in High and Low Frequency Bands [J].Computer Science,2021,48(6A):33-37.
[22]TEMKO A,MACHO D,NADEU C.Enhanced SVM training for robust speech activity detection[C]//2007 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP'07).Vol.4.IEEE,2007.
[23]ZAZO R,SAINATH T N,SIMKO G,et al.Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection[C]//Interspeech.2016:3668-3672.
[24]CHANG S Y,LI B,SIMKO G,et al.Temporal modeling using dilated convolution and gating for voice-activity-detection[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2018:5549-5553.
[25]YIN R,BREDIN H,BARRAS C.Speaker change detection in broadcast TV using bidirectional long short-term memory networks[C]//Interspeech.2017.
[26]HRÚZ M,ZAJÍC Z.Convolutional neural network for speaker change detection in telephone speaker diarization system[C]//2017 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2017:4945-4949.
[27]WISNIEWSKI G,BREDIN H,GELLY G,et al.Combining speaker turn embedding and incremental structure prediction for low-latency speaker diarization[C]//Interspeech.2017.
[28]YIN R,BREDIN H,BARRAS C.Neural speech turn segmentation and affinity propagation for speaker diarization[C]//Annual Conference of the International Speech Communication Association.2018.
[29]LIN Q,YIN R,LI M,et al.LSTM based similarity measurement with spectral clustering for speaker diarization[J].arXiv:1907.10393,2019.
[30]SELL G,SNYDER D,MCCREE A,et al.Diarization is Hard:Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge[C]//Interspeech.2018:2808-2812.
[31]LIN Q,HOU Y,LI M.Self-attentive similarity measurement strategies in speaker diarization[C]//Interspeech.2020:284-288.
[32]SELL G,GARCIA-ROMERO D.Speaker diarization with PLDA i-vector scoring and unsupervised calibration[C]//2014 IEEE Spoken Language Technology Workshop(SLT).IEEE,2014:413-417.
[33]DEHAK N,KENNY P J,DEHAK R,et al.Front-end factor analysis for speaker verification[J].IEEE Transactions on Audio,Speech,and Language Processing,2010,19(4):788-798.
[34]GARCIA-ROMERO D,ESPY-WILSON C Y.Analysis of i-vector length normalization in speaker recognition systems[C]//Twelfth Annual Conference of the International Speech Communication Association.2011.
[35]SNYDER D,GARCIA-ROMERO D,SELL G,et al.X-vectors:Robust DNN embeddings for speaker recognition[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2018:5329-5333.
[36]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[37]DENG L H,DENG F,ZHANG G X,et al.Multi-Scale End-to-End Speaker Recognition System Based on Improved Res2Net[J].Computer Engineering and Applications,2023,59(24):110-120.
[38]SENOUSSAOUI M,KENNY P,STAFYLAKIS T,et al.A study of the cosine distance-based mean shift for telephone speech diarization[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2013,22(1):217-227.
[39]KENNY P,STAFYLAKIS T,OUELLET P,et al.PLDA for speaker verification with utterances of arbitrary duration[C]//2013 IEEE International Conference on Acoustics,Speech and Signal Processing.IEEE,2013:7649-7653.
[40]HAMERLY G,ELKAN C.Learning the k in k-means[C]//Proceedings of the 16th International Conference on Neural Information Processing Systems.2003:281-288.
[41]GOWDA K C,KRISHNA G.Agglomerative clustering using the concept of mutual nearest neighbourhood[J].Pattern Recognition,1978,10(2):105-112.
[42]ZHANG A,WANG Q,ZHU Z,et al.Fully supervised speaker diarization[C]//2019 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2019:6301-6305.
[43]LI S,CAO F.Analysis and Trend Research on End-to-End Framework Models of Intelligent Speech Technology[J].Computer Science,2022,49(S1):331-336.
[44]WANG Q,DOWNEY C,WAN L,et al.Speaker diarization with LSTM[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2018:5239-5243.
[45]YU D,KOLBÆK M,TAN Z H,et al.Permutation invariant training of deep models for speaker-independent multi-talker speech separation[C]//2017 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2017:241-245.
[46]FUJITA Y,KANDA N,HORIGUCHI S,et al.End-to-end neural speaker diarization with permutation-free objectives[J].arXiv:1909.05952,2019.
[47]HORIGUCHI S,FUJITA Y,WATANABE S,et al.End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors[J].arXiv:2005.09921,2020.
[48]KINOSHITA K,DELCROIX M,TAWARA N.Integrating end-to-end neural and clustering-based diarization:Getting the best of both worlds[J].arXiv:2010.13366,2020.
[49]PRZYBOCKI M,MARTIN A.2000 NIST Speaker Recognition Evaluation(LDC2001S97)[S].Philadelphia,PA:Linguistic Data Consortium,2001.
[50]MEDENNIKOV I,KORENEVSKY M,PRISYACH T,et al.Target-speaker voice activity detection:A novel approach for multi-speaker diarization in a dinner party scenario[J].arXiv:2005.07272,2020.
[51]CHUNG J S,NAGRANI A,COTO E,et al.VoxSRC 2019:The first VoxCeleb speaker recognition challenge[J].arXiv:1912.02522,2019.
[52]NAGRANI A,CHUNG J S,HUH J,et al.VoxSRC 2020:The second VoxCeleb speaker recognition challenge[J].arXiv:2012.06867,2020.
[53]BROWN A,HUH J,CHUNG J S,et al.VoxSRC 2021:The third VoxCeleb speaker recognition challenge[J].arXiv:2201.04583,2022.
[54]HUH J,BROWN A,JUNG J W,et al.VoxSRC 2022:The fourth VoxCeleb speaker recognition challenge[J].arXiv:2302.10248,2023.
[55]RYANT N,CHURCH K,CIERI C,et al.First DIHARD challenge evaluation plan[J/OL].https://catalog.ldc.upenn.edu/docs/LDC2019S12/first_dihard_eval_plan_v1.3.pdf.
[56]RYANT N,CHURCH K,CIERI C,et al.The Second DIHARD Diarization Challenge:Dataset,Task,and Baselines[C]//Interspeech.2019.
[57]RYANT N,CHURCH K,CIERI C,et al.Third DIHARD challenge evaluation plan[J].arXiv:2006.05815,2020.
[58]WATANABE S,MANDEL M,BARKER J,et al.CHiME-6 challenge:Tackling multispeaker speech recognition for unsegmented recordings[J].arXiv:2004.09249,2020.
[59]YU F,ZHANG S,FU Y,et al.M2MeT:The ICASSP 2022 multi-channel multi-party meeting transcription challenge[C]//2022 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2022:6167-6171.
[60]WANG W,CAI D,LIN Q,et al.The DKU-DukeECE-Lenovo system for the diarization task of the 2021 VoxCeleb speaker recognition challenge[J].arXiv:2109.02002,2021.
[61]WANG W,QIN X,CHENG M,et al.The DKU-DukeECE diarization system for the VoxCeleb speaker recognition challenge 2022[J].arXiv:2210.01677,2022.
[62]WANG W,LIN Q,CAI D,et al.Similarity measurement of segment-level speaker embeddings in speaker diarization[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2022,30:2645-2658.
[63]WANG W,QIN X,LI M.Cross-channel attention-based target speaker voice activity detection:Experimental results for the M2MeT challenge[C]//2022 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2022:9171-9175.
[64]CHEN Y,WANG S,QIAN Y,et al.End-to-end speaker-dependent voice activity detection[J].arXiv:2009.09906,2020.
[65]DING S,WANG Q,CHANG S Y,et al.Personal VAD:Speaker-conditioned voice activity detection[J].arXiv:1908.04284,2019.
[66]CHENG M,WANG W,ZHANG Y,et al.Target-speaker voice activity detection via sequence-to-sequence prediction[C]//2023 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2023:1-5.
[67]CHUNG J S,HUH J,NAGRANI A,et al.Spot the conversation:speaker diarisation in the wild[J].arXiv:2007.01216,2020.
[68]FLEMOTOMOS N,PAPADOPOULOS P,GIBSON J,et al.Combined Speaker Clustering and Role Recognition in Conversational Speech[C]//Interspeech.2018:1378-1382.
[69]FLEMOTOMOS N,NARAYANAN S.Multimodal clustering with role induced constraints for speaker diarization[J].arXiv:2204.00657,2022.
[70]LIU S,LI F,ZHANG H,et al.DAB-DETR:Dynamic anchor boxes are better queries for DETR[J].arXiv:2201.12329,2022.
[71]HU J,SHEN L,SUN G.Squeeze-and-excitation networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:7132-7141.
[72]DENG J,GUO J,XUE N,et al.ArcFace:Additive angular margin loss for deep face recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:4690-4699.
[73]QIN X,CAI D,LI M.Robust multi-channel far-field speaker verification under different in-domain data availability scenarios[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2022,31:71-85.
[74]CHUNG J S,NAGRANI A,ZISSERMAN A.VoxCeleb2:Deep speaker recognition[J].arXiv:1806.05622,2018.
[75]QIN X,LI M,BU H,et al.The Interspeech 2020 far-field speaker verification challenge[J].arXiv:2005.08046,2020.
[76]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of deep bidirectional transformers for language understanding[J].arXiv:1810.04805,2018.
[77]HUANG Q,FANG M Y.An Improved Speech Recognition System Based on HMM Algorithm[J].Journal of Chongqing Technology and Business University(Natural Science Edition),2022,39(5):56-61.
[78]YAO Z,WU D,WANG X,et al.WeNet:Production oriented streaming and non-streaming end-to-end speech recognition toolkit[J].arXiv:2102.01547,2021.
[79]FU Y,CHENG L,LV S,et al.AISHELL-4:An open source dataset for speech enhancement,separation,recognition and speaker diarization in conference scenario[J].arXiv:2104.03603,2021.
[80]WANG Z,WU S,CHEN H,et al.The multimodal information based speech processing(MISP) 2022 challenge:Audio-visual diarization and recognition[C]//ICASSP 2023-2023 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2023:1-5.
[81]GULATI A,QIN J,CHIU C C,et al.Conformer:Convolution-augmented transformer for speech recognition[J].arXiv:2005.08100,2020.
[82]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017:6000-6010.
[83]SNYDER D,CHEN G,POVEY D.MUSAN:A music,speech,and noise corpus[J].arXiv:1510.08484,2015.
[84]UNWIN A,KLEINMAN K.The iris data set:In search of the source of virginica[J].Significance,2021,18(6):26-29.
[85]BREDIN H,YIN R,CORIA J M,et al.pyannote.audio:neural building blocks for speaker diarization[J].arXiv:1911.01255,2019.
[86]LANDINI F,PROFANT J,DIEZ M,et al.Bayesian HMM clustering of x-vector sequences(VBx) in speaker diarization:Theory,implementation and analysis on standard tasks[J].Computer Speech & Language,2022,71:101254.