Computer Science ›› 2025, Vol. 52 ›› Issue (6): 219-227. doi: 10.11896/jsjkx.240400150

• Computer Graphics & Multimedia •

Depression Recognition Based on Speech Corpus Alignment and Adaptive Fusion

SHEN Xinyang1, WANG Shanmin2, SUN Yubao1

  1 School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
  2 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Received: 2024-04-20  Revised: 2024-08-08  Online: 2025-06-15  Published: 2025-06-11
  • Corresponding author: SUN Yubao (sunyb@nuist.edu.cn)
  • About author: SHEN Xinyang (calmsxy@sina.com), born in 1997, postgraduate. His main research interest is speech depression recognition.
    SUN Yubao, born in 1983, professor, Ph.D. supervisor, is a member of CCF (No. 3755S). His main research interests include cross-modal media analysis and affective computing.
  • Supported by:
    National Key R&D Program of China (2022YFC2405600) and National Natural Science Foundation of China (62276139).

Abstract: Depression has become a significant global public health issue. Speech-based depression recognition aims to identify depression in an easily scalable, low-cost manner. Prior studies typically divide a long speech signal into multiple slices and train on them as independent samples, or at best relate them through temporal modules; they fail to exploit the intra- and inter-relationships among the segmented speech, and task-irrelevant segments are allowed to interfere with the result. This paper proposes a depression recognition method based on speech corpus alignment and adaptive fusion. After splitting the input speech into corpus segments, correlations among the segments are modeled through a multi-head attention mechanism, and a segment importance mining module automatically learns an importance coefficient for each segment, so that local and global features are fused effectively for recognition. The proposed method achieves a weighted accuracy of 82.59%, an unweighted accuracy of 82.17%, and an F1 score of 82.23% on the MODMA dataset, and 74.44%, 68.33%, and 69.25%, respectively, on the SEARCH dataset. The experimental results demonstrate that the proposed method can accurately recognize depression from speech, outperforming existing works.

Key words: Multi-head attention mechanism, Correlation modeling, Importance mining, Feature fusion, Depression recognition
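The pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under assumed details (module name, feature dimension, head count, and two-class output are all hypothetical, not the authors' released code): segment-level speech features are correlated via multi-head self-attention, a learned scalar score per segment performs importance mining, and the importance-weighted sum fuses the local segment features into a global representation for classification.

```python
import torch
import torch.nn as nn

class SegmentFusion(nn.Module):
    """Sketch of correlation modeling + segment importance mining."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Inter-segment correlation modeling via multi-head self-attention.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Importance mining: one scalar score per segment.
        self.score = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, 2)  # depressed vs. healthy logits

    def forward(self, segs):                       # segs: (batch, n_seg, dim)
        ctx, _ = self.attn(segs, segs, segs)       # correlated local features
        w = torch.softmax(self.score(ctx), dim=1)  # importance coefficients
        fused = (w * ctx).sum(dim=1)               # adaptive global fusion
        return self.classifier(fused)              # (batch, 2)

x = torch.randn(8, 12, 256)  # 8 utterances, 12 segments, 256-dim features
logits = SegmentFusion()(x)
print(logits.shape)          # torch.Size([8, 2])
```

Because the softmax weights sum to one over the segment axis, segments the model deems task-irrelevant receive near-zero weight and contribute little to the fused global feature.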

CLC number: TP391