Computer Science, 2025, Vol. 52, Issue (6): 219-227. doi: 10.11896/jsjkx.240400150
SHEN Xinyang (沈心旸)1, WANG Shanmin (王善敏)2, SUN Yubao (孙玉宝)1
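Abstract: Depression has become a major global public health problem, and speech-based depression recognition aims to identify it in a low-cost, easily deployable way. Existing studies typically split a long speech signal into multiple segments and train on them as independent samples. This fails to exploit the relationships between different positions in the speech signal, and it ignores the interference that segments irrelevant to the recognition target introduce into the result. To address these problems, a depression recognition method based on speech corpus alignment and adaptive fusion is proposed. The input speech is first split by corpus item; a multi-head attention mechanism then models the correlations between corpus items, and a segment importance mining module automatically learns an importance coefficient for each segment of the speech. Local and global features are thereby fused effectively for recognition, which improves recognition accuracy. The proposed method achieves a weighted accuracy, unweighted accuracy, and F1 score of 82.59%, 82.17%, and 82.23% on the MODMA dataset, and 74.44%, 68.33%, and 69.25% on the SEARCH dataset. The experimental results show that the proposed method can accurately recognize depression from speech signals.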
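The two mechanisms named in the abstract, attention across corpus items and learned per-segment importance coefficients, can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the abstract does not give the layer layout, feature extractor, or fusion rule, so everything here is an assumption made for illustration. The function names (`multi_head_self_attention`, `fuse_segments`, `init_params`), the parameter names (`w_q`, `w_k`, `w_v`, `w_o`, `w_imp`, `w_cls`), the residual connection, and the softmax-weighted pooling are all hypothetical, and random weights stand in for trained ones.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over segments.

    x has shape (n_segments, d_model); the result has the same shape.
    """
    n, d = x.shape
    if d % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d // num_heads

    def split(t):
        # (n, d) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    context = softmax(scores, axis=-1) @ v               # (heads, n, d_head)
    merged = context.transpose(1, 0, 2).reshape(n, d)    # (n, d)
    return merged @ w_o


def fuse_segments(x, params, num_heads=4):
    """Relate segments to each other, weight them, and classify.

    Returns (class probabilities, per-segment importance coefficients).
    The fusion rule below is an illustrative assumption.
    """
    related = multi_head_self_attention(
        x, params["w_q"], params["w_k"], params["w_v"], params["w_o"], num_heads
    )
    local = x + related                                  # residual: local + context
    importance = softmax(local @ params["w_imp"], axis=0)  # (n,), sums to 1
    global_feat = importance @ local                     # weighted pooling -> (d,)
    probs = softmax(global_feat @ params["w_cls"])       # (n_classes,)
    return probs, importance


def init_params(d_model, n_classes=2, seed=0):
    """Random stand-ins for trained weights (hypothetical shapes)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    shapes = {
        "w_q": (d_model, d_model),
        "w_k": (d_model, d_model),
        "w_v": (d_model, d_model),
        "w_o": (d_model, d_model),
        "w_imp": (d_model,),
        "w_cls": (d_model, n_classes),
    }
    return {name: rng.normal(scale=scale, size=shape) for name, shape in shapes.items()}


if __name__ == "__main__":
    # Six segment embeddings of dimension 16, standing in for real speech features.
    segments = np.random.default_rng(1).normal(size=(6, 16))
    probs, importance = fuse_segments(segments, init_params(16), num_heads=4)
    print("class probabilities:", probs)
    print("segment importance:", importance)
```

Because the importance coefficients pass through a softmax over segments, a segment that carries little task-relevant information can be driven toward a near-zero weight during training instead of contributing equally to the pooled feature, which is the interference problem the abstract attributes to treating segments as independent samples.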