Computer Science ›› 2025, Vol. 52 ›› Issue (6): 219-227. doi: 10.11896/jsjkx.240400150

• Computer Graphics & Multimedia •

Depression Recognition Based on Speech Corpus Alignment and Adaptive Fusion

SHEN Xinyang1, WANG Shanmin2, SUN Yubao1   

  1 School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
  2 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Received: 2024-04-20 Revised: 2024-08-08 Online: 2025-06-15 Published: 2025-06-11
  • About author: SHEN Xinyang, born in 1997, postgraduate. His main research interest is speech depression recognition.
    SUN Yubao, born in 1983, professor, Ph.D. supervisor, is a member of CCF (No. 3755S). His main research interests include cross-modal media analysis and affective computing.
  • Supported by:
    National Key R&D Program of China (2022YFC2405600) and National Natural Science Foundation of China (62276139).

Abstract: Depression has become a significant global public health issue. Speech-based depression recognition aims to detect depression in a scalable and cost-effective manner. Prior studies often divide long speech recordings into multiple slices and either optimize models on the slices independently or model the relationships between them with temporal modules. These approaches do not fully exploit the relationships within and between speech segments, and they also introduce task-irrelevant information. This paper proposes a depression recognition method based on speech corpus alignment and adaptive fusion. After the input speech is segmented, multi-granularity feature correlations are established through a multi-head attention mechanism, and a segment importance mining module automatically learns the importance of different segments. The method effectively integrates local and global features, significantly improving recognition performance. On the MODMA database, the proposed method achieves a weighted accuracy of 82.59%, an unweighted accuracy of 82.17%, and an F1 score of 82.23%. On the SEARCH database, the weighted accuracy, unweighted accuracy, and F1 score are 74.44%, 68.33%, and 69.25%, respectively. The experiments demonstrate that the proposed model recognizes depression accurately and outperforms existing methods.
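
The pipeline outlined in the abstract (segment the speech, relate segments through multi-head attention, then weight segments by learned importance before fusing them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean pooling step, the identity query/key/value projections, and the fixed `score_vector` standing in for a learned importance scorer are all assumptions made for illustration, and the toy frames are synthetic.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def split_segments(frames, seg_len):
    """Cut frame-level feature vectors into fixed-length segments (a short tail is dropped)."""
    return [frames[i:i + seg_len] for i in range(0, len(frames) - seg_len + 1, seg_len)]


def mean_pool(segment):
    """Average the frames of one segment into a single local feature vector (assumed pooling)."""
    dim = len(segment[0])
    return [sum(f[d] for f in segment) / len(segment) for d in range(dim)]


def multi_head_self_attention(tokens, num_heads):
    """Scaled dot-product self-attention across segment vectors.

    Assumption: identity Q/K/V projections; each head attends over its own
    slice of the feature dimension. A trained model would learn projections.
    """
    dim = len(tokens[0])
    assert dim % num_heads == 0
    head_dim = dim // num_heads
    out = [[0.0] * dim for _ in tokens]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        heads = [t[lo:hi] for t in tokens]
        for i, q in enumerate(heads):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim) for k in heads]
            weights = softmax(scores)
            for d in range(head_dim):
                out[i][lo + d] = sum(w * v[d] for w, v in zip(weights, heads))
    return out


def importance_fusion(tokens, score_vector):
    """Weight each segment by a softmax-normalised importance score, then sum.

    Assumption: the score is a dot product with a fixed vector; in a real
    model this scorer would be learned.
    """
    scores = [sum(a * b for a, b in zip(t, score_vector)) for t in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    fused = [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
    return fused, weights


# Toy input: 12 synthetic frames with 4 features each.
frames = [[math.sin(0.1 * t), math.cos(0.1 * t), 0.01 * t, 1.0] for t in range(12)]
segments = split_segments(frames, seg_len=4)            # 3 segments of 4 frames
local = [mean_pool(s) for s in segments]                # local (per-segment) features
contextual = multi_head_self_attention(local, num_heads=2)  # cross-segment context
fused, weights = importance_fusion(contextual, score_vector=[0.5, -0.5, 1.0, 0.0])
```

Here `weights` holds one importance value per segment and `fused` is the utterance-level vector a classifier would consume. The softmax normalisation keeps the fusion a convex combination of segment features, so less informative segments are down-weighted rather than discarded.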

Key words: Multi-head attention mechanism, Correlation modeling, Importance mining, Feature fusion, Depression recognition

CLC Number: TP391