Computer Science ›› 2026, Vol. 53 ›› Issue (1): 180-186. doi: 10.11896/jsjkx.241200006

• Computer Graphics & Multimedia •

Multi-task Speech Emotion Recognition Incorporating Gender Information

YAO Jia, LI Dongdong, WANG Zhe   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2024-12-02  Revised: 2025-04-09  Published: 2026-01-08
  • About author: YAO Jia, born in 2001, postgraduate. Her main research interests include deep learning and speech emotion recognition.
    LI Dongdong, born in 1981, Ph.D, associate professor, is a member of CCF (No.15173M). Her main research interests include speech processing and emotion computing.
  • Supported by:
    National Natural Science Foundation of China(62276098).

Abstract: Existing speech emotion recognition methods usually rely on deep learning models to extract acoustic features, but most of them model only generic features and fail to fully exploit prior knowledge in the data that is closely related to emotion. To this end, this paper proposes an end-to-end multi-task learning framework that uses the self-supervised pre-trained model WavLM to extract speech features rich in emotional information, and introduces gender recognition as an auxiliary task to account for the influence of gender differences on emotion recognition. To address the learning imbalance caused by fixed task weights in traditional multi-task learning frameworks, this paper further proposes a Temperature-aware Dynamic Weight Averaging (TA-DWA) method, which balances the learning speeds of different tasks by dynamically adjusting the temperature coefficient and achieves a more reasonable weight allocation by incorporating the rate of change of each task's loss. Experimental results on the IEMOCAP and EMODB datasets demonstrate that the proposed approach significantly improves emotion recognition accuracy. These findings validate the effectiveness of gender recognition as an auxiliary task and highlight the advantages of the dynamic weighting strategy in multi-task learning.
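The abstract describes the idea behind TA-DWA but not its exact formula. For orientation, standard Dynamic Weight Averaging [19] weights each task by a temperature-scaled softmax over r_k = L_k(t-1)/L_k(t-2), the rate of change of that task's loss. The Python sketch below shows that mechanism with one assumed temperature-adaptation rule; the function name, the update T*(1 + max(r) - min(r)), and the two-task (emotion, gender) setup are illustrative guesses, not the authors' implementation.

    import math

    def ta_dwa_weights(loss_hist, T=2.0):
        # loss_hist: per-epoch average losses, e.g. [[L_emotion, L_gender], ...]
        # Returns one weight per task; weights sum to the number of tasks,
        # as in standard DWA (Liu et al. [19]).
        num_tasks = len(loss_hist[-1])
        if len(loss_hist) < 2:
            return [1.0] * num_tasks          # warm-up epochs: equal weights
        # r_k = L_k(t-1) / L_k(t-2): relative descent rate of each task's loss
        rates = [loss_hist[-1][k] / loss_hist[-2][k] for k in range(num_tasks)]
        # Assumed temperature adaptation (illustrative): when the descent rates
        # diverge, raise T to soften the softmax so no single task dominates.
        T_adapted = T * (1.0 + max(rates) - min(rates))
        exps = [math.exp(r / T_adapted) for r in rates]
        return [num_tasks * e / sum(exps) for e in exps]

    # Usage inside a training loop:
    loss_hist = [[1.52, 0.71], [1.31, 0.52]]  # [emotion, gender] epoch losses
    w_emotion, w_gender = ta_dwa_weights(loss_hist)
    # total_loss = w_emotion * emotion_loss + w_gender * gender_loss

In the paper's framework, the two loss terms would come from an emotion-classification head and a gender-recognition head sharing WavLM features; the weighted sum is the loss actually back-propagated.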

Key words: Speech emotion recognition, Multi-task learning, Dynamic weight assignment, Self-supervised models

CLC Number: TP301
[1]GEORGE S M,ILYAS P M.A review on speech emotion recognition:a survey,recent advances,challenges,and the influence of noise[J].Neurocomputing,2024,568:127015.
[2]EL AYADI M,KAMEL M S,KARRAY F.Survey on speech emotion recognition:Features,classification schemes,and databases[J].Pattern Recognition,2011,44(3):572-587.
[3]HASHEM A,ARIF M,ALGHAMDI M.Speech emotion recognition approaches:A systematic review[J].Speech Communication,2023,154:102974.
[4]CHEN Z,LIN M,WANG Z,et al.Spatio-temporal representation learning enhanced speech emotion recognition with multi-head attention mechanisms[J].Knowledge-Based Systems,2023,281:111077.
[5]BAEVSKI A,ZHOU Y,MOHAMED A,et al.wav2vec 2.0:A framework for self-supervised learning of speech representations[J].Advances in Neural Information Processing Systems,2020,33:12449-12460.
[6]HSU W N,BOLTE B,TSAI Y H H,et al.HuBERT:Self-supervised speech representation learning by masked prediction of hidden units[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2021,29:3451-3460.
[7]PEPINO L,RIERA P,FERRER L.Emotion recognition from speech using wav2vec 2.0 embeddings[J].arXiv:2104.03502,2021.
[8]CHEN L W,RUDNICKY A.Exploring wav2vec 2.0 fine tuning for improved speech emotion recognition[C]//ICASSP 2023-2023 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2023:1-5.
[9]LAUSEN A,SCHACHT A.Gender differences in the recognition of vocal emotions[J].Frontiers in Psychology,2018,9:882.
[10]THAKARE C,CHAURASIA N K,RATHOD D,et al.Gender aware CNN for speech emotion recognition[J].Health Informatics:A Computational Perspective in Healthcare,2021,932:367-377.
[11]ZHANG Y,YANG Q.A survey on multi-task learning[J].IEEE Transactions on Knowledge and Data Engineering,2021,34(12):5586-5609.
[12]NEDIYANCHATH A,PARAMASIVAM P,YENIGALLA P.Multi-head attention for speech emotion recognition with auxiliary learning of gender recognition[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2020:7179-7183.
[13]LIU Z T,HAN M T,WU B H,et al.Speech emotion recognition based on convolutional neural network with attention-based bidirectional long short-term memory network and multi-task learning[J].Applied Acoustics,2023,202:109178.
[14]AHN C S,RANA R,BUSSO C,et al.Multitask Transformer for Cross-Corpus Speech Emotion Recognition[J].IEEE Transactions on Affective Computing,2025,16(3):1581-1591.
[15]CAI X,YUAN J,ZHENG R,et al.Speech emotion recognition with multi-task learning[C]//Interspeech.2021:4508-4512.
[16]SHARMA M.Multi-lingual multi-task speech emotion recognition using wav2vec 2.0[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2022:6907-6911.
[17]CHEN S,WANG C,CHEN Z,et al.WavLM:Large-scale self-supervised pre-training for full stack speech processing[J].IEEE Journal of Selected Topics in Signal Processing,2022,16(6):1505-1518.
[18]LIM J,KIM K.Wav2vec-VC:Voice Conversion via Hidden Representations of Wav2vec 2.0[C]//ICASSP 2024-2024 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2024:10326-10330.
[19]LIU S,JOHNS E,DAVISON A J.End-to-end multi-task learning with attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:1871-1880.
[20]CHEN Z,BADRINARAYANAN V,LEE C Y,et al.GradNorm:Gradient normalization for adaptive loss balancing in deep multitask networks[C]//International Conference on Machine Learning.PMLR,2018:794-803.
[21]BUSSO C,BULUT M,LEE C C,et al.IEMOCAP:Interactive emotional dyadic motion capture database[J].Language Resources and Evaluation,2008,42:335-359.
[22]BURKHARDT F,PAESCHKE A,ROLFES M,et al.A database of German emotional speech[C]//Interspeech.2005:1517-1520.
[23]KENDALL A,GAL Y,CIPOLLA R.Multi-task learning using uncertainty to weigh losses for scene geometry and semantics[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2018:7482-7491.
[24]LIU B,LIU X,JIN X,et al.Conflict-averse gradient descent for multi-task learning[J].Advances in Neural Information Processing Systems,2021,34:18878-18890.
[25]LATIF S,RANA R,KHALIFA S,et al.Multitask learning from augmented auxiliary data for improving speech emotion recognition[J].IEEE Transactions on Affective Computing,2022,14(4):3164-3176.
[26]GAO Y,LIU J X,WANG L,et al.Domain-adversarial autoencoder with attention based feature level fusion for speech emotion recognition[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2021:6314-6318.
[27]ZOU H,SI Y,CHEN C,et al.Speech emotion recognition with co-attention based multi-level acoustic information[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2022:7367-7371.
[28]CHEN Z,LI J,LIU H,et al.Learning multi-scale features for speech emotion recognition with connection attention mechanism[J].Expert Systems with Applications,2023,214:118943.
[29]AFTAB A,MORSALI A,GHAEMMAGHAMI S,et al.Light-SERNet:A lightweight fully convolutional neural network for speech emotion recognition[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2022:6912-6916.
[30]YE J,WEN X C,WEI Y,et al.Temporal modeling matters:A novel temporal emotional modeling approach for speech emotion recognition[C]//ICASSP 2023-2023 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP).IEEE,2023:1-5.