Computer Science ›› 2026, Vol. 53 ›› Issue (1): 180-186. doi: 10.11896/jsjkx.241200006

• Computer Graphics & Multimedia •

Multi-task Speech Emotion Recognition Incorporating Gender Information

YAO Jia, LI Dongdong, WANG Zhe   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2024-12-02  Revised: 2025-04-09  Online: 2026-01-08
  • Corresponding author: LI Dongdong (ldd@ecust.edu.cn)
  • About author: YAO Jia, born in 2001, postgraduate (yaojia27@163.com). Her main research interests include deep learning and speech emotion recognition.
    LI Dongdong, born in 1981, Ph.D., associate professor, is a member of CCF (No. 15173M). Her main research interests include speech processing and affective computing.
  • Supported by:
    National Natural Science Foundation of China (62276098).

Abstract: Existing speech emotion recognition methods usually rely on deep learning models to extract acoustic features, but most of them focus only on modeling generic features and fail to fully exploit prior knowledge in the data that is closely related to emotion. To this end, this paper proposes an end-to-end multi-task learning framework that uses the self-supervised pre-trained model WavLM to extract speech features rich in emotional information, and introduces gender recognition as an auxiliary task to capture the influence of gender differences on emotion recognition. To address the learning imbalance caused by fixed loss weights in traditional multi-task learning frameworks, the paper further proposes a Temperature-aware Dynamic Weight Averaging (TA-DWA) method, which balances the learning speeds of different tasks by dynamically adjusting the temperature coefficient and achieves a more reasonable weight allocation by incorporating the rate of change of each task's loss. Experimental results on the IEMOCAP and EMODB datasets show that the proposed approach significantly improves emotion recognition accuracy, validating the effectiveness of gender recognition as an auxiliary task and the advantages of the dynamic weighting strategy in multi-task learning.
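
The abstract describes the system only at a high level, so two brief sketches follow. First, the shared-encoder multi-task setup: a minimal PyTorch sketch, assuming the HuggingFace transformers WavLM checkpoint microsoft/wavlm-base-plus and a four-class emotion setup typical of IEMOCAP; the class name GenderAwareSER and both linear heads are illustrative assumptions, not the authors' exact architecture.

```python
import torch.nn as nn
from transformers import WavLMModel  # self-supervised encoder named in the abstract

class GenderAwareSER(nn.Module):
    """Illustrative shared-encoder multi-task model: WavLM features feed an
    emotion classifier (main task) and a gender classifier (auxiliary task)."""
    def __init__(self, num_emotions=4):  # 4 classes is a common IEMOCAP setup
        super().__init__()
        self.encoder = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
        hidden = self.encoder.config.hidden_size
        self.emotion_head = nn.Linear(hidden, num_emotions)
        self.gender_head = nn.Linear(hidden, 2)

    def forward(self, waveform):
        # Mean-pool the frame-level representations into one utterance vector.
        feats = self.encoder(waveform).last_hidden_state.mean(dim=1)
        return self.emotion_head(feats), self.gender_head(feats)
```

Second, the loss weighting. The abstract does not give the exact TA-DWA update rule, so this is a NumPy sketch of a DWA-style weighting whose temperature tracks how far the tasks' learning speeds diverge; the function name ta_dwa_weights and the spread-based temperature rule are assumptions for illustration only.

```python
import numpy as np

def ta_dwa_weights(loss_history, num_tasks=2, base_temp=2.0):
    """Illustrative temperature-aware DWA. loss_history holds one array of
    per-task losses per epoch; returns weights that sum to num_tasks."""
    if len(loss_history) < 2:
        return np.ones(num_tasks)  # uniform weights until history exists
    # Standard DWA learning-speed term: r_k = L_k(t-1) / L_k(t-2);
    # r_k near or above 1 means task k is currently learning slowly.
    r = np.asarray(loss_history[-1]) / np.asarray(loss_history[-2])
    # Assumed temperature rule: raise the temperature when learning speeds
    # diverge, softening the softmax so no single task dominates the loss.
    temp = base_temp * (1.0 + float(np.std(r)))
    e = np.exp(r / temp)
    return num_tasks * e / e.sum()

# Per epoch: w = ta_dwa_weights(history); total = w[0]*emotion_loss + w[1]*gender_loss
```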

Key words: Speech emotion recognition, Multi-task learning, Dynamic weight assignment, Self-supervised models

CLC Number: TP301