Computer Science ›› 2026, Vol. 53 ›› Issue (1): 180-186. doi: 10.11896/jsjkx.241200006
姚佳, 李冬冬, 王喆
YAO Jia, LI Dongdong, WANG Zhe
Abstract: Existing speech emotion recognition methods typically rely on deep learning models to extract acoustic features, but most focus only on modeling generic features and fail to fully exploit emotion-related prior knowledge in the data. To address this, an end-to-end multi-task learning framework is proposed that uses the self-supervised pre-trained model WavLM to extract speech features rich in emotional information, and takes gender recognition as an auxiliary task to capture the latent influence of gender differences on emotion recognition. To mitigate the learning imbalance caused by fixed-weight loss computation in conventional multi-task learning frameworks, a Temperature-aware Dynamic Weight Averaging (TA-DWA) method with an adaptive temperature coefficient is further proposed. The method balances the learning speeds of different tasks by dynamically adjusting the temperature coefficient, and combines this with the rate of change of each task's loss to achieve a more reasonable weight allocation. Experimental results on the IEMOCAP and EMODB datasets show that the proposed method significantly improves emotion recognition accuracy, verifying the effectiveness of gender recognition as an auxiliary task and the advantage of the dynamic weighting strategy in multi-task learning.
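Since the abstract describes TA-DWA as a temperature-adjusted extension of Dynamic Weight Averaging, a minimal sketch may help make the weighting step concrete. The code below shows standard DWA (Liu et al., 2019) next to a hypothetical temperature-aware variant; the abstract does not specify the adaptation rule, so the `T_base * (1 + r.std())` update and the names `dwa_weights` / `ta_dwa_weights` are illustrative assumptions, not the authors' method.

```python
import numpy as np

def dwa_weights(loss_history, T=2.0):
    """Standard Dynamic Weight Averaging (Liu et al., 2019).

    loss_history: list of per-epoch loss vectors, one float per task.
    Returns one weight per task; weights sum to the task count K.
    """
    K = len(loss_history[-1])
    if len(loss_history) < 2:
        return np.ones(K)  # equal weights until two epochs of history exist
    # r_k = L_k(t-1) / L_k(t-2): relative descent rate of each task's loss
    r = np.asarray(loss_history[-1]) / np.asarray(loss_history[-2])
    e = np.exp(r / T)
    return K * e / e.sum()

def ta_dwa_weights(loss_history, T_base=2.0):
    """Hypothetical temperature-aware variant (sketch of TA-DWA).

    ASSUMPTION: the abstract does not give the adaptation rule; here the
    temperature widens when the per-task descent rates diverge (softening
    the weights) and narrows when the tasks learn at similar speeds.
    """
    K = len(loss_history[-1])
    if len(loss_history) < 2:
        return np.ones(K)
    r = np.asarray(loss_history[-1]) / np.asarray(loss_history[-2])
    T = T_base * (1.0 + float(r.std()))  # assumed adaptive-temperature rule
    e = np.exp(r / T)
    return K * e / e.sum()

# Example: the emotion loss descends faster than the gender auxiliary loss.
history = [[1.00, 0.60], [0.70, 0.55]]
print(dwa_weights(history))     # fixed temperature
print(ta_dwa_weights(history))  # adaptive temperature
```

In training, the per-task losses would then be combined each epoch as `total = sum(w[k] * L[k])`, so the slower-descending task receives the larger weight.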