Computer Science, 2026, Vol. 53, Issue (5): 99-108. doi: 10.11896/jsjkx.250600162

• Intelligent Educational Technology •

Multimodal Continuous Emotion Recognition for English Spoken Emotion Evaluation

WANG Liyan1, ZHANG Qian2, GUO Yuanyuan2, CHEN Haifeng2, LI Jian2

  1. School of Culture and Education, Shaanxi University of Science and Technology, Xi’an 710021, China
  2. School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China
  • Received: 2025-06-24 Revised: 2025-10-27 Online: 2026-05-08
  • Corresponding author: LI Jian (lijianjsj@sust.edu.cn)
  • About the authors: WANG Liyan (wangliyan@sust.edu.cn), born in 1978, holds a master's degree and is a lecturer. Her main research interests include corpus linguistics and educational informatization.
    LI Jian, born in 1975, Ph.D., professor, is a member of CCF (No. 43408M). His main research interests include computer vision and educational informatization.
  • Supported by:
    National Natural Science Foundation of China (62306172), International Education and Teaching Reform Research Program of Shaanxi University of Science and Technology (GJ22YB09), and Teaching Reform Program of Shaanxi University of Science and Technology (23Y081).

Abstract: Spoken English plays an important role in English learning. To address the scarcity of datasets for evaluating emotional expression in spoken English and the underuse of multimodal information, this paper constructs a new dataset, the English Spoken Multimodal Emotion Dataset (ESMED), annotated with continuous emotion labels (arousal and valence) and emotional quality scores. It also proposes a network model for evaluating emotion in spoken English. The model first compresses and fuses continuous emotional information through perceiver resampling and multimodal fusion modules to predict arousal and valence. A learnable bottleneck layer and a joint decoding layer then transform the features, and an emotional quality evaluation module jointly decodes arousal, valence, and the transformed features into the final quantified emotional quality score. On the ESMED dataset, the model achieves a concordance correlation coefficient (CCC) of 0.5003 and a mean absolute error (MAE) of 0.6354, supporting the effectiveness and accuracy of the proposed method.

Key words: Emotion recognition, Perceiver resampling, Multimodal fusion, Joint decoding, Emotion quality evaluation
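
The abstract reports results as a concordance correlation coefficient (CCC) and a mean absolute error (MAE) but does not spell out either formula. The sketch below uses the standard textbook definitions, with population (biased) variance and covariance for CCC; whether the paper uses exactly this variant, and which predicted quantity each metric is applied to, is not stated on this page and is an assumption here. The function names `ccc` and `mae` and the sample values are illustrative only.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance correlation coefficient using population moments.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both series are constant and equal; the ratio is undefined.
        raise ValueError("CCC is undefined for identical constant series")
    return 2 * cov / denom


def mae(pred, gold):
    """Mean absolute error between two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return fmean([abs(p - g) for p, g in zip(pred, gold)])


if __name__ == "__main__":
    # Made-up values, not taken from ESMED.
    predicted = [1.0, 2.0, 3.0]
    annotated = [2.0, 3.0, 4.0]
    print(ccc(predicted, annotated))
    print(mae(predicted, annotated))
```

Unlike Pearson correlation, CCC penalizes a constant offset between prediction and annotation through the squared mean difference in the denominator, which is why a perfectly correlated but shifted prediction scores below 1.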

CLC Number: G434