Multimodal Continuous Emotion Recognition for English Spoken Emotion Evaluation

doi:10.11896/jsjkx.250600162

Computer Science, 2026, Vol. 53, Issue (5): 99-108.

• Intelligent Education Technology •

Multimodal Continuous Emotion Recognition for English Spoken Emotion Evaluation

WANG Liyan¹, ZHANG Qian², GUO Yuanyuan², CHEN Haifeng², LI Jian²

1 School of Culture and Education, Shaanxi University of Science and Technology, Xi’an 710021, China
2 School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China

Received: 2025-06-24 Revised: 2025-10-27 Published: 2026-05-08
About author: WANG Liyan, born in 1978, master's degree, lecturer. Her main research interests include corpus linguistics and educational informatization.
LI Jian, born in 1975, Ph.D., professor, is a member of CCF (No. 43408M). His main research interests include computer vision and educational informatization.
Supported by:
National Natural Science Foundation of China (62306172), International Education and Teaching Reform Research Program of Shaanxi University of Science and Technology (GJ22YB09), and Teaching Reform Program of Shaanxi University of Science and Technology (23Y081).

Abstract

Spoken English occupies a crucial position in English learning. To address the scarcity of datasets for evaluating emotional expression in spoken English and the inadequate use of multimodal information, this paper introduces a new dataset, the English Spoken Multimodal Emotion Dataset (ESMED), annotated with continuous emotions (arousal, valence) and emotional quality scores. A network model for evaluating spoken English emotion is also proposed. The model first compresses and fuses continuous emotional information through perceiver resampling and multimodal fusion modules to predict arousal and valence. It then transforms the features through learnable bottleneck and joint decoding layers. Finally, the emotional quality evaluation module jointly decodes arousal, valence, and the transformed features to obtain the quantified emotional quality score. Experimental results show that the proposed model achieves a concordance correlation coefficient (CCC) of 0.5003 and a mean absolute error (MAE) of 0.6354 on the ESMED dataset, verifying the effectiveness and accuracy of the proposed method.
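The abstract reports results as CCC and MAE but does not give their formulas. As a minimal sketch, the snippet below computes both using the standard definition of Lin's concordance correlation coefficient with population (divide-by-n) statistics; whether the paper uses exactly this variant is an assumption, and the function names `concordance_cc` and `mean_absolute_error` are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def concordance_cc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Lin's CCC: 2*cov / (var_pred + var_gold + (mean_pred - mean_gold)^2).

    Assumption: population statistics (divide by n), not sample statistics.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical; CCC is undefined here.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2.0 * cov / denom


def mean_absolute_error(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Average absolute difference between predictions and labels."""
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the paper.
    gold = [1.0, 2.0, 3.0]
    shifted = [2.0, 3.0, 4.0]
    print(concordance_cc(shifted, gold))
    print(mean_absolute_error(shifted, gold))
```

Unlike Pearson correlation, CCC penalizes a constant offset between predictions and labels: in the toy example the two sequences are perfectly correlated, yet the shift of 1 lowers the CCC below 1 and yields an MAE of 1.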

Key words: Emotion recognition, Perceiver resampling, Multimodal fusion, Joint decoding, Emotion quality evaluation

CLC Number: G434

WANG Liyan, ZHANG Qian, GUO Yuanyuan, CHEN Haifeng, LI Jian. Multimodal Continuous Emotion Recognition for English Spoken Emotion Evaluation [J]. Computer Science, 2026, 53(5): 99-108.

