Computer Science, 2026, Vol. 53, Issue (2): 67-77. doi: 10.11896/jsjkx.250300026

• Educational Data Mining Based on Graph Machine Learning •

Research on Student Classroom Concentration Integrating Cross-modal Attention and Role Interaction

ZHUO Tienong1, YING Di2, ZHAO Hui2

  1. School of Software, Xinjiang University, Urumqi 830046, China
  2. School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
  • Received: 2025-03-05  Revised: 2025-08-26  Online: 2026-02-10
  • Corresponding author: ZHAO Hui (zhaohui@xju.edu.cn)
  • About author: ZHUO Tienong, born in 1995, master (shaoyang1906@163.com). His main research interest is digital image processing.
    ZHAO Hui, born in 1972, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.25440S). Her main research interests include artificial intelligence, natural language processing, emotion computing, and speech and digital image processing.
  • Supported by:
    Key R&D Program of Xinjiang Uygur Autonomous Region (2023B01032) and National Natural Science Foundation of China (62166041).


Abstract: With the continuous development of smart education, schools can assess students' learning and teachers' teaching quality by detecting students' concentration in the classroom, and thereby optimize the teaching system. Previous studies have mostly focused on single-modality, single-role feature extraction. However, the teaching classroom is a complex scene involving multiple modalities and multiple mutually influencing roles, so it is of great significance to study students' classroom concentration from a multimodal, multi-role perspective. How to effectively model the temporal correlation and semantic interaction between modalities, and how the roles influence one another, remain major challenges in assessing students' classroom concentration. To address these problems, a student classroom concentration dataset containing teacher audio and student video is constructed, and a multimodal, multi-role Long-Short Context Model (LSCM) for assessing students' classroom concentration is proposed, in which multimodal refers to student video and teacher audio, and multi-role refers to student-student and student-teacher interaction. The model comprises two main modules: a long-term context module and a short-term context module. Specifically, the long-term context module extracts the long-term behavioral features of a single student through audio and visual self-attention mechanisms, and an audio-visual cross-attention mechanism strengthens the correlation between audio and visual information. In contrast, the short-term context module focuses on localized time segments to capture the dynamic changes in the concentration of multiple students in the classroom environment. Finally, the model outputs the concentration category of each student in the video. Experiments show that, by effectively exploiting the complementarity of multimodal data and the correlation between roles, the proposed method significantly improves concentration detection accuracy over existing methods, verifying the effectiveness of multimodal fusion and role interaction modeling.
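The abstract describes an audio-visual cross-attention step in which one modality's features attend over the other's. As a rough illustration of that mechanism only (a minimal NumPy sketch with hypothetical sequence lengths and feature dimension, not the paper's actual LSCM implementation), student-video features can serve as queries attending over teacher-audio features:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, d_k):
    """Scaled dot-product cross-attention: each query step becomes a
    weighted mixture of the other modality's steps."""
    scores = queries @ keys.T / np.sqrt(d_k)   # (T_q, T_k) similarity
    weights = softmax(scores, axis=-1)          # rows sum to 1
    return weights @ keys, weights              # (T_q, d) fused features

# Hypothetical dimensions for illustration only.
rng = np.random.default_rng(0)
T_video, T_audio, d = 8, 12, 16
visual = rng.standard_normal((T_video, d))      # per-frame student features
audio = rng.standard_normal((T_audio, d))       # per-frame teacher audio features

fused, w = cross_attention(visual, audio, d)
print(fused.shape)  # (8, 16): one audio-informed feature per video step
```

In a full model the keys and values would be separate learned projections; they are collapsed here purely to show how the cross-modal weighting correlates the two streams.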

Key words: Multimodal, Student concentration, Teaching classroom, Role interaction, Attention mechanism

CLC number: TP391.1