Computer Science, 2026, Vol. 53, Issue (2): 67-77. doi: 10.11896/jsjkx.250300026

• Educational Data Mining Based on Graph Machine Learning •

Research on Student Classroom Concentration Integrating Cross-modal Attention and Role Interaction

ZHUO Tienong1, YING Di2, ZHAO Hui2   

  1 School of Software, Xinjiang University, Urumqi 830046, China
  2 School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
  • Received: 2025-03-05 Revised: 2025-08-26 Published: 2026-02-10
  • About author: ZHUO Tienong, born in 1995, master. His main research interest is digital image processing.
    ZHAO Hui, born in 1972, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.25440S). Her main research interests include artificial intelligence, natural language processing, affective computing, speech and digital image processing.
  • Supported by: Key R&D Program of Xinjiang Uygur Autonomous Region (2023B01032) and National Natural Science Foundation of China (62166041).

Abstract: With the continuing development of innovative education, schools can assess students' learning and teachers' teaching quality by detecting students' concentration in the classroom, and use the results to optimize the teaching system. Previous studies have focused on single-modality, single-role feature extraction. The classroom, however, is a complex scene involving multiple modalities, multiple roles, and interactions among those roles, so it is valuable to study students' classroom concentration from a multimodal, multi-role perspective. The main challenge is how to effectively model the temporal relevance and semantic interaction between modalities, as well as the interactions among roles. To address these problems, a student classroom concentration dataset containing teacher audio and student video is constructed, and a Long-Short Context Model (LSCM) is proposed for multimodal, multi-role assessment of students' classroom concentration. Here, multimodal refers to the student video and the teacher audio, and multi-role refers to student-to-student and student-to-teacher interactions. The model contains two main modules: a long-term context module and a short-term context module. The long-term context module extracts the long-term behavioral characteristics of a single student through an audio self-attention mechanism and a visual self-attention mechanism, and an audio-visual cross-attention mechanism enhances the correlation between the audio and visual information. In contrast, the short-term context module focuses on localized time segments to capture the dynamic changes in the attentiveness of multiple students in the classroom environment. Finally, the model outputs the concentration category of each student in the video. Experiments show that, by effectively exploiting the complementary nature of multimodal data and the correlation between roles, the method significantly improves concentration detection accuracy compared with existing methods, which also verifies the effectiveness of multimodal fusion and role interaction modeling.
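
As an illustration of the audio-visual cross-attention idea mentioned in the abstract, the following is a minimal sketch of generic scaled dot-product cross-attention in which student-video features act as queries over teacher-audio features. It is not the authors' LSCM implementation: the function names, the toy feature values, and the absence of learned projection matrices are all assumptions made to keep the example short.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows.

    queries: T_q x d, keys: T_k x d, values: T_k x d_v (lists of lists).
    Returns (output of shape T_q x d_v, weights of shape T_q x T_k).
    Hypothetical helper; a real model would first apply learned projections.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    d_v = len(values[0])
    output = [
        [sum(w[j] * values[j][c] for j in range(len(values))) for c in range(d_v)]
        for w in weights
    ]
    return output, weights

# Toy features (made up for illustration): 3 video time steps, 2 audio time steps.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0]]

# Video queries attend over audio keys and values.
fused, attn = cross_attention(video, audio, audio)
```

Each row of `attn` is a probability distribution over audio time steps, so `fused` holds one audio summary per video time step. Self-attention is the same computation with queries, keys, and values all drawn from a single modality.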

Key words: Multimodal, Student concentration, Teaching classroom, Role interaction, Attention mechanism

CLC Number: TP391.1