Computer Science ›› 2026, Vol. 53 ›› Issue (5): 30-40. doi: 10.11896/jsjkx.250600132

• Intelligent Educational Technology •

Application Advantages, Cases and Practical Challenges of Multimodal Technology in the Field of Education

LI Mengge1,2,3, WANG Gang1,2, BAI Wenhao4, LEI Xue3

    1 School of Information, Xi’an University of Finance and Economics, Xi’an 710100, China
    2 Intellectual Property Collaborative Trustworthy Computing Key Laboratory of Shaanxi Province Universities, Xi’an 710100, China
    3 School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
    4 BYD Automobile Co., Ltd., Xi’an 710100, China
  • Received: 2025-06-20 Revised: 2025-09-09 Online: 2026-05-08
  • Corresponding author: WANG Gang (gangw@xaufe.edu.cn)
  • About author: (192339@snnu.edu.cn)
    LI Mengge, born in 1997, Ph.D., lecturer. Her main research interests include intelligent educational technology and mobile crowdsensing.
    WANG Gang, born in 1974, Ph.D., professor. His main research interests include cloud computing, big data, IoT, trust security and service computing.
  • Supported by:
    National Natural Science Foundation of China (62377031).

Abstract: Education is the cornerstone of national development and rejuvenation. However, the traditional education model is limited in its teaching methods, resource allocation, and evaluation systems: teaching methods are monotonous, educational resources are unevenly distributed, and evaluation is one-sided. With the rapid development of artificial intelligence, multimodal technology, an emerging approach that integrates multiple forms of data (such as images, sound, and text), offers new possibilities for addressing these problems. Through applications such as smart classrooms and personalized learning systems, multimodal technology can comprehensively perceive and understand the learning environment, thereby overcoming the limitations of the traditional model, enhancing the learning experience, promoting educational equity, and enabling personalized learning evaluation. This paper first outlines the definition, connotation, and core algorithms of multimodal technology and traces its development and standing within artificial intelligence. It then analyzes the advantages of applying multimodal technology in education along multiple dimensions, with discussion of specific cases. Finally, it discusses the challenges that multimodal technology faces in educational applications, including data privacy, technical cost, and ethical issues. By examining these applications and challenges, the paper aims to provide a theoretical basis and practical guidance for educational innovation and to promote more intelligent, personalized, and equitable education.

Key words: Multimodal technology, Educational innovation, Application advantages, Challenges

CLC Number: 

  • TP391