Computer Science, 2026, Vol. 53, Issue (1): 51-57. doi: 10.11896/jsjkx.250800033

• Research and Application of Large Language Model Technology •

Pre-training World Models from Videos with Generated Actions by Multi-modal Large Models

WAN Shenghua, XU Xingye, GAN Le, ZHAN Dechuan   

  1. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China;
    State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2025-08-11 Revised: 2025-10-20 Online: 2026-01-08
  • Corresponding author: ZHAN Dechuan (zhandc@nju.edu.cn)
  • About author: WAN Shenghua (wansh@lamda.nju.edu.cn), born in 1999, doctoral candidate, is a student member of CCF (No. I7496G). His main research interests include reinforcement learning and world models.
    ZHAN Dechuan, born in 1982, Ph.D., professor, is a member of CCF (No. 20015M). His main research interests include machine learning and data mining.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (Ph.D. Candidate) (624B200197) and National Key Research and Development Program of China (2022ZD0114805).

Abstract: Pre-training world models is a key technique for improving the sample efficiency of reinforcement learning, but existing methods struggle to capture the causal mechanisms behind state transitions because video data lack explicit action labels. This paper presents MAPO (Multimodal-large-model-generated Action-based Pre-training from videOs for world models), a novel pre-training framework that couples the semantic understanding of vision-language models with the demands of dynamics modeling, overcoming the absence of action semantics that limits traditional pre-training paradigms. During pre-training, MAPO uses a multimodal large model (Qwen2.5-VL-7B) to analyze video frame sequences and generate fine-grained semantic action descriptions, establishing causally interpretable action-state associations. It also introduces a context quantization encoding mechanism that decouples static scene features from dynamic control factors, strengthening cross-modal representation. During fine-tuning, a dual-network collaborative architecture aligns the pre-trained dynamics features with real-environment actions end to end. Experiments on eight tasks from the DeepMind Control Suite and Meta-World show that MAPO consistently improves average returns over the best baseline, with especially strong gains on long-horizon tasks. This study offers a new paradigm for cross-modal world model training and highlights the key role of semantic action generation in causal reasoning.
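The abstract compresses the method into a few clauses, so a concrete sketch may help readers picture the pipeline. The Python toy below is a minimal illustration of one plausible instantiation, assuming a VQ-VAE-style codebook for the "context quantization encoding mechanism" and treating the multimodal model as a black-box captioner of the motion between consecutive frames. Every name in it (FakeVLM, ContextQuantizer, embed_text, pretrain_step) is a hypothetical stand-in, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64  # latent width of the toy model

class FakeVLM:
    """Stand-in for a multimodal large model such as Qwen2.5-VL-7B: given two
    consecutive frames, it would return a textual description of the action."""
    def describe(self, frame_a, frame_b):
        return "the arm pushes the block to the left"  # placeholder caption

class ContextQuantizer(nn.Module):
    """Assumed VQ-VAE-style codebook: discretizes the context code so that
    static scene features are separated from dynamic control factors."""
    def __init__(self, num_codes=512, dim=DIM):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):  # z: (T, dim)
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx)
        commit = F.mse_loss(z, z_q.detach())  # pulls encoder toward codes
        embed = F.mse_loss(z_q, z.detach())   # pulls codes toward encoder
        z_q = z + (z_q - z).detach()          # straight-through estimator
        return z_q, embed + 0.25 * commit

def embed_text(captions, dim=DIM):
    """Toy text encoder: hashes captions to fixed pseudo-embeddings. A real
    system would use a learned language encoder here."""
    gens = [torch.Generator().manual_seed(hash(c) % 2**31) for c in captions]
    return torch.stack([torch.randn(dim, generator=g) for g in gens])

def pretrain_step(frames, encoder, quantizer, dynamics, vlm):
    """One action-free pre-training step on a video clip of shape (T,C,H,W)."""
    # 1) Generate semantic "actions" between consecutive frames with the VLM.
    captions = [vlm.describe(frames[t], frames[t + 1])
                for t in range(len(frames) - 1)]
    a_sem = embed_text(captions)              # (T-1, dim) semantic actions
    # 2) Encode frames and quantize the context representation.
    z, aux = quantizer(encoder(frames))       # (T, dim)
    # 3) Train latent dynamics conditioned on the generated actions.
    pred = dynamics(torch.cat([z[:-1], a_sem], dim=-1))
    return F.mse_loss(pred, z[1:].detach()) + aux

# Toy usage: 8 frames of 3x32x32 video.
frames = torch.rand(8, 3, 32, 32)
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, DIM))
dynamics = nn.Sequential(nn.Linear(2 * DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
loss = pretrain_step(frames, encoder, ContextQuantizer(), dynamics, FakeVLM())
loss.backward()

In the fine-tuning stage the abstract describes, a second network would presumably map real environment actions into the same latent action space that embed_text produces here, so that the pre-trained dynamics model can be aligned with true controls end to end; that alignment network is likewise not shown above.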

Key words: World models, Reinforcement learning, Video pre-training, Multi-modal large models, Semantic action generation

CLC Number: TP183