Computer Science ›› 2026, Vol. 53 ›› Issue (1): 51-57.doi: 10.11896/jsjkx.250800033

• Research and Application of Large Language Model Technology •

Pre-training World Models from Videos with Generated Actions by Multi-modal Large Models

WAN Shenghua, XU Xingye, GAN Le, ZHAN Dechuan   

  1. School of Artificial Intelligence, Nanjing University, Nanjing 210023, China;
    State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
  • Received: 2025-08-11  Revised: 2025-10-20  Published: 2026-01-08
  • About author: WAN Shenghua, born in 1999, doctoral candidate, is a student member of CCF (No.I7496G). His main research interests include reinforcement learning and world models.
    ZHAN Dechuan, born in 1982, Ph.D, professor, is a member of CCF (No.20015M). His main research interests include machine learning and data mining.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (Ph.D Candidate) (624B200197) and National Key Research and Development Program of China (2022ZD0114805).

Abstract: Pre-training of world models is key to improving the sample efficiency of reinforcement learning. However, because video data lack explicit action labels, existing methods struggle to capture the causal mechanisms of state transitions. This paper presents MAPO (Multimodal-large-model-generated Action-based pre-training from videOs for world models), a novel pre-training framework that leverages the semantic understanding of vision-language models to meet the needs of kinematic modeling, overcoming the limitation that traditional pre-training methods lack action semantics. During pre-training, MAPO uses a multimodal large model (Qwen2.5-VL-7B) to analyze video frame sequences and generate fine-grained semantic action descriptions, establishing action-state associations with causal explanations. It also designs a context quantization encoding mechanism that separates static scene features from dynamic control factors, improving cross-modal representation. During fine-tuning, MAPO uses a dual-network collaborative architecture to align the pre-trained kinematic features with real-environment actions. Experiments on 8 tasks from the DeepMind Control Suite and Meta-World show that MAPO consistently improves average returns over baselines, especially in long-horizon tasks. This study offers a new cross-modal approach to world-model training and highlights the importance of semantic action generation for causal reasoning.
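The abstract describes the MAPO pipeline only at a high level, so the following is a rough, hedged sketch of how such a pre-training step could be organized, assuming PyTorch. The function describe_actions_with_mllm, the module names ContextQuantizer and VideoWorldModel, all dimensions, and the MSE objective are illustrative placeholders introduced here for exposition; they are not the authors' implementation, and the multimodal large model call is stubbed so the sketch runs offline.

```python
# A minimal sketch (not the authors' code) of a MAPO-style pre-training step,
# assuming PyTorch. All names, sizes, and the loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def describe_actions_with_mllm(frame_pairs: torch.Tensor) -> torch.Tensor:
    """Placeholder for querying a multimodal large model (e.g. Qwen2.5-VL-7B)
    with consecutive video frames and embedding the returned semantic action
    description. Returns a random embedding so the sketch runs offline."""
    batch = frame_pairs.shape[0]
    return torch.randn(batch, 64)  # hypothetical text-embedding dimension


class ContextQuantizer(nn.Module):
    """Toy vector-quantization layer: snaps static scene features to a codebook,
    loosely mirroring the 'context quantization encoding' idea in the abstract."""
    def __init__(self, dim: int = 32, codes: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(codes, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Nearest codebook entry, with a straight-through gradient estimator.
        dists = torch.cdist(z, self.codebook.weight)
        idx = dists.argmin(dim=-1)
        quantized = self.codebook(idx)
        return z + (quantized - z).detach()


class VideoWorldModel(nn.Module):
    """Encodes a frame into a static context part and a dynamic part, then
    predicts the next dynamic state conditioned on the MLLM action embedding."""
    def __init__(self, obs_dim: int = 3 * 64 * 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(obs_dim, 64), nn.ReLU())
        self.to_context = nn.Linear(64, 32)   # static scene features
        self.to_dynamic = nn.Linear(64, 32)   # dynamic control factors
        self.quantizer = ContextQuantizer(dim=32)
        self.transition = nn.Sequential(nn.Linear(32 + 64, 64), nn.ReLU(), nn.Linear(64, 32))

    def forward(self, frame, action_emb):
        h = self.encoder(frame)
        context = self.quantizer(self.to_context(h))
        dynamic = self.to_dynamic(h)
        next_dynamic = self.transition(torch.cat([dynamic, action_emb], dim=-1))
        return context, dynamic, next_dynamic


def pretraining_step(model, optimizer, frames_t, frames_t1):
    """One action-free pre-training step on a pair of consecutive frames."""
    action_emb = describe_actions_with_mllm(torch.stack([frames_t, frames_t1], dim=1))
    _, _, pred_next = model(frames_t, action_emb)
    with torch.no_grad():
        target_next = model.to_dynamic(model.encoder(frames_t1))
    loss = F.mse_loss(pred_next, target_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = VideoWorldModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    frames_t = torch.rand(4, 3, 64, 64)    # dummy video frames
    frames_t1 = torch.rand(4, 3, 64, 64)
    print("loss:", pretraining_step(model, optimizer, frames_t, frames_t1))
```

The quantized context branch and the action-conditioned transition head are meant only to mirror, in miniature, the separation of static context from dynamic control factors and the action-state association described in the abstract; the fine-tuning stage with the dual-network alignment is not sketched here.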

Key words: World models, Reinforcement learning, Video pre-training, Multi-modal large models, Semantic action generation

CLC Number: TP183