Computer Science ›› 2026, Vol. 53 ›› Issue (1): 51-57. doi: 10.11896/jsjkx.250800033
WAN Shenghua, XU Xingye, GAN Le, ZHAN Dechuan
Abstract: Pre-trained world models are a key technique for improving the sample efficiency of reinforcement learning, but because video data lack explicit action annotations, existing methods struggle to capture the causal mechanism behind state transitions. To address this, we propose MAPO (MLM-generated Action-based Pre-training from videos for world models), a pre-training framework in which a multimodal large model generates actions from video. By integrating the semantic understanding of vision-language models with the needs of dynamics modeling, MAPO overcomes the missing-action-semantics limitation of conventional pre-training paradigms. Specifically, during pre-training MAPO uses a multimodal large model (Qwen2.5-VL-7B) to parse video frame sequences and generate fine-grained semantic action descriptions, building causally interpretable action-state associations; it further introduces a context quantization encoding mechanism that decouples static scene features from dynamic control factors, strengthening cross-modal representation. During fine-tuning, a dual-network cooperative architecture aligns the pre-trained dynamics features with real-environment actions end to end. Experiments show that MAPO achieves consistent gains in average return over the strongest baseline across 8 tasks from the DeepMind Control Suite and Meta-World, with especially strong performance on long-horizon tasks. This work provides a new paradigm for cross-modal world-model training and reveals the key role of semantic action generation in causal reasoning.
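The abstract's "context quantization encoding" decouples static scene context from dynamic control factors by mapping context features to a discrete codebook. The page itself contains no code, so the following is only a minimal sketch of the underlying vector-quantization lookup (in the style of VQ-VAE, reference [32]); the function name, shapes, and codebook size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize_context(features, codebook):
    """Nearest-codebook lookup (VQ-VAE-style quantization sketch).

    features: (batch, dim) continuous context vectors
    codebook: (num_codes, dim) discrete context codes
    Returns (indices, quantized) where quantized[i] = codebook[indices[i]].
    """
    # Squared Euclidean distance from every feature to every code,
    # computed via broadcasting: result shape (batch, num_codes).
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d.argmin(axis=1)          # nearest code per feature
    return indices, codebook[indices]   # replace features by their codes

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))      # 8 hypothetical context codes
# Two features sitting near codes 2 and 5 (small perturbation).
features = codebook[[2, 5]] + 0.01 * rng.normal(size=(2, 4))
idx, quantized = quantize_context(features, codebook)
print(idx)  # -> [2 5]
```

In a trained model the codebook would be learned jointly with the encoder (with a straight-through gradient estimator); here it is random only to make the lookup self-contained.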
CLC Number:
[1]MOERLAND T M,BROEKENS J,PLAAT A,et al.Model-based reinforcement learning:A survey[J].Foundations and Trends® in Machine Learning,2023,16(1):1-118.
[2]LUO F,XU T,LAI H,et al.A survey on model-based reinforcement learning[J].Science China(Information Sciences),2024,67(2).
[3]HA D,SCHMIDHUBER J.Recurrent world models facilitate policy evolution[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems.2018:2455-2467.
[4]HAFNER D,LILLICRAP T P,BA J,et al.Dream to control:Learning behaviors by latent imagination[C]//International Conference on Learning Representations.2020.
[5]HAFNER D,LILLICRAP T P,NOROUZI M,et al.Mastering Atari with discrete world models[C]//International Conference on Learning Representations.2021.
[6]HU A,RUSSELL L,YEO H,et al.GAIA-1:A generative world model for autonomous driving[J].arXiv:2309.17080,2023.
[7]BAI C,XU H,LI X.Embodied-AI with large models:research and challenges[J].Science China(Information Sciences),2024,54:2035-2082.
[8]HANSEN N,SU H,WANG X.TD-MPC2:Scalable,robust world models for continuous control[C]//The Twelfth International Conference on Learning Representations.Vienna,Austria,2024.
[9]XU Y,PARKER-HOLDER J,PACCHIANO A,et al.Learning general world models in a handful of reward-free deployments[J].Advances in Neural Information Processing Systems,2022,35:26820-26838.
[10]FENG Y,HANSEN N,XIONG Z,et al.Finetuning offline world models in the real world[C]//Conference on Robot Learning.PMLR,2023:425-445.
[11]SHAH S,DEY D,LOVETT C,et al.AirSim:High-fidelity visual and physical simulation for autonomous vehicles[C]//Field and Service Robotics:Results of the 11th International Conference.Springer International Publishing,2018:621-635.
[12]CHEN X,JIANG S,XU F,et al.Cross-modal domain adaptation for cost-efficient visual reinforcement learning[J].Advances in Neural Information Processing Systems,2021,34:12520-12532.
[13]LIN Q,YU C,WU X,et al.Survey on sim-to-real transfer reinforcement learning in robot systems[J].Journal of Software,2024,35(2):711-738.
[14]MA W,LI S,CAI L,et al.Learning modality knowledge alignment for cross-modality transfer[C]//Proceedings of the 41st International Conference on Machine Learning.2024:33777-33793.
[15]SEO Y,LEE K,JAMES S L,et al.Reinforcement learning with action-free pre-training from videos[C]//Proceedings of International Conference on Machine Learning.PMLR,2022:19561-19579.
[16]ZHANG L,KAN M,SHAN S,et al.PreLAR:World model pre-training with learnable action representation[C]//European Conference on Computer Vision.Cham:Springer Nature Switzerland,2024:185-201.
[17]KINGMA D P,WELLING M.Auto-encoding variational Bayes[C]//International Conference on Learning Representations.2014.
[18]WU J,YIN S,FENG N,et al.iVideoGPT:Interactive VideoGPTs are scalable world models[J].Advances in Neural Information Processing Systems,2024,37:68082-68119.
[19]MICHELI V,ALONSO E,FLEURET F.Transformers are sample-efficient world models[C]//The Eleventh International Conference on Learning Representations.2023.
[20]ZHANG W,WANG G,SUN J,et al.STORM:Efficient stochastic transformer based world models for reinforcement learning[J].Advances in Neural Information Processing Systems,2023,36:27147-27166.
[21]ROBINE J,HOFTMANN M,UELWER T,et al.Transformer-based world models are happy with 100k interactions[C]//The Eleventh International Conference on Learning Representations.2023.
[22]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017:6000-6010.
[23]BELLEMARE M G,NADDAF Y,VENESS J,et al.The arcade learning environment:An evaluation platform for general agents[J].Journal of Artificial Intelligence Research,2013,47:253-279.
[24]DENG F,PARK J,AHN S.Facing off world model backbones:RNNs,Transformers,and S4[J].Advances in Neural Information Processing Systems,2023,36:72904-72930.
[25]SOHL-DICKSTEIN J,WEISS E,MAHESWARANATHAN N,et al.Deep unsupervised learning using nonequilibrium thermodynamics[C]//Proceedings of International Conference on Machine Learning.PMLR,2015:2256-2265.
[26]ALONSO E,JELLEY A,MICHELI V,et al.Diffusion for world modeling:Visual details matter in Atari[J].Advances in Neural Information Processing Systems,2024,37:58757-58791.
[27]DING Z,ZHANG A,TIAN Y,et al.Diffusion world model:Future modeling beyond step-by-step rollout for offline reinforcement learning[J].arXiv:2402.03570,2024.
[28]WU J,MA H,DENG C,et al.Pre-training contextualized world models with in-the-wild videos for reinforcement learning[J].Advances in Neural Information Processing Systems,2023,36:39719-39743.
[29]LU C,SCHROECKER Y,GU A,et al.Structured state space models for in-context reinforcement learning[J].Advances in Neural Information Processing Systems,2023,36:47016-47031.
[30]KAELBLING L P,LITTMAN M L,CASSANDRA A R.Planning and acting in partially observable stochastic domains[J].Artificial Intelligence,1998,101(1/2):99-134.
[31]QWEN TEAM.Qwen2.5-VL[EB/OL].[2025-01-31].https://qwenlm.github.io/blog/qwen2.5-vl/.
[32]VAN DEN OORD A,VINYALS O.Neural discrete representation learning[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017.
[33]TASSA Y,DORON Y,MULDAL A,et al.DeepMind Control Suite[J].arXiv:1801.00690,2018.
[34]YU T,QUILLEN D,HE Z,et al.Meta-World:A benchmark and evaluation for multi-task and meta reinforcement learning[C]//Conference on Robot Learning.PMLR,2020:1094-1100.
[35]GOYAL R,EBRAHIMI KAHOU S,MICHALSKI V,et al.The “something something” video database for learning and evaluating visual common sense[C]//Proceedings of the IEEE International Conference on Computer Vision.2017:5842-5850.
[36]IONESCU C,PAPAVA D,OLARU V,et al.Human3.6M:Large scale datasets and predictive methods for 3D human sensing in natural environments[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,36(7):1325-1339.