Computer Science ›› 2026, Vol. 53 ›› Issue (4): 366-376. doi: 10.11896/jsjkx.250700198
潘嘉豪, 冯翔, 虞慧群
PAN Jiahao, FENG Xiang, YU Huiqun
Abstract: In recent years, reinforcement learning has achieved remarkable success across many domains. However, in dynamic environments or multi-task scenarios, conventional methods often fail to adapt to complex changes and show clear limitations. To address this, we propose a multi-task reinforcement learning method named "priority-weighted soft modularization" (SM-PHT), aimed at improving an agent's adaptability and generalization in multi-task settings. SM-PHT combines three key techniques: priority-weighted knowledge distillation, a hierarchical cache mechanism, and a task-embedding strategy. Priority-weighted knowledge distillation aggregates the knowledge of multiple high-performing models through a weighted scheme, improving the robustness and stability of the student network. The hierarchical cache mechanism manages low-level experience data and high-level model parameters separately, improving learning efficiency. The task-embedding strategy captures environment features to strengthen task representations and thereby promote cross-task knowledge transfer. Experimental results show that on the Meta-World MT10 benchmark, SM-PHT achieves twice the success rate of the current state-of-the-art method and a 30% higher average reward; on the more challenging MT50 tasks, both success rate and average reward improve by about 10%. These results indicate that the method offers good stability and generalization in complex multi-task scenarios and has potential for practical multi-task reinforcement learning applications.
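The abstract only names the three components without giving formulas. As a purely illustrative sketch, the priority-weighted knowledge distillation step might look like the following, assuming each teacher's temperature-softened output is weighted by a normalized priority score (e.g. derived from its task success rate) to form the student's soft target. All names here (`priority_weighted_targets`, `priorities`, the example logits) are hypothetical, not taken from the paper:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, as in standard knowledge distillation.
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def priority_weighted_targets(teacher_logits, priorities, temperature=2.0):
    """Blend several teachers' soft outputs into one distillation target,
    weighting each teacher by its normalized priority score."""
    w = np.asarray(priorities, dtype=float)
    w = w / w.sum()  # normalize priorities into a convex combination
    probs = np.stack([softmax(l, temperature) for l in teacher_logits])
    return (w[:, None] * probs).sum(axis=0)

# Two hypothetical teachers; the second has the higher priority,
# so the blended target leans toward its preferred action.
targets = priority_weighted_targets(
    teacher_logits=[[2.0, 1.0, 0.1], [0.5, 2.5, 0.2]],
    priorities=[0.3, 0.7],
)
print(targets)  # a valid probability distribution over 3 actions
```

The student network would then be trained to match `targets` (e.g. with a KL-divergence loss), so that teachers with higher priority contribute more to what the student learns.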