Computer Science ›› 2026, Vol. 53 ›› Issue (4): 366-376. doi: 10.11896/jsjkx.250700198

• Artificial Intelligence •

A Priority-weighted Soft Modularization Method for Multi-task Reinforcement Learning: SM-PHT

PAN Jiahao, FENG Xiang, YU Huiqun

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2025-07-31 Revised: 2025-10-23 Published: 2026-04-15 Online: 2026-04-08
  • Corresponding author: FENG Xiang (xfeng@ecust.edu.cn)
  • About author: PAN Jiahao (y30231040@mail.ecust.edu.cn)
  • Supported by:
    Key Program of the National Natural Science Foundation of China (62136003); National Natural Science Foundation of China (62276097, 62372174)

SM-PHT: A Robust, Scalable, and Efficient Method for Multi-task Reinforcement Learning

PAN Jiahao, FENG Xiang, YU Huiqun   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2025-07-31 Revised: 2025-10-23 Published: 2026-04-15 Online: 2026-04-08
  • About author: PAN Jiahao, born in 2001, postgraduate, is a member of CCF (No. Z1979G). His main research interests include deep learning, reinforcement learning, and multi-task reinforcement learning.
    FENG Xiang, born in 1977, Ph.D., professor, is a member of CCF (No. 16665M). Her main research interests include reinforcement learning, distributed swarm intelligence, evolutionary computing, and big data intelligence.
  • Supported by:
    Key Program of the National Natural Science Foundation of China (62136003) and the National Natural Science Foundation of China (62276097, 62372174).

摘要 (Abstract): In recent years, reinforcement learning has achieved remarkable success in many fields. However, in dynamic environments or multi-task scenarios, traditional methods often struggle to adapt to complex changes and exhibit clear limitations. To address this problem, a multi-task reinforcement learning method named priority-weighted soft modularization (SM-PHT) is proposed, aiming to improve the adaptability and generalization of agents in multi-task environments. SM-PHT integrates three key techniques: priority-weighted knowledge distillation, a hierarchical buffer mechanism, and a task embedding strategy. Priority-weighted knowledge distillation integrates the knowledge of multiple high-performing models through a weighting scheme, improving the robustness and stability of the student network. The hierarchical buffer mechanism manages low-level experience data and high-level model parameters separately, improving learning efficiency. The task embedding strategy captures environmental characteristics to enrich task representations, thereby promoting cross-task knowledge transfer. Experimental results show that on the Meta-World MT10 benchmark, SM-PHT doubles the success rate of the current state-of-the-art method and improves the average reward by 30%; on the more challenging MT50 benchmark, both the success rate and the average reward improve by about 10%. These results indicate that the method offers good stability and generalization in complex multi-task scenarios, demonstrating its potential for practical multi-task reinforcement learning applications.
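To make the priority-weighted knowledge distillation idea concrete, the following Python (PyTorch) sketch shows one plausible reading of the abstract's description: each teacher's soft targets are weighted by its normalized performance score before being distilled into the student. The function name, the softmax weighting, and the temperature handling are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn.functional as F

def pwkd_loss(student_logits, teacher_logits_list, teacher_scores, temperature=2.0):
    # Normalize raw performance scores (e.g., per-task success rates) into
    # priority weights that sum to one. (Hypothetical weighting scheme.)
    weights = torch.softmax(torch.as_tensor(teacher_scores, dtype=torch.float32), dim=0)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = student_logits.new_zeros(())
    for w, t_logits in zip(weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        # Standard distillation KL term, scaled by T^2 as in Hinton et al.,
        # weighted by this teacher's priority.
        loss = loss + w * F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    return loss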

关键词 (Key words): Soft modularization method, Multi-task reinforcement learning, Priority-weighted knowledge distillation, Hierarchical buffer, Task embedding

Abstract: In recent years, reinforcement learning has achieved remarkable success in various domains. However, traditional RL methods often struggle with adaptability when facing dynamic environments or multiple tasks. To address this challenge, this paper introduces SM-PHT, a robust, scalable, and efficient method for multi-task reinforcement learning. The primary objective of this research is to enhance the adaptability and generalization capabilities of reinforcement learning agents in multi-task environments by enabling them to learn and transfer knowledge across multiple tasks. SM-PHT integrates three key mechanisms: priority-weighted knowledge distillation (PWKD), a hierarchical buffer, and task embedding. PWKD leverages a weighted distillation process to assimilate knowledge from multiple high-performing models, improving the robustness and stability of the student network. Moreover, the hierarchical buffer employs dual buffers to store low-level experiential data and high-level model parameters, optimizing offline learning efficiency. Finally, task embedding enriches task representations by capturing detailed environmental characteristics, facilitating effective knowledge transfer. Experiments conducted in the Meta-World environment demonstrate SM-PHT's superior performance compared to state-of-the-art methods. In the MT10 challenge, SM-PHT achieves double the success rate and a 30% increase in average rewards. In the more complex MT50 challenge, it improves the success rate by approximately 10% and increases average rewards by around 10%. These results highlight SM-PHT's ability to handle complex tasks with remarkable stability and minimal fluctuation, making it a promising approach for real-world MTRL applications.
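As a hedged illustration of the dual-buffer ("hierarchical buffer") mechanism described above, the sketch below keeps low-level experience transitions and high-level model-parameter snapshots in separate stores. The class and method names are assumptions based only on the abstract, not the paper's actual implementation.

import copy
import random
from collections import deque

class HierarchicalBuffer:
    """Dual buffer: low-level transitions plus high-level parameter snapshots."""

    def __init__(self, transition_capacity=100_000, snapshot_capacity=10):
        self.transitions = deque(maxlen=transition_capacity)  # low-level experience
        self.snapshots = deque(maxlen=snapshot_capacity)      # high-level parameters

    def add_transition(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def add_snapshot(self, state_dict, score):
        # Keep a frozen copy of a well-performing policy's parameters together
        # with its performance score, e.g. for later use as a PWKD teacher.
        self.snapshots.append((copy.deepcopy(state_dict), score))

    def sample_transitions(self, batch_size):
        # Uniform sampling; a prioritized scheme could be substituted here.
        return random.sample(list(self.transitions), min(batch_size, len(self.transitions)))

    def best_snapshots(self, k=3):
        # The k highest-scoring parameter snapshots.
        return sorted(self.snapshots, key=lambda item: item[1], reverse=True)[:k]

Keeping the two levels separate lets transition sampling for off-policy updates proceed independently of snapshot selection for distillation teachers.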

Key words: Soft modularization network, Multi-task reinforcement learning, Priority-weighted knowledge distillation, Hierarchical buffer, Task embedding
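The task embedding mechanism can likewise be sketched. The abstract says task representations are enriched by captured environmental characteristics; a minimal Python illustration follows, assuming a learned per-task code fused additively with an encoded environment-feature vector (all layer sizes, the fusion scheme, and the names are guesses for illustration only).

import torch
import torch.nn as nn

class TaskEmbedding(nn.Module):
    def __init__(self, num_tasks, env_feature_dim, embed_dim=64):
        super().__init__()
        self.task_table = nn.Embedding(num_tasks, embed_dim)  # learned per-task code
        self.env_encoder = nn.Sequential(                     # environment context
            nn.Linear(env_feature_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, task_id, env_features):
        # Fuse discrete task identity with continuous environment features so
        # that related tasks end up with nearby representations.
        return self.task_table(task_id) + self.env_encoder(env_features)

# Usage: condition the policy network on the embedding, e.g.
#   emb = TaskEmbedding(num_tasks=10, env_feature_dim=12)
#   z = emb(torch.tensor([3]), torch.randn(1, 12))  # shape [1, 64]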

CLC number:

  • TP181