Computer Science ›› 2026, Vol. 53 ›› Issue (4): 366-376.doi: 10.11896/jsjkx.250700198

• Artificial Intelligence •

SM-PHT: Robust, Scalable, and Efficient Method for Multi-task Reinforcement Learning

PAN Jiahao, FENG Xiang, YU Huiqun   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received:2025-07-31 Revised:2025-10-23 Online:2026-04-15 Published:2026-04-08
  • About author:PAN Jiahao,born in 2001,postgraduate,is a member of CCF(No.Z1979G).His main research interests include deep learning,reinforcement learning,and multi-task reinforcement learning.
    FENG Xiang,born in 1977,Ph.D,professor,is a member of CCF(No.16665M).Her main research interests include reinforcement learning,distributed swarm intelligence,evolutionary computing,and big data intelligence.
  • Supported by:
    Key Program of National Natural Science Foundation of China(62136003) and National Natural Science Foundation of China(62276097,62372174).

Abstract: In recent years, reinforcement learning (RL) has achieved remarkable success across a variety of domains. However, traditional RL methods often struggle to adapt when facing dynamic environments or multiple tasks. To address this challenge, this paper introduces SM-PHT, a robust, scalable, and efficient method for multi-task reinforcement learning (MTRL). The primary objective of this research is to enhance the adaptability and generalization capabilities of RL agents in multi-task environments by enabling them to learn and transfer knowledge across multiple tasks. SM-PHT integrates three key mechanisms: priority-weighted knowledge distillation (PWKD), a hierarchical buffer, and task embedding. PWKD leverages a weighted distillation process to assimilate knowledge from multiple high-performing models, improving the robustness and stability of the student network. The hierarchical buffer employs dual buffers to store low-level experiential data and high-level model parameters, optimizing offline learning efficiency. Finally, task embedding enriches task representations by capturing detailed environmental characteristics, facilitating effective knowledge transfer. Experiments conducted in the Meta-World environment demonstrate SM-PHT's superior performance compared to state-of-the-art methods. In the MT10 challenge, SM-PHT achieves double the success rate and a 30% increase in average rewards. In the more complex MT50 challenge, it improves the success rate by approximately 10% and increases average rewards by around 10%. These results highlight SM-PHT's ability to handle complex tasks with remarkable stability and minimal fluctuation, making it a promising approach for real-world MTRL applications.
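The abstract describes PWKD as a weighted distillation step that blends knowledge from multiple high-performing teacher models into one student. The paper's exact formulation is not given on this page, so the sketch below is only an illustrative reading of that idea: teacher weights derived from a softmax over teacher returns, a priority-weighted mixture of teacher action distributions, and a KL-divergence distillation loss. All function names and the softmax-over-returns weighting are assumptions, not the authors' definitions.

```python
import numpy as np

def priority_weights(returns, temperature=1.0):
    """Softmax over teacher returns: higher-performing teachers
    receive larger distillation weight (illustrative choice)."""
    r = np.asarray(returns, dtype=float) / temperature
    r -= r.max()  # subtract max for numerical stability
    w = np.exp(r)
    return w / w.sum()

def blended_target(teacher_probs, weights):
    """Priority-weighted mixture of the teachers' action
    distributions for a single state: sum_k w_k * p_k(a)."""
    return np.einsum('k,ka->a', np.asarray(weights), np.asarray(teacher_probs))

def kl_distillation_loss(student_probs, target_probs, eps=1e-8):
    """KL(target || student): the student is pushed toward the
    blended teacher distribution."""
    t = np.asarray(target_probs, dtype=float)
    s = np.asarray(student_probs, dtype=float)
    return float(np.sum(t * (np.log(t + eps) - np.log(s + eps))))
```

In a training loop, `kl_distillation_loss` would be minimized with respect to the student network's parameters while the hierarchical buffer supplies both the stored experience and the teacher parameters; those parts are outside this sketch.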

Key words: Soft modularization network, Multi-task reinforcement learning, Priority-weighted knowledge distillation, Hierarchical buffer, Task embedding

CLC Number: TP181
[1]SILVER D,HUANG A,MADDISON C J,et al.Mastering the game of Go with deep neural networks and tree search[J].Nature,2016,529(7587):484-489.
[2]SILVER D,SCHRITTWIESER J,SIMONYAN K,et al.Mastering the game of go without human knowledge[J].Nature,2017,550(7676):354-359.
[3]SAUNDERS W,SASTRY G,STUHLMUELLER A,et al.Trial without error:Towards safe reinforcement learning via human intervention[J].arXiv:1707.05173,2017.
[4]PENG Z,LI Q,LIU C,et al.Safe driving via expert guided policy optimization[C]//Conference on Robot Learning.PMLR,2022:1554-1563.
[5]LILLICRAP T P.Continuous control with deep reinforcement learning[J].arXiv:1509.02971,2015.
[6]GU S,HOLLY E,LILLICRAP T,et al.Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates[C]//2017 IEEE International Conference on Robotics and Automation(ICRA).IEEE,2017:3389-3396.
[7]YU H,LIANG Y,ZHANG,et al.Terrain-Adaptive Imitation Learning Method Based on Multi-Task Reinforcement Learning[J].Journal of Data Acquisition and Processing,2024,39(5):1182-1191.
[8]YU T,QUILLEN D,HE Z,et al.Meta-world:A benchmark and evaluation for multi-task and meta reinforcement learning[C]//Conference on Robot Learning.PMLR,2020:1094-1100.
[9]LUO Y T,XUE Z C.Multi-Task Assisted Driving Strategy Learning Method for Autonomous Driving[J].Journal of South China University of Technology(Natural Science Edition),2024,52(10):31-40.
[10]ISHIHARA K,KANERVISTO A,MIURA J,et al.Multi-task learning with attention for end-to-end autonomous driving[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2021:2902-2911.
[11]PEREZ-LIEBANA D,LIU J,KHALIFA A,et al.General video game ai:A multitrack framework for evaluating agents,games,and content generation algorithms[J].IEEE Transactions on Games,2019,11(3):195-214.
[12]ZHANG J,GUO B,DING X,et al.An adaptive multi-objective multi-task scheduling method by hierarchical deep reinforcement learning[J].Applied Soft Computing,2024,154:111342.
[13]LIU W,TANG X,ZHAO C.Distractor-aware tracking with multi-task and dynamic feature learning[J].Journal of Circuits,Systems and Computers,2021,30(2):2150031.
[14]RUSU A A,RABINOWITZ N C,DESJARDINS G,et al.Progressive neural networks[J].arXiv:1606.04671,2016.
[15]ALJUNDI R,CHAKRAVARTY P,TUYTELAARS T.Expert gate:Lifelong learning with a network of experts[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:3366-3375.
[16]SHEN S,HOU L,ZHOU Y,et al.Mixture-of-experts meets instruction tuning:A winning combination for large language models[J].arXiv:2305.14705,2023.
[17]ROSENBAUM C,KLINGER T,RIEMER M.Routing networks:Adaptive selection of non-linear functions for multi-task learning[J].arXiv:1711.01239,2017.
[18]YANG R,XU H,WU Y,et al.Multi-task reinforcement learning with soft modularization[J].Advances in Neural Information Processing Systems,2020,33:4767-4777.
[19]HINTON G,VINYALS O,DEAN J.Distilling the knowledge in a neural network[J].arXiv:1503.02531,2015.
[20]KUMARAN D,HASSABIS D,MCCLELLAND J L.What learning systems do intelligent agents need? Complementary learning systems theory updated[J].Trends in Cognitive Sciences,2016,20(7):512-534.