Computer Science ›› 2024, Vol. 51 ›› Issue (11): 81-94. doi: 10.11896/jsjkx.231000170

• Database & Big Data & Data Science •


Advantage Weighted Double Actors-Critics Algorithm Based on Key-Minor Architecture for Policy Distillation

YANG Haolin1, LIU Quan1,2   

  1 School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
    2 Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Received: 2023-10-24 Revised: 2024-03-07 Online: 2024-11-15 Published: 2024-11-06
  • Corresponding author: LIU Quan (quanliu@suda.edu.cn)
  • About author: YANG Haolin (20215227121@stu.suda.edu.cn), born in 1999, postgraduate, is a member of CCF (No.J1794G). His main research interests include offline reinforcement learning and deep reinforcement learning.
    LIU Quan, born in 1969, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.15231S). His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
National Natural Science Foundation of China (62376179, 61772355, 61702055, 61876217, 62176175), Natural Science Foundation of Xinjiang Uygur Autonomous Region, China (2022D01A238) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).


Abstract: Offline reinforcement learning (offline RL) defines the task of learning from a fixed batch of data, which avoids the risks of interacting with the environment and improves the efficiency and stability of learning. The advantage weighted actor-critic algorithm combines sample-efficient dynamic programming with maximum-likelihood policy updates, exploiting large amounts of offline data while quickly performing fine-grained online policy adjustment. However, that algorithm relies on a uniformly random experience replay mechanism, and its actor-critic model uses only a single actor, so data sampling and replay are unbalanced. To address these problems, this paper proposes an advantage weighted double actors-critics algorithm based on policy distillation with data experience optimization and replay (DOR-PDAWAC). DOR-PDAWAC adopts a replay mechanism that prefers new experiences while repeatedly replaying both new and old ones, uses double actors to increase exploration, and applies a key-minor architecture for policy distillation that divides the actors into a key actor and a minor actor to improve performance and efficiency. Ablation and comparative experiments on MuJoCo tasks from the general D4RL benchmark show that DOR-PDAWAC achieves better performance in terms of learning efficiency and other aspects.
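The base update the abstract builds on is the advantage weighted actor-critic step: a maximum-likelihood policy loss on the offline batch, weighted by exponentiated advantage estimates. The following is a minimal PyTorch-style sketch of that step only; the function and module names, the value-baseline choice, and the temperature lam are illustrative assumptions, not the authors' implementation of DOR-PDAWAC.

```python
import torch

def awac_policy_loss(policy, q_net, value_net, states, actions, lam=1.0):
    """Advantage-weighted maximum-likelihood policy loss (AWAC-style sketch).

    `policy(states)` is assumed to return a torch.distributions.Distribution;
    `q_net` and `value_net` are assumed critics estimating Q(s, a) and V(s).
    """
    with torch.no_grad():
        # Advantage A(s, a) = Q(s, a) - V(s), estimated from the fixed batch.
        advantage = q_net(states, actions) - value_net(states)
        # Exponentiated advantages up-weight actions the critic favors;
        # clamping keeps the weights numerically stable.
        weights = torch.clamp(torch.exp(advantage / lam), max=100.0)
    log_prob = policy(states).log_prob(actions)
    # Weighted negative log-likelihood: a supervised-style update on offline
    # data, so no environment interaction is needed during training.
    return -(weights * log_prob).mean()
```

In the double-actor variant the abstract describes, this loss would be computed for each actor, with a distillation term coupling the minor actor to the key actor; that coupling is specific to the paper and is not sketched here.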

Key words: Offline reinforcement learning, Deep reinforcement learning, Policy distillation, Double actors-critics framework, Experience replay mechanism
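The replay mechanism named in the abstract and keywords, preferring new experiences while still replaying old and new ones repeatedly, can be illustrated with a recency-weighted buffer. This is a hedged sketch: the class name, the geometric weighting, and its parameter are assumptions chosen for illustration, not the sampling rule used in DOR-PDAWAC.

```python
import random
from collections import deque

class RecencyBiasedReplayBuffer:
    """Replay buffer that samples newer transitions more often (sketch)."""

    def __init__(self, capacity=100_000, recency_bias=1.0001):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first
        self.recency_bias = recency_bias      # >1 tilts sampling toward new data

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Weight item i (0 = oldest) by recency_bias ** i: newer transitions
        # are drawn more often, while older ones remain reachable, so old
        # and new experiences are both replayed repeatedly.
        weights = [self.recency_bias ** i for i in range(len(self.buffer))]
        return random.choices(self.buffer, weights=weights, k=batch_size)
```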

CLC Number: TP181