Computer Science ›› 2024, Vol. 51 ›› Issue (11): 81-94. doi: 10.11896/jsjkx.231000170
杨皓麟1, 刘全1,2
YANG Haolin1, LIU Quan1,2
Abstract: Offline reinforcement learning (Offline RL) defines the task of learning from a fixed batch of previously collected data, which avoids the risks of interacting with the environment and improves the efficiency and stability of learning. Among such methods, the advantage-weighted actor-critic algorithm combines sample-efficient dynamic programming with maximum-likelihood policy updates, exploiting large amounts of offline data while supporting rapid online fine-tuning of the policy. However, that algorithm relies on uniformly random experience replay and employs only a single actor in its actor-critic model, leaving data sampling and replay unbalanced. To address these problems, this paper proposes an advantage-weighted double actors-critics algorithm based on policy distillation with data experience optimization and replay (DOR-PDAWAC). The proposed algorithm replays both new and old experiences while preferring newer ones, uses double actors to increase exploration, and adopts a policy-distillation-based master-slave framework that divides the actors into a master actor and a slave actor to improve cooperation efficiency. Ablation and comparison experiments on MuJoCo tasks from the general-purpose D4RL benchmark show that the proposed algorithm achieves better learning efficiency and overall performance.
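To make the mechanisms named in the abstract concrete, the following is a minimal Python sketch, not the authors' released code: a replay buffer biased toward newer transitions that still replays old ones, an AWAC-style advantage-weighted maximum-likelihood actor update, and a KL-based master-slave distillation term. All identifiers (RecencyBuffer, awac_actor_loss, distill_slave_to_master, recency_alpha, beta) are illustrative assumptions rather than names taken from the paper.

import numpy as np
import torch

class RecencyBuffer:
    """Replay buffer whose sampling distribution prefers newer transitions
    but keeps nonzero probability on old ones, so new and old experiences
    are both replayed (an assumed realization of the abstract's mechanism)."""

    def __init__(self, capacity, recency_alpha=1.0):
        self.capacity = capacity
        self.alpha = recency_alpha  # 0 -> uniform replay; larger -> stronger preference for new data
        self.storage = []
        self.next_idx = 0

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_idx] = transition
        self.next_idx = (self.next_idx + 1) % self.capacity

    def sample(self, batch_size):
        n = len(self.storage)
        # Age 0 is the most recently written transition; weight decays with age.
        ages = (self.next_idx - 1 - np.arange(n)) % n
        weights = (n - ages).astype(np.float64) ** self.alpha
        probs = weights / weights.sum()
        idx = np.random.choice(n, size=batch_size, p=probs)
        return [self.storage[i] for i in idx]

def awac_actor_loss(actor, critic, states, actions, beta=1.0):
    """Advantage-weighted maximum-likelihood update: log-probabilities of
    dataset actions are weighted by exp(A(s,a)/beta), so the actor imitates
    the data while favouring high-advantage actions."""
    dist = actor(states)  # assumed to return a torch.distributions object
    with torch.no_grad():
        q = critic(states, actions).squeeze(-1)
        v = critic(states, dist.sample()).squeeze(-1)  # one-sample value estimate
        weights = torch.exp((q - v) / beta).clamp(max=20.0)
    return -(weights * dist.log_prob(actions)).mean()

def distill_slave_to_master(master_actor, slave_actor, states):
    """Master-slave policy distillation step (sketch): the slave actor is
    pulled toward the master actor's action distribution via a KL term."""
    master_dist = master_actor(states)
    slave_dist = slave_actor(states)
    return torch.distributions.kl_divergence(slave_dist, master_dist).mean()

In a double-actor setup of this kind, one buffer would feed both actors, each actor would be trained with its own advantage-weighted loss, and the KL term is one plausible way to realize the master-slave knowledge sharing the abstract describes.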