Computer Science ›› 2024, Vol. 51 ›› Issue (9): 265-272. doi: 10.11896/jsjkx.230700151
王天久1, 刘全1,2, 乌兰1
WANG Tianjiu1, LIU Quan1,2, WU Lan1
Abstract: In offline reinforcement learning (Offline RL), the agent does not interact with the environment but instead learns from a fixed dataset, which has become a research hotspot in reinforcement learning. Most current offline RL algorithms apply conservative regularization during policy training so that the learned policy prefers actions that appear in the dataset, thereby mitigating erroneous value estimates for out-of-distribution (OOD) state-action pairs. Conservative Q-learning (CQL) avoids this problem by regularizing the value function to assign low values to OOD state-action pairs. However, because its regularization is overly conservative, in-distribution state-action pairs in the dataset are also assigned low values, making it difficult to steer the trained policy toward in-dataset actions and hence difficult to learn an optimal policy. To address this problem, an uncertainty-weighted conservative Q-learning algorithm (UWCQL) is proposed. The method introduces uncertainty estimation and adds an uncertainty weight to the CQL regularizer, assigning a larger conservative weight to actions with high uncertainty, so that the policy can more reasonably select in-distribution state-action pairs from the dataset. UWCQL is evaluated on the MuJoCo datasets of the D4RL benchmark, and the experimental results show that it achieves better performance, verifying the effectiveness of the algorithm.
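The abstract describes UWCQL only at a high level: the CQL conservative term is reweighted so that actions with higher estimated uncertainty receive a larger conservative penalty. The Python sketch below illustrates one plausible instantiation of that idea, assuming an ensemble of Q-networks whose disagreement serves as the uncertainty estimate; the function name uwcql_penalty, the weighting form 1 + std, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of an uncertainty-weighted CQL-style regularizer, based only
# on the description in the abstract. The ensemble-based uncertainty estimate
# and all names are illustrative assumptions, not the authors' exact method.
import torch


def uwcql_penalty(q_ensemble, states, dataset_actions, policy_actions, alpha=1.0):
    """Conservative penalty: push down Q-values of policy-sampled (possibly OOD)
    actions and keep Q-values of dataset actions high, with a larger weight
    where the Q-ensemble disagrees more."""
    # Q-values of policy-sampled actions, one row per ensemble member: [E, B]
    q_pi = torch.stack([q(states, policy_actions) for q in q_ensemble])
    # Q-values of actions actually observed in the dataset: [E, B]
    q_data = torch.stack([q(states, dataset_actions) for q in q_ensemble])

    # Epistemic uncertainty proxy: standard deviation across ensemble members.
    uncertainty = q_pi.std(dim=0).detach()          # [B]
    weight = 1.0 + uncertainty                      # higher uncertainty -> more conservative

    # Uncertainty-weighted CQL-style gap between OOD and in-dataset Q-values.
    gap = weight * q_pi.mean(dim=0) - q_data.mean(dim=0)   # [B]
    return alpha * gap.mean()
```

In a full training loop, such a penalty would be added to the standard Bellman (TD) error of each Q-network, mirroring how CQL combines its conservative term with ordinary Q-learning updates.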