Computer Science ›› 2024, Vol. 51 ›› Issue (9): 265-272. doi: 10.11896/jsjkx.230700151

• Artificial Intelligence •


Offline Reinforcement Learning Algorithm for Conservative Q-learning Based on Uncertainty Weight

WANG Tianjiu1, LIU Quan1,2, WU Lan1   

    1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Jiangsu Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
  • Received:2023-07-20 Revised:2023-11-23 Online:2024-09-15 Published:2024-09-10
  • Corresponding author: LIU Quan(quanliu@suda.edu.cn)
  • About author:WANG Tianjiu,born in 1999,postgraduate(20214227063@stu.suda.edu.cn).His main research interests include reinforcement learning and offline reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,Ph.D supervisor,is a member of CCF(No.15231S).His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China(61772355,61702055,61876217,62176175),Natural Science Foundation of Xinjiang Uygur Autonomous Region(2022D01A238) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions(PAPD).


Abstract: In offline reinforcement learning (offline RL), the agent learns from a fixed dataset without interacting with the environment; this setting is a current research hot spot in reinforcement learning. Most offline RL algorithms apply a conservative regularization during policy training so that the learned policy prefers actions present in the dataset, thereby mitigating erroneous value estimates for out-of-distribution (OOD) state-action pairs. The conservative Q-learning (CQL) algorithm avoids this problem by regularizing the value function so that OOD state-action pairs receive lower values. However, because this regularization is overly conservative, in-distribution state-action pairs are also assigned low values, so the learned policy struggles to favor the actions contained in the dataset and it is difficult to learn the optimal policy. To address this problem, an uncertainty-weighted conservative Q-learning algorithm (UWCQL) is proposed. UWCQL introduces an uncertainty estimate and adds an uncertainty weight to the CQL regularization term, assigning a higher conservative weight to actions with high uncertainty, so that the policy chooses in-distribution state-action pairs more reliably. UWCQL is evaluated on the D4RL MuJoCo datasets against state-of-the-art offline RL algorithms, and the experimental results show that it achieves better performance, verifying the effectiveness of the algorithm.
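
The abstract describes UWCQL as conservative Q-learning whose regularization term is scaled by a per-action uncertainty weight. The following is a minimal PyTorch-style sketch of such a weighted penalty; the ensemble-disagreement uncertainty estimate, the 1 + uncertainty weighting scheme, and all names (uwcql_penalty, q_ensemble, alpha) are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of an uncertainty-weighted conservative (CQL-style) penalty.
# Assumption: each element of q_ensemble is a module mapping (state, action)
# batches to Q-value tensors of shape [batch]; uncertainty is measured as the
# ensemble's disagreement (standard deviation), which is one common choice.
import torch

def uwcql_penalty(q_ensemble, states, dataset_actions, policy_actions, alpha=5.0):
    # Q-values of actions proposed by the current policy (OOD-prone): [ensemble, batch]
    q_pi = torch.stack([q(states, policy_actions) for q in q_ensemble])
    # Q-values of the actions actually stored in the dataset: [ensemble, batch]
    q_data = torch.stack([q(states, dataset_actions) for q in q_ensemble])

    # Epistemic uncertainty of the policy's actions; detached so the weight
    # acts as a fixed coefficient rather than a trainable quantity.
    uncertainty = q_pi.std(dim=0).detach()      # [batch]
    weight = 1.0 + uncertainty                  # higher uncertainty -> stronger penalty

    # CQL-style gap: push down Q-values of policy actions (more strongly when
    # the ensemble is uncertain about them) and push up Q-values of dataset actions.
    penalty = (weight * q_pi.mean(dim=0)).mean() - q_data.mean()
    return alpha * penalty
```

In a full training loop, a penalty of this kind would be added to the standard Bellman (TD) loss of the Q-networks, as in CQL.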

Key words: Offline reinforcement learning, Deep reinforcement learning, Reinforcement learning, Conservative Q-learning, Uncertainty

CLC Number: 

  • TP181