Computer Science ›› 2024, Vol. 51 ›› Issue (9): 265-272. doi: 10.11896/jsjkx.230700151

• Artificial Intelligence •

Offline Reinforcement Learning Algorithm for Conservative Q-learning Based on Uncertainty Weight

WANG Tianjiu1, LIU Quan1,2, WU Lan1   

  1 School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
    2 Provincial Key Laboratory for Computer Information Processing Technology,Soochow University,Suzhou,Jiangsu 215006,China
  • Received:2023-07-20 Revised:2023-11-23 Online:2024-09-15 Published:2024-09-10
  • About author:WANG Tianjiu,born in 1999,postgraduate.His main research interests include reinforcement learning and offline reinforcement learning.
    LIU Quan,born in 1969,Ph.D,professor,Ph.D supervisor,is a member of CCF(No.15231S).His main research interests include deep reinforcement learning and automated reasoning.
  • Supported by:
    National Natural Science Foundation of China(61772355,61702055,61876217,62176175),Natural Science Foundation of Xinjiang Uygur Autonomous Region(2022D01A238) and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions(PAPD).

Abstract: Offline reinforcement learning,in which the agent learns from a fixed dataset without interacting with the environment,is a current hot spot in the field of reinforcement learning.Many offline reinforcement learning algorithms regularize the value function to force the agent to choose actions contained in the given dataset.The conservative Q-learning(CQL) algorithm addresses the overestimation of out-of-distribution(OOD) state-action pairs by assigning them lower values through value function regularization.However,CQL is too conservative to recognize OOD state-action pairs precisely,which makes it difficult to learn the optimal policy.To address this problem,the uncertainty-weighted conservative Q-learning(UWCQL) algorithm is proposed,which introduces an uncertainty mechanism into training.UWCQL adds an uncertainty weight to the CQL regularization term,assigning a higher conservative weight to actions with higher uncertainty,so that the agent is trained more effectively to choose proper state-action pairs from the dataset.UWCQL is evaluated on the D4RL MuJoCo datasets against state-of-the-art offline reinforcement learning algorithms,and experimental results show that it achieves better performance.
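To make the weighting idea concrete, the sketch below shows one way an uncertainty-weighted conservative penalty could look in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: the Q-ensemble, the use of the ensemble's standard deviation as the uncertainty estimate (in the spirit of deep ensembles, ref.[19]), the softmax weighting, and the names QEnsemble, uwcql_penalty and alpha are all introduced here for illustration.

```python
# Hypothetical sketch of an uncertainty-weighted conservative penalty.
# The weighting scheme and all names are illustrative assumptions,
# not the formulation published in the paper.
import torch
import torch.nn as nn


class QEnsemble(nn.Module):
    """Ensemble of Q-networks; the spread of member predictions serves
    as an epistemic-uncertainty proxy (cf. deep ensembles, ref.[19])."""

    def __init__(self, state_dim, action_dim, n_members=5, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_members)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        qs = torch.stack([m(x).squeeze(-1) for m in self.members])  # (n_members, B)
        return qs.mean(dim=0), qs.std(dim=0)  # Q estimate, uncertainty


def uwcql_penalty(q_net, state, data_action, sampled_actions, alpha=1.0):
    """CQL-style conservative term in which each sampled (possibly OOD)
    action is weighted by the ensemble's uncertainty about it, so that
    highly uncertain actions are pushed down more strongly."""
    q_ood, u_ood = [], []
    for a in sampled_actions:  # e.g. actions drawn uniformly or from the policy
        q, u = q_net(state, a)
        q_ood.append(q)
        u_ood.append(u)
    q_ood = torch.stack(q_ood, dim=-1)  # (B, n_samples)
    u_ood = torch.stack(u_ood, dim=-1)  # (B, n_samples)
    weights = torch.softmax(u_ood, dim=-1)       # higher uncertainty -> larger weight
    pushed_down = (weights * q_ood).sum(dim=-1)  # weighted value of OOD actions
    q_data, _ = q_net(state, data_action)        # value of in-dataset actions
    return alpha * (pushed_down - q_data).mean()


# Illustrative usage with random data (dimensions are arbitrary):
if __name__ == "__main__":
    q_net = QEnsemble(state_dim=17, action_dim=6)
    states = torch.randn(32, 17)
    data_actions = torch.randn(32, 6)
    ood_actions = [torch.rand(32, 6) * 2 - 1 for _ in range(10)]
    penalty = uwcql_penalty(q_net, states, data_actions, ood_actions)
    print(penalty.item())  # added to the Bellman error loss in a full agent
```

In a complete agent this term would be combined with the usual Bellman error objective, as in CQL; only the per-action weighting of the conservative push-down differs from the unweighted CQL regularizer.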

Key words: Offline reinforcement learning, Deep reinforcement learning, Reinforcement learning, Conservative Q-learning, Uncertainty

CLC Number: TP181
[1]LIU Q,ZHAI J W,ZHANG Z Z,et al.A survey on deep reinforcement learning [J].Chinese Journal of Computers,2018,41(1):1-27.
[2]MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning [J].Nature,2015,518(7540):529-533.
[3]LEVINE S,KUMAR A,TUCKER G,et al.Offline reinforcement learning:Tutorial,review,and perspectives on open problems [J].arXiv:2005.01643,2020.
[4]FUJIMOTO S,MEGER D,PRECUP D.Off-policy deep reinforcement learning without exploration[C]//International Conference on Machine Learning.PMLR,2019:2052-2062.
[5]KINGMA D P,WELLING M.Auto-Encoding Variational Bayes [J].arXiv:1312.6114,2014.
[6]KUMAR A,FU J,SOH M,et al.Stabilizing off-policy q-learning via bootstrapping error reduction [J].arXiv:1906.00949,2019.
[7]FUJIMOTO S,HOOF H,MEGER D.Addressing function approximation error in actor-critic methods[C]//International Conference on Machine Learning.PMLR,2018:1587-1596.
[8]FUJIMOTO S,GU S S.A minimalist approach to offline reinforcement learning [J].Advances in Neural Information Processing Systems,2021,34:20132-20145.
[9]KUMAR A,ZHOU A,TUCKER G,et al.Conservative Q-learning for offline reinforcement learning [J].Advances in Neural Information Processing Systems,2020,33:1179-1191.
[10]LYU J,MA X,LI X,et al.Mildly conservative Q-learning for offline reinforcement learning [J].Advances in Neural Information Processing Systems,2022,35:1711-1724.
[11]AGARWAL R,SCHUURMANS D,NOROUZI M.An optimistic perspective on offline reinforcement learning[C]//Proceedings of the 37th International Conference on Machine Learning.2020:104-114.
[12]OSBAND I,BLUNDELL C,PRITZEL A,et al.Deep exploration via bootstrapped DQN[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.2016:4033-4041.
[13]WU Y,ZHAI S,SRIVASTAVA N,et al.Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning[C]//International Conference on Machine Learning.PMLR,2021:11319-11328.
[14]KIDAMBI R,RAJESWARAN A,NETRAPALLI P,et al.MOReL:Model-based offline reinforcement learning [J].Advances in Neural Information Processing Systems,2020,33:21810-21823.
[15]YU T,KUMAR A,RAFAILOV R,et al.COMBO:Conservative offline model-based policy optimization [J].Advances in Neural Information Processing Systems,2021,34:28954-28967.
[16]SUTTON R S,BARTO A G.Reinforcement learning:An introduction [M].MIT Press,2018.
[17]WU Y,TUCKER G,NACHUM O.Behavior regularized offline reinforcement learning [J].arXiv:1911.11361,2019.
[18]GAL Y,GHAHRAMANI Z.Dropout as a Bayesian approximation:Representing model uncertainty in deep learning[C]//International Conference on Machine Learning.PMLR,2016:1050-1059.
[19]LAKSHMINARAYANAN B,PRITZEL A,BLUNDELL C.Simple and scalable predictive uncertainty estimation using deep ensembles[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017:6405-6416.
[20]FU J,KUMAR A,NACHUM O,et al.D4RL:Datasets for deep data-driven reinforcement learning [J].arXiv:2004.07219,2020.
[21]BROCKMAN G,CHEUNG V,PETTERSSON L,et al.OpenAI Gym [J].arXiv:1606.01540,2016.