Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230600235-5. doi: 10.11896/jsjkx.230600235

• Artificial Intelligence •

Weighted Double Q-Learning Algorithm Based on Softmax

ZHONG Yuang, YUAN Weiwei, GUAN Donghai

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Published: 2024-06-06
  • Corresponding author: YUAN Weiwei (yuanweiwei@nuaa.edu.cn)
  • About authors: ZHONG Yuang, born in 2000, postgraduate (zhongyuang666@163.com). His main research interests include reinforcement learning.
    YUAN Weiwei, born in 1981, Ph.D, professor and Ph.D supervisor. Her main research interests include data mining and intelligence computing.

Abstract: As a branch of machine learning, reinforcement learning describes and solves the problem of an agent maximizing its return by learning a policy while interacting with an environment. Q-Learning, a classical model-free reinforcement learning method, suffers from maximization bias caused by overestimation and performs poorly when the rewards in the environment are noisy. Double Q-Learning (DQL) removes the overestimation but in turn introduces underestimation. To address the over- and underestimation in these algorithms, a weighted Q-Learning algorithm based on softmax is proposed and combined with DQL, yielding a new weighted Double Q-Learning algorithm based on softmax (WDQL-Softmax). Built on weighted double estimators, the algorithm applies a softmax operation to the expected values of the samples to obtain weights, and uses these weights to estimate the action values, effectively balancing overestimation and underestimation so that the estimates come closer to the theoretical values. Experimental results show that, in discrete action spaces, WDQL-Softmax converges faster than Q-Learning, DQL and WDQL, and yields a smaller error between the estimated and theoretical values.

Key words: Reinforcement learning, Q-Learning, Double Q-Learning, Softmax
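The weighted update described in the abstract can be sketched in tabular form as follows. This is a minimal illustration, not the paper's exact formulation: the temperature `tau` and the choice to take the softmax over the two estimators' values at the greedy next action are assumptions made for the example (the paper derives its weights from sample expected values).

```python
import numpy as np

def softmax(x, tau=1.0):
    # Numerically stable softmax with temperature tau.
    z = (x - np.max(x)) / tau
    e = np.exp(z)
    return e / e.sum()

def wdql_softmax_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.95, tau=1.0):
    """One tabular update in the spirit of WDQL-Softmax (illustrative only).

    QA, QB: 2-D arrays Q[state, action]. QA is updated toward a target that
    mixes QA and QB at the greedy next action, with the mixing weight given
    by a softmax over the two estimators' values (an assumed form).
    """
    a_star = int(np.argmax(QA[s_next]))  # greedy next action under QA
    w = softmax(np.array([QA[s_next, a_star], QB[s_next, a_star]]), tau)
    target = r + gamma * (w[0] * QA[s_next, a_star] + w[1] * QB[s_next, a_star])
    QA[s, a] += alpha * (target - QA[s, a])
    return QA
```

In a full agent, each transition would update QA or QB with probability 0.5, swapping the roles of the two tables; because the weights sum to one, the bootstrap value always lies between the two estimators' values, which is how the scheme balances over- and underestimation.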

CLC number: TP181