Computer Science (计算机科学), 2024, Vol. 51, Issue 6A: 230600235-5. doi: 10.11896/jsjkx.230600235
钟雨昂, 袁伟伟, 关东海
ZHONG Yuang, YUAN Weiwei, GUAN Donghai
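Abstract: Reinforcement learning, a branch of machine learning, describes and solves the problem of an agent learning a policy that maximizes return through interaction with its environment. Q-Learning, the classic model-free reinforcement learning method, suffers from maximization bias caused by overestimation and performs poorly when rewards in the environment are noisy. Double Q-Learning (DQL) resolves the overestimation problem but introduces underestimation in its place. To address both the overestimation and the underestimation of these algorithms, this paper proposes a softmax-based weighted Q-Learning algorithm and combines it with DQL to obtain a new softmax-based weighted Double Q-Learning algorithm (WDQL-Softmax). Building on the weighted double estimator, the algorithm applies a softmax operation to the sample expected values to obtain weights and uses those weights to estimate action values, which effectively balances overestimation and underestimation and brings the estimates closer to the theoretical values. Experimental results show that, in discrete action spaces, WDQL-Softmax converges faster than Q-Learning, DQL, and WDQL, and that its estimates have a smaller error relative to the theoretical values.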
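The abstract does not give the update rule, so the following is only a minimal sketch of one plausible reading of "softmax weights over sample expected values", not the authors' WDQL-Softmax algorithm. It contrasts three ways of estimating the maximum expected action value from sample means: the single max estimator (as in Q-Learning, prone to overestimation), the double estimator (as in DQL, prone to underestimation), and a softmax-weighted double estimator. The function names, the `temperature` parameter, and the choice to take weights from one estimator and values from the other are all assumptions made for illustration.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def max_estimate(means):
    """Single estimator: the largest sample mean (tends to overestimate)."""
    return max(means)


def double_estimate(means_a, means_b):
    """Double estimator: pick the action with A, evaluate it with B
    (tends to underestimate)."""
    best = max(range(len(means_a)), key=lambda i: means_a[i])
    return means_b[best]


def softmax_weighted_double_estimate(means_a, means_b, temperature=1.0):
    """Assumed reading of the abstract: weights come from a softmax over
    A's sample means, and the estimate is the weighted sum of B's means."""
    weights = softmax(means_a, temperature)
    return sum(w * m for w, m in zip(weights, means_b))


if __name__ == "__main__":
    # Hypothetical sample means for three actions from two independent sample sets.
    means_a = [1.0, 2.0, 3.0]
    means_b = [0.5, 2.5, 2.0]
    print("max:", max_estimate(means_a))
    print("double:", double_estimate(means_a, means_b))
    print("softmax-weighted:", softmax_weighted_double_estimate(means_a, means_b))
```

In this sketch, a very small `temperature` concentrates all weight on the action that A ranks highest, which recovers the double estimator, while a large `temperature` approaches a plain average of B's means. The weighted estimate always lies between the smallest and largest of B's sample means.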