Computer Science ›› 2020, Vol. 47 ›› Issue (7): 179-185.doi: 10.11896/jsjkx.190500143

• Artificial Intelligence •

Improved Speedy Q-learning Algorithm Based on Double Estimator

ZHENG Shuai, LUO Fei, GU Chun-hua, DING Wei-chao, LU Hai-feng   

  1. School of Information Science and Engineering,East China University of Science and Technology,Shanghai 200237,China
  • Received:2019-05-27 Online:2020-07-15 Published:2020-07-16
  • About author:ZHENG Shuai,born in 1994,postgraduate.His main research interests include reinforcement learning.
    LUO Fei,born in 1978,Ph.D,associate professor,is a member of China Computer Federation.His main research interests include distributed computing,cloud computing and reinforcement learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61472139) and Research Project of Education and Teaching Law and Method of East China University of Science and Technology in 2017 (ZH1726107)

Abstract: Q-learning is a classical reinforcement learning algorithm, but its conservative update strategy and its overestimation of action values cause it to converge slowly. Speedy Q-learning and Double Q-learning are two variants of Q-learning that address its slow convergence and its overestimation, respectively. Starting from the Q-value update rule of Speedy Q-learning and the update strategy of Monte Carlo reinforcement learning, an equivalent form of the update rule is derived through theoretical analysis and mathematical proof. This equivalent form shows that Speedy Q-learning takes the current Q-value estimate as the estimate of the historical Q value. Although this improves the agent's overall convergence speed, Speedy Q-learning still suffers from overestimation, which makes it converge slowly at the beginning of the iterations. To overcome this slow start, an improved algorithm named Double Speedy Q-learning is proposed, based on the observation that the double estimator of Double Q-learning can accelerate the agent's convergence. With the double estimator, the selection of the optimal action is separated from the estimation of the maximum Q value, so the learning strategy of Speedy Q-learning in the early iterations is improved and its overall convergence speed increases. Grid-world experiments of different scales, using both linear and polynomial learning rates, compare the initial and overall convergence speed of Q-learning and its variants. The results show that Double Speedy Q-learning converges faster than Speedy Q-learning in the initial iterations, that its overall convergence is significantly faster than that of the comparison algorithms, and that the gap between its actual average reward and the expected reward is the smallest.
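The abstract describes the proposed combination only in prose. As a point of reference, the Speedy Q-learning update of [12] can be written as Q_{k+1}(s,a) = Q_k(s,a) + a_k[T_k Q_{k-1}(s,a) - Q_k(s,a)] + (1 - a_k)[T_k Q_k(s,a) - T_k Q_{k-1}(s,a)], where T_k Q(s,a) = r + gamma * max_a' Q(s',a') is the empirical Bellman operator and a_k = 1/(k+1). The Python sketch below shows one plausible way to replace the max operator with the double estimator of Double Q-learning in a tabular grid world; it is an illustration under stated assumptions, not the paper's exact update. The environment interface (env.reset(), env.step()), the epsilon-greedy behaviour policy and all hyperparameters are assumptions introduced here.

```python
import numpy as np

# Hedged sketch of a "Double Speedy Q-learning" style update for a tabular task.
# Assumptions (not from the paper): env.reset() -> state index,
# env.step(a) -> (next_state, reward, done); linear learning rate a_k = 1/(k+1).
def double_speedy_q_learning(env, n_states, n_actions, gamma=0.95,
                             n_episodes=500, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Two estimators (QA, QB); each keeps a copy of its previous estimate,
    # because the Speedy Q-learning update also uses the preceding Q values.
    QA, QA_prev = np.zeros((n_states, n_actions)), np.zeros((n_states, n_actions))
    QB, QB_prev = np.zeros((n_states, n_actions)), np.zeros((n_states, n_actions))
    k = 0                                      # global step counter for the learning rate
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection on the averaged estimators
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(QA[s] + QB[s]))
            s_next, r, done = env.step(a)
            alpha = 1.0 / (k + 1)              # linear (Speedy-style) learning rate
            # Randomly pick which estimator to update, as in Double Q-learning.
            if rng.random() < 0.5:
                upd, upd_prev, other, other_prev = QA, QA_prev, QB, QB_prev
            else:
                upd, upd_prev, other, other_prev = QB, QB_prev, QA, QA_prev
            if done:
                tk_cur = tk_prev = r           # terminal transition: no bootstrapping
            else:
                # Action chosen by the updated estimator, value read from the other:
                # this separates action selection from value estimation.
                a_cur = int(np.argmax(upd[s_next]))
                a_prev = int(np.argmax(upd_prev[s_next]))
                tk_cur = r + gamma * other[s_next, a_cur]        # operator on current estimate
                tk_prev = r + gamma * other_prev[s_next, a_prev]  # operator on previous estimate
            upd_prev[s, a] = upd[s, a]         # remember the pre-update value for the next visit
            # Speedy Q-learning style update with double-estimator targets
            upd[s, a] = (upd[s, a] + alpha * (tk_prev - upd[s, a])
                         + (1.0 - alpha) * (tk_cur - tk_prev))
            s = s_next
            k += 1
    return (QA + QB) / 2.0                     # average the two estimators as the final Q table
```

The sketch mirrors the idea stated in the abstract: one estimator selects the greedy action while the other supplies its value, which curbs overestimation in the early iterations, and the Speedy-style second term keeps the aggressive learning rate that speeds up overall convergence.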

Key words: Double Q-learning, Q-learning, Reinforcement learning, Speedy Q-learning

CLC Number: TP181
[1]MAO H,ALIZADEH M,MENACHE I,et al.Resource Management with Deep Reinforcement Learning[C]//ACM Workshop on Hot Topics in Networks.ACM,2016:50-56.
[2]BRAFMAN R I,TENNENHOLTZ M.R-max-a general polynomial time algorithm for near-optimal reinforcement learning[J].Journal of Machine Learning Research,2002,3(Oct):213-231.
[3]WATKINS C J C H,DAYAN P.Technical Note:Q-Learning[J].Machine Learning,1992,8(3/4):279-292.
[4]BERTSEKAS D P.Stable optimal control and semicontractive dynamic programming[J].SIAM Journal on Control and Optimization,2018,56(1):231-252.
[5]SCHEFFLER K,YOUNG S.Automatic learning of dialogue strategy using dialogue simulation and reinforcement learning[C]//Proceedings of the second international conference on Human Language Technology Research.San Francisco:Morgan Kaufmann Publishers Inc,2002:12-19.
[6]YANG G S,CHEN E K,AN C W.Mobile robot navigation using neural Q-learning[C]//Proceedings of 2004 International Conference on Machine Learning and Cybernetics.New York:IEEE,2004:48-52.
[7]WANG Y C,USHER J M.Application of reinforcement learning for agent-based production scheduling[J].Engineering Applications of Artificial Intelligence,2005,18(1):73-82.
[8]DJONIN D V,KRISHNAMURTHY V.Q-Learning Algorithms for Constrained Markov Decision Processes With Randomized Monotone Policies:Application to MIMO Transmission Control[J].IEEE Transactions on Signal Processing,2007,55(5):2170-2181.
[9]EVEN-DAR E,MANSOUR Y.Learning rates for Q-learning[J].Journal of Machine Learning Research,2003,5(Dec):1-25.
[10]SZEPESVÁRI C.The asymptotic convergence-rate of Q-learning[C]//Advances in Neural Information Processing Systems.MIT Press,1998:1064-1070.
[11]HASSELT H V.Double Q-learning[C]//Advances in Neural Information Processing Systems.MIT Press,2010:2613-2621.
[12]GHAVAMZADEH M,KAPPEN H J,AZAR M G,et al.Speedy Q-learning[C]//Advances in Neural Information Processing Systems.MIT Press,2011:2411-2419.
[13]LEE D,POWELL W B.An intelligent battery controller using bias-corrected Q-learning[C]//Twenty-Sixth AAAI Conference on Artificial Intelligence.AAAI Press,2012:316-322.
[14]STREHL A L,LI L,WIEWIORA E,et al.PAC model-free reinforcement learning[C]//Proceedings of the 23rd International Conference on Machine Learning.ACM,2006:881-888.
[15]D’ERAMO C,RESTELLI M,NUARA A.Estimating maximum expected value through Gaussian approximation[C]//International Conference on Machine Learning.2016:1032-1040.
[16]LE CUN Y,BENGIO Y,HINTON G.Deep learning[J].Nature,2015,521(7553):436.
[17]VAN HASSELT H,GUEZ A,SILVER D.Deep reinforcement learning with double Q-learning[C]//Thirtieth AAAI Conference on Artificial Intelligence.2016:2094-2100.
[18]ZHANG Z,PAN Z,KOCHENDERFER M J.Weighted Double Q-learning[C]//International Joint Conferences on Artificial Intelligence Organization.2017:3455-3461.
[19]BELLMAN R.Dynamic programming[J].Science,1966,153(3731):34-37.
[20]SUTTON R S,BARTO A G.Reinforcement learning:An introduction[M].MIT Press,1998:107-127.