Computer Science ›› 2020, Vol. 47 ›› Issue (7): 179-185. doi: 10.11896/jsjkx.190500143

• Artificial Intelligence •

Improved Speedy Q-learning Algorithm Based on Double Estimator

ZHENG Shuai, LUO Fei, GU Chun-hua, DING Wei-chao, LU Hai-feng

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2019-05-27 Online: 2020-07-15 Published: 2020-07-16
  • Corresponding author: LUO Fei (luof@ecust.edu.cn)
  • About author: ZHENG Shuai, born in 1994, postgraduate (2503522167@qq.com). His main research interests include reinforcement learning.
    LUO Fei, born in 1978, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include distributed computing, cloud computing and reinforcement learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61472139) and the 2017 Research Project on Education and Teaching Laws and Methods of East China University of Science and Technology (ZH1726107).

Abstract: Q-learning is a classical reinforcement learning algorithm, but its conservative update strategy and its overestimation give it a slow convergence speed. Speedy Q-learning and Double Q-learning are two variants of Q-learning that address its slow convergence and its overestimation, respectively. Starting from the Q-value update rule of Speedy Q-learning and the update strategy of Monte Carlo reinforcement learning, this paper derives an equivalent form of that update rule through theoretical analysis and mathematical proof. The equivalent form shows that Speedy Q-learning uses the estimation function of the current Q value as the estimate of the historical Q value; although this improves the agent's overall convergence speed, the algorithm still suffers from overestimation, which makes it converge slowly in the early iterations. To address this problem, an improved algorithm named Double Speedy Q-learning is proposed, based on the observation that the double estimator in Double Q-learning can improve the agent's convergence speed. By using a double estimator, the selection of the optimal action is separated from the selection of the maximum Q value, which improves the learning strategy of Speedy Q-learning in the early iterations and thus its overall convergence speed. Experiments in grid worlds of different sizes, using both linear and polynomial learning rates, compare the early-stage and overall convergence speed of Q-learning and its improved variants. The results show that Double Speedy Q-learning converges faster than Speedy Q-learning in the early iterations, that its overall convergence is clearly faster than that of the compared algorithms, and that the gap between its actual average reward and the expected reward is the smallest.

Key words: Double Q-learning, Q-learning, Reinforcement learning, Speedy Q-learning
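
For reference, the Speedy Q-learning update rule that the abstract analyses, as given in the original Speedy Q-learning paper (Ghavamzadeh, Kappen, Azar et al., NeurIPS 2011), can be written as follows; the notation is adapted here, and the equivalent form derived in this paper is not reproduced on this page:

% Empirical Bellman operator built from the sampled transition (s, a, r_k, s'_k):
\mathcal{T}_k Q(s,a) = r_k + \gamma \max_{a'} Q(s'_k, a')

% Speedy Q-learning update with step size \alpha_k = 1/(k+1):
Q_{k+1}(s,a) = Q_k(s,a)
  + \alpha_k \bigl(\mathcal{T}_k Q_{k-1}(s,a) - Q_k(s,a)\bigr)
  + (1-\alpha_k)\bigl(\mathcal{T}_k Q_k(s,a) - \mathcal{T}_k Q_{k-1}(s,a)\bigr)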

CLC number: TP181
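
To make the procedure concrete, the minimal Python sketch below shows one way the two ingredients described in the abstract could be combined in a tabular grid-world setting: a Speedy Q-learning update whose bootstrap target selects the greedy action with one estimator and evaluates it with the other, as in Double Q-learning. The environment interface (reset/step), the epsilon-greedy behaviour policy, the per-step snapshot of the previous iterate, and the exact placement of the double estimator inside the target are illustrative assumptions, not the paper's pseudocode.

import numpy as np


def double_speedy_q_learning(env, n_states, n_actions, episodes=500,
                             gamma=0.95, epsilon=0.1, polynomial_rate=False):
    """Tabular sketch of the Double Speedy Q-learning idea described in the abstract.

    The environment is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done) with integer states and actions,
    e.g. a grid world. The interface and hyper-parameters are illustrative.
    """
    rng = np.random.default_rng(0)
    # Two estimators as in Double Q-learning; the *_prev copies keep the previous
    # iterates required by the Speedy Q-learning update.
    qa = np.zeros((n_states, n_actions))
    qb = np.zeros((n_states, n_actions))
    qa_prev, qb_prev = qa.copy(), qb.copy()

    k = 0  # global update counter driving the learning-rate schedule
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy over the combined estimators
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(qa[s] + qb[s]))
            s2, r, done = env.step(a)

            k += 1
            # Linear learning rate alpha_k = 1/(k+1); the polynomial schedule
            # alpha_k = 1/(k+1)^w with w in (1/2, 1) is the other schedule
            # mentioned in the abstract (w = 0.8 is an arbitrary choice here).
            alpha = 1.0 / (k + 1) ** (0.8 if polynomial_rate else 1.0)

            # Update one estimator at random, as in Double Q-learning.
            if rng.random() < 0.5:
                snap = qa.copy()  # current iterate becomes the "previous" one next time
                speedy_double_update(qa, qb, qa_prev, s, a, r, s2, done, alpha, gamma)
                qa_prev = snap
            else:
                snap = qb.copy()
                speedy_double_update(qb, qa, qb_prev, s, a, r, s2, done, alpha, gamma)
                qb_prev = snap
            s = s2
    return qa, qb


def speedy_double_update(q, q_other, q_prev, s, a, r, s2, done, alpha, gamma):
    """One Speedy Q-learning step whose bootstrap target uses the double estimator:
    the greedy action is selected with one table and evaluated with the other."""
    def target(q_select):
        if done:
            return r
        a_star = int(np.argmax(q_select[s2]))   # select the action with one estimator
        return r + gamma * q_other[s2, a_star]  # evaluate it with the other estimator

    t_curr = target(q)       # empirical Bellman target built from the current iterate
    t_prev = target(q_prev)  # the same target built from the previous iterate
    # Speedy Q-learning update:
    # Q_{k+1} = Q_k + alpha_k (T Q_{k-1} - Q_k) + (1 - alpha_k) (T Q_k - T Q_{k-1})
    q[s, a] += alpha * (t_prev - q[s, a]) + (1 - alpha) * (t_curr - t_prev)

Replacing the double-estimator target in speedy_double_update with a plain max over the table being updated recovers a single-estimator Speedy Q-learning baseline, which is the main comparison discussed in the abstract.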