计算机科学 ›› 2020, Vol. 47 ›› Issue (7): 179-185. doi: 10.11896/jsjkx.190500143
郑帅, 罗飞, 顾春华, 丁炜超, 卢海峰
ZHENG Shuai, LUO Fei, GU Chun-hua, DING Wei-chao, LU Hai-feng
Abstract: Q-learning is a classic reinforcement learning algorithm, but its update strategy is conservative and prone to overestimation, so it converges slowly. Speedy Q-learning and Double Q-learning are two variants of Q-learning that address, respectively, its slow convergence and its overestimation problem. Starting from the Q-value update rule of Speedy Q-learning and the update strategy of Monte Carlo reinforcement learning, this paper derives an equivalent form of Speedy Q-learning through theoretical analysis and mathematical proof. This equivalent form shows that, because Speedy Q-learning uses the estimation function of the current Q value as the estimate of the historical Q value, it raises the agent's overall convergence speed but still suffers from overestimation, which slows convergence in the early iterations. To address this problem, and exploiting the fact that the double estimator of Double Q-learning can improve the agent's convergence speed, an improved algorithm named Double Speedy Q-learning is proposed. By means of a double estimator, it decouples the selection of the optimal action from the selection of the maximum Q value, improves the learning strategy of Speedy Q-learning in the early iterations, and thereby raises its overall convergence speed. Experiments in grid worlds of different sizes, using both linear and polynomial learning rates, compare the early-stage and overall convergence speed of Q-learning and its improved variants. The results show that Double Speedy Q-learning converges faster than Speedy Q-learning in the early iterations, its overall convergence speed is clearly faster than that of the compared algorithms, and the gap between its actual average reward and the expected reward is the smallest.
CLC Number:
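The abstract describes the core mechanism of Double Speedy Q-learning (a double estimator that decouples action selection from value estimation, plugged into the Speedy Q-learning update) but does not give the full pseudocode. The Python sketch below is therefore only an illustration under the assumption that the double estimator of Double Q-learning is combined with the Speedy Q-learning update rule in the straightforward way; the function names, table layout, and the linear learning rate alpha_k = 1/(k+1) are illustrative choices, not the authors' implementation.

import random
from collections import defaultdict

def make_table():
    # Tabular Q estimator: Q[state][action] -> float, defaulting to 0.0.
    return defaultdict(lambda: defaultdict(float))

def double_speedy_q_update(QA, QB, QA_prev, QB_prev,
                           s, a, r, s_next, actions, k, gamma=0.95):
    """One hedged-sketch update for transition (s, a, r, s_next).

    QA, QB            : the two current estimators (double estimator)
    QA_prev, QB_prev  : copies of the estimators from the previous iteration,
                        since Speedy Q-learning also bootstraps on the old estimate
    actions           : actions available in s_next
    k                 : iteration counter for the learning rate
    """
    alpha = 1.0 / (k + 1)                      # linear (Speedy-style) learning rate

    # Randomly pick which estimator to update; the other one supplies the value.
    if random.random() < 0.5:
        upd, oth, upd_prev, oth_prev = QA, QB, QA_prev, QB_prev
    else:
        upd, oth, upd_prev, oth_prev = QB, QA, QB_prev, QA_prev

    # Double estimator: the table being updated picks the greedy action,
    # the other table evaluates it, decoupling argmax from max.
    a_star      = max(actions, key=lambda b: upd[s_next][b])
    a_star_prev = max(actions, key=lambda b: upd_prev[s_next][b])
    t_cur  = r + gamma * oth[s_next][a_star]             # empirical Bellman backup on current Q
    t_prev = r + gamma * oth_prev[s_next][a_star_prev]   # same backup on the previous Q

    # Speedy Q-learning update rule applied to the selected estimator.
    upd[s][a] += alpha * (t_prev - upd[s][a]) + (1.0 - alpha) * (t_cur - t_prev)

In a grid-world training loop, QA_prev and QB_prev would hold copies of the two tables from the previous iteration, and the behaviour policy could act, for example, epsilon-greedily on the average of QA and QB; replacing alpha with a polynomial schedule gives the second setting compared in the experiments.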