Computer Science, 2022, Vol. 49, Issue (3): 239-245. doi: 10.11896/jsjkx.201200173
ZHOU Qin, LUO Fei, DING Wei-chao, GU Chun-hua, ZHENG Shuai
Abstract: Q-Learning is currently a mainstream reinforcement learning algorithm, but it converges slowly in stochastic environments. Previous work addressed the overestimation problem of Speedy Q-Learning and proposed the Double Speedy Q-Learning algorithm. However, Double Speedy Q-Learning does not account for the self-loop structure that arises in stochastic environments, i.e., when the agent executes an action there is a non-zero probability of transitioning back into the current state. This hinders the agent's learning in stochastic environments and thus slows the algorithm's convergence. To handle the self-loop structure in Double Speedy Q-Learning, the successive over-relaxation technique is applied to modify the Bellman operator of the Double Speedy Q-Learning algorithm, yielding Double Speedy Q-Learning based on Successive Over Relaxation (DSQL-SOR), which further improves the convergence speed of Double Speedy Q-Learning. Numerical experiments compare the error between the actual reward and the expected reward of DSQL-SOR and other algorithms. The results show that the error of the proposed algorithm is 0.6 lower than that of the mainstream algorithm SQL and 0.5 lower than that of the successive over-relaxation based algorithm GSQL, indicating that DSQL-SOR outperforms the other algorithms. The scalability of DSQL-SOR is also tested: as the state space grows from 10 to 1 000, the average time per iteration increases slowly and stays on the order of 10^-4, showing that DSQL-SOR scales well.
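The abstract describes DSQL-SOR only at a high level; the exact update rule appears in the paper body, which is not reproduced here. As a rough illustration of how a successive over-relaxation factor can be folded into a tabular Q-learning update to counteract self-loop transitions, the following Python sketch follows the general SOR Q-learning idea of reference [21], not the paper's own DSQL-SOR rule; the function name sor_q_update and the hyperparameters alpha, gamma and w are illustrative assumptions.

import numpy as np

def sor_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95, w=1.2):
    # Standard sampled Bellman target from the observed transition.
    bellman_target = r + gamma * np.max(Q[s_next])
    # SOR-weighted target: w >= 1 shifts weight away from the current
    # state's own greedy value, damping the slowdown caused by
    # self-loop transitions (s_next == s).
    sor_target = w * bellman_target + (1.0 - w) * np.max(Q[s])
    # Relaxed temporal-difference update of the visited entry.
    Q[s, a] += alpha * (sor_target - Q[s, a])
    return Q

# Toy usage: a 5-state, 2-action problem with random transitions that
# may self-loop; reward 1 is given only for landing in state 4.
rng = np.random.default_rng(0)
Q = np.zeros((5, 2))
s = 0
for _ in range(1000):
    a = int(rng.integers(2))          # random behaviour policy
    s_next = int(rng.integers(5))     # stochastic next state (may equal s)
    r = 1.0 if s_next == 4 else 0.0
    Q = sor_q_update(Q, s, a, r, s_next)
    s = s_next

With w = 1 the update reduces to ordinary Q-learning; values of w slightly above 1 correspond to the over-relaxation regime discussed in references [20-21], and the admissible range of w depends on the self-loop probabilities of the environment.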
CLC Number:
References:
[1] XIONG H, DIAO X. Safety robustness of reinforcement learning policies: A view from robust control[J]. Neurocomputing, 2021, 422: 12-21.
[2] WATKINS C, DAYAN P. Technical note: Q-learning[J]. Machine Learning, 1992, 8(3/4): 279-292.
[3] SONG Z, HOFMANN H, LI J, et al. Optimization for a hybrid energy storage system in electric vehicles using dynamic programing approach[J]. Applied Energy, 2015, 139: 151-162.
[4] DANN C, NEUMANN G, PETERS J. Policy evaluation with temporal differences: A survey and comparison[J]. Journal of Machine Learning Research, 2014, 15(1): 809-883.
[5] SCHILPEROORT J, MAK I, DRUGAN M, et al. Learning to Play Pac-Xon with Q-Learning and Two Double Q-Learning Variants[C]//IEEE Symposium Series on Computational Intelligence (SSCI). Bangalore, India, 2018: 1151-1158.
[6] GHESHLAGHI AZAR M, MUNOS R, GHAVAMZADEH M, et al. Speedy Q-learning[C]//Advances in Neural Information Processing Systems (NIPS). Granada, Spain, 2011: 2411-2419.
[7] VAN HASSELT H. Double Q-learning[C]//Advances in Neural Information Processing Systems (NIPS). Canada, 2010: 2613-2621.
[8] ZHENG S, LUO F, GU C H, et al. Double Speedy Q Learning[J]. Computer Science, 2020, 47(7): 179-185.
[9] HADJIDIMOS A. Successive overrelaxation (SOR) and related methods[J]. Journal of Computational and Applied Mathematics, 2000, 123(1): 177-199.
[10] SAITO M, MASUDA K, KURIHARA K. A flexible Q-relearning method to accelerate learning under the change of environments by reusing a portion of useful policies[C]//Society of Instrument and Control Engineers (SICE). Akita, Japan, 2012: 1223-1227.
[11] JANG B, KIM M, HARERIMANA G, et al. Q-Learning Algorithms: A Comprehensive Classification and Applications[J]. IEEE Access, 2019, 7: 133653-133667.
[12] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Playing Atari with Deep Reinforcement Learning[J]. arXiv:1312.5602, 2013.
[13] WANG F, ZHANG J J, ZHENG X, et al. Where does AlphaGo go: From Church-Turing thesis to AlphaGo thesis and beyond[J]. IEEE/CAA Journal of Automatica Sinica, 2016, 3(2): 113-120.
[14] PENG J, WILLIAMS R J. Incremental multi-step Q-learning[J]. Machine Learning, 1996, 22(1/2/3): 283.
[15] HU J, WELLMAN M P. Nash Q-learning for general-sum stochastic games[J]. Journal of Machine Learning Research, 2004, 4(6): 1039-1069.
[16] FANG Z, WANG J, JIANG C, et al. QLACO: Q-learning Aided Ant Colony Routing Protocol for Underwater Acoustic Sensor Networks[C]//IEEE Wireless Communications and Networking Conference (WCNC). Seoul, Korea, 2020: 1-6.
[17] CUI R, WEI M. Research and application of Successive Over-Relaxation Iterative Algorithm[C]//Proceedings of 2011 International Conference on Electronic & Mechanical Engineering and Information Technology. 2011: 3856-3858.
[18] REETZ D. Solution of a Markovian decision problem by successive overrelaxation[J]. Zeitschrift für Operations Research, 1973, 17(1): 29-32.
[19] JAAKKOLA T, JORDAN M I, SINGH S P. Convergence of Stochastic Iterative Dynamic Programming Algorithms[J]. Neural Computation, 1994, 6: 1185-1201.
[20] JOHN I, KAMANCHI C, BHATNAGAR S. Generalized Speedy Q-Learning[J]. IEEE Control Systems Letters, 2020, 4(3): 524-529.
[21] KAMANCHI C, DIDDIGI R B, BHATNAGAR S. Successive Over-Relaxation Q-Learning[J]. IEEE Control Systems Letters, 2020, 4(1): 55-60.