Computer Science, 2022, Vol. 49, Issue (3): 239-245. doi: 10.11896/jsjkx.201200173

• Artificial Intelligence •

Double Speedy Q-Learning Algorithm Based on the Successive Over-Relaxation Technique

ZHOU Qin, LUO Fei, DING Wei-chao, GU Chun-hua, ZHENG Shuai

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2020-12-20  Revised: 2021-05-09  Online: 2022-03-15  Published: 2022-03-15
  • Corresponding author: LUO Fei (luof@ecust.edu.cn)
  • About author: ZHOU Qin (zhouqin0822@163.com)
  • Supported by:
    National Natural Science Foundation of China (61472139) and Shanghai Automotive Industry Science and Technology Development Foundation Industry-University-Research Project (1915).

Double Speedy Q-Learning Based on Successive Over Relaxation

ZHOU Qin, LUO Fei, DING Wei-chao, GU Chun-hua, ZHENG Shuai   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2020-12-20  Revised: 2021-05-09  Online: 2022-03-15  Published: 2022-03-15
  • About author: ZHOU Qin, born in 1996, postgraduate. Her main research interests include reinforcement learning.
    LUO Fei, born in 1978, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include distributed computing, cloud computing and reinforcement learning.
  • Supported by:
    National Natural Science Foundation of China (61472139) and Shanghai Automotive Industry Science and Technology Development Foundation Industry-University-Research Project (1915).

Abstract: Q-Learning is currently a mainstream reinforcement learning algorithm, but its convergence speed is poor in stochastic environments. Previous work addressed the overestimation problem of Speedy Q-Learning and proposed the Double Speedy Q-Learning algorithm. However, Double Speedy Q-Learning does not take into account the self-loop structures present in stochastic environments, i.e., the probability that executing an action returns the agent to its current state; this hampers the agent's learning in such environments and slows the algorithm's convergence. To address the self-loop structures ignored by Double Speedy Q-Learning, the successive over-relaxation technique is used to modify the Bellman operator of the Double Speedy Q-Learning algorithm, and the Double Speedy Q-Learning algorithm based on successive over-relaxation (DSQL-SOR) is proposed, further improving the convergence speed of Double Speedy Q-Learning. Numerical experiments compare the error between the actual and expected rewards of DSQL-SOR and other algorithms. The results show that the error of the proposed algorithm is 0.6 lower than that of the mainstream SQL algorithm and 0.5 lower than that of the successive over-relaxation algorithm GSQL, indicating that DSQL-SOR outperforms the other algorithms. The experiments also test the scalability of DSQL-SOR: as the state space grows from 10 to 1 000, the average time per iteration increases slowly and remains on the order of 10^-4, showing that DSQL-SOR scales well.
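For context, the display below is a minimal sketch of how successive over-relaxation is typically folded into the Bellman optimality operator in the SOR Q-Learning line of work (Kamanchi et al.); the exact operator and step sizes used by DSQL-SOR may differ in detail. The self-loop probability p(s|s,a) is what bounds the admissible relaxation factor w:

\[
T_w Q(s,a) = w\Big[r(s,a) + \gamma \sum_{s'} p(s'\mid s,a)\,\max_{a'} Q(s',a')\Big] + (1-w)\max_{a'} Q(s,a'),
\qquad
w \in \big(0,\, w^*\big],\quad w^* = \frac{1}{1-\gamma\,\min_{s,a} p(s\mid s,a)} .
\]

Setting w = 1 recovers the standard Bellman operator; when every state-action pair has a positive self-loop probability, w^* > 1 and T_w contracts with modulus 1 - w(1-\gamma) < \gamma, which is the mechanism by which SOR-based variants such as GSQL and DSQL-SOR can converge faster.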

Key words: Q-Learning, Markov decision process, Reinforcement learning, Successive over-relaxation iteration method, Self-loop structure

Abstract: Q-Learning is a mainstream reinforcement learning algorithm at present, but its convergence speed is poor in stochastic environments. Previous studies improved on the overestimation problem of Speedy Q-Learning and proposed the Double Speedy Q-Learning algorithm. However, the Double Speedy Q-Learning algorithm does not consider the self-loop structure existing in stochastic environments, that is, the probability of re-entering the current state when the agent performs an action, which is not conducive to the agent's learning in such environments and thereby affects the convergence speed of the algorithm. Aiming at the self-loop structure ignored by Double Speedy Q-Learning, the Bellman operator of the Double Speedy Q-Learning algorithm is improved by using the successive over-relaxation technique, and the Double Speedy Q-Learning algorithm based on successive over-relaxation (DSQL-SOR) is proposed to further improve the convergence speed of the Double Speedy Q-Learning algorithm. Numerical experiments compare the error between the actual rewards and expected rewards of DSQL-SOR and other algorithms. The results show that the error of the proposed algorithm is 0.6 lower than that of the existing mainstream algorithm SQL and 0.5 lower than that of the successive over-relaxation algorithm GSQL, indicating that the performance of DSQL-SOR is better than that of the other algorithms. The experiments also test the scalability of the DSQL-SOR algorithm: when the state space is increased from 10 to 1 000, the average time per iteration increases slowly, always remaining on the order of 10^-4, indicating that DSQL-SOR has strong scalability.
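As an illustration only, and not the authors' published implementation, the following minimal Python sketch shows one way a relaxation factor w could be combined with a tabular, double-estimator Q-learning update; the function name, array layout, and update schedule are assumptions made for the example.

    import numpy as np

    def sor_double_q_update(QA, QB, s, a, r, s_next, alpha, gamma, w):
        # Illustrative SOR-style update of estimator QA, evaluated with QB.
        # QA, QB: float arrays of shape (n_states, n_actions); w = 1.0 recovers
        # the ordinary double Q-learning target.
        a_star = np.argmax(QA[s_next])          # action selected by QA at the next state
        # The SOR target mixes the bootstrapped next-state value with the value of the
        # current state, weighted by w; a positive self-loop probability at s is what
        # makes choosing w > 1 admissible.
        target = w * (r + gamma * QB[s_next, a_star]) + (1.0 - w) * np.max(QB[s])
        QA[s, a] += alpha * (target - QA[s, a])
        return QA

In a complete agent the roles of QA and QB would be swapped at random on each step, and w would be kept inside the admissible range determined by the smallest self-loop probability, as in the operator sketched above.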

Key words: Markov decision process (MDP), Q-Learning, Reinforcement learning, Self-loop structure, Successive over-relaxation (SOR)

CLC Number:

  • TP181